populse / populse_mia

Multiparametric Image Analysis

[database management] data loading very slow for big project #308

Closed manuegrx closed 11 months ago

manuegrx commented 1 year ago

When a Mia project contains many scans (from approximately 50 scans onward), loading the database takes a very long time (for a really big project it can be several minutes).

It happens when you open the project, when you click on the Data Browser tab, or when you click on the filter button to add a scan to a process (essentially, every time the "Filling the cells" window appears).

It is also very slow when you do a search (with the filter) in the Data Browser tab or when you select some data there.

It is quite annoying when handling large projects. Maybe it is possible to find a way to speed up loading.

servoz commented 1 year ago

Yes, we have this problem, but we've ignored it until now, giving priority to the development of other features.

Unless I'm mistaken, this slowness is not linked to the database itself (populse_db) but to the GUI that displays the database items (and which is regenerated each time, which is obviously sub-optimal).

Clearly, this slowness will not be compatible with the use of Mia by a large number of users. We're going to have to look into this problem and resolve it, I think fairly quickly.

Honestly, I've got other priorities before this one. But it's clear that we're going to have to work on it ASAP.

Thanks for opening this ticket; it will help us keep this issue in mind.

sapetnioc commented 1 year ago

Solving this issue requires knowledge of both populse_db and Mia. As I know what performance populse_db is capable of, I would bet that the problem comes from the way Mia uses populse_db. It is possible that the code performs too many requests. 50 subjects is a very small number; loading should be immediate.

I can help on the populse_db side, but I need to work with someone who knows the Mia side. And if it turns out that I am wrong and populse_db really is the problem, I will not take it personally and will be ready to help find an alternative solution.

servoz commented 1 year ago

Yes, we'll certainly need your help with the populse_db part.

I've known about this slowness problem for a long time, and I confess I always thought we would have to deal with it later, given the number of things we had to sort out first. But it's certain that if we want a wide audience to use Mia, the DataBrowser will have to be faster!

I'm pretty sure the problem isn't with populse_db, but with populse_mia. But I could be wrong.

Honestly, I don't have the time to look into this problem right now. But as soon as we can, I'd be happy to work with you to fix it, @sapetnioc !

servoz commented 1 year ago

Perhaps, when you have some time @manuegrx, could you please give a minimal case to reproduce? That would give us a framework to start working on this ticket when we have some time...

There will certainly be the problem of sharing the database. We could put it somewhere (the CNRS cloud, which accepts fairly large files?) so that we can all retrieve it and work on the same data. To do this, you could compress a project and upload it (I suggest the CNRS cloud, but there are certainly other solutions).

What's certain is that this ticket is very important because, I repeat, even if Mia is a beautiful object, it will never be used if the DataBrowser is too slow...

manuegrx commented 1 year ago

Maybe we can use an open dataset on OpenNeuro? I can find one and test it a little in Mia to check that I get the same issue.

servoz commented 1 year ago

Sure! Good idea. That way, we won't have to worry about data storage. However, it would be nice if you could also give a minimal case to reproduce!

sapetnioc commented 1 year ago

It would help a lot indeed. With the ability to reproduce the problem, I could start to investigate on my side. If the goal is only to share data between us, we could also use the CATI sftp for temporary storage/transfer. But working with OpenNeuro could be a better long-term option, to have something to propose to people who want to test Mia. The choice is yours.

manuegrx commented 1 year ago

You can find here a Mia project with data from the MR-ART database (https://openneuro.org/datasets/ds004332/versions/1.1.0). It is a FileSender folder valid only until July 22, so it may be better to add it to the CATI sftp if you want to work on it later; I will check if my account is still working.

I ran the mriqc pipeline on some of the subjects in order to have a bigger Mia database. This database is quite large (18 GB, around 50 scans in raw data and around 2,000 elements in derived_data), but it is smaller than others I'm working on (up to 20,000 elements in derived_data), so the slowness is less pronounced.

If it is too big, let me know and I will remove some of the processing (however, in that case it will be harder to see the issue).

Here are several ways to reproduce the problem (long database loading time):

  1. Unzip the MRART_test folder in the Mia project folder. Open the project in Mia (File --> Open a project, and select the MRART_test project). It will take some time to display all the scans in the database.

  2. Go to the Pipeline Manager tab and then go back to the Data Browser tab: a window with "Filling the cells" appears, and it again takes a long time.

  3. Go to the Pipeline Manager, select the Smooth brick (in mia_processes/bricks/preprocess/spm), right-click on the brick and select "Export all unconnected plugs". Click on the "Filter" button to choose in_files. A window with "Filling the cells" appears, and it takes a long time to display the window to select the data.

  4. In the Data Browser tab, try to select several scans (at least 200 in this project; for bigger projects even selecting 10 scans takes a long time), then right-click and select "Send documents to the Pipeline Manager". A window with "Filling the cells" appears, and then a window to confirm the selection appears. Each step also takes a long time.

Two other behaviours that may be related (or not):

  1. In the Data Browser, use the search bar to select only some scans. It takes a long time when the database is big.
  2. In the Data Browser, in the "Tags" menu, select "Add tag" and try to add a new test tag. It is also quite slow on a big database.

@sapetnioc, if you need help using Mia, let me know!

servoz commented 1 year ago

Given what you're describing, the issue must have more to do with the way we do the display in the DataBrowser than with a problem in populse_db.

I think we're going to have to take a closer look at the DataBrowser's GUI work. What do you think, @sapetnioc? Can you check whether the duration is linked to interaction with populse_db (I'd be very surprised, but why not)?

servoz commented 12 months ago

Can you check whether the duration is linked to interaction with populse_db (I'd be very surprised, but why not)?

@sapetnioc ?

sapetnioc commented 11 months ago

Ok, I am a little bit late on this one. I have already downloaded the data. I will try to follow @manuegrx's procedure and see if I can identify a point where I could start a profiler.

sapetnioc commented 11 months ago

I probably made a mistake while downloading the data. I have a huge "data" folder that looks like a BIDS structure, but no project file. Could you help me get it, @manuegrx? Sorry.

manuegrx commented 11 months ago

@sapetnioc Sorry, it seems that the link in my previous comment was not updated as I thought (it was the link to the dataset on OpenNeuro and not the link to a FileSender folder with a Mia project).

Here is a link to a Mia project: https://filesender.renater.fr/?s=download&token=293fa327-dcce-487b-a266-0871ec817d3e It is quite big (around 50 GB unzipped), so if it is too big, let me know and I will remove some data from the project.

sapetnioc commented 11 months ago

I have been able to test and to start profiling. My very first test was to profile the fill_cells_update_table function, which is called when going back to the data management tab. A first look (see attached image) shows that most of the time is spent in two builtin functions, setValue() and processEvents(). These are Qt functions, and they are called almost 100,000 times each. It should be easy to reduce this amount.

I did not look at the rest yet. To be continued.

[attached image: SnakeViz profile of fill_cells_update_table]
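
A minimal sketch of the kind of reduction meant here (illustrative names only, assuming PyQt5; this is not the actual populse_mia code): update the progress bar every N cells instead of on each one.

    from PyQt5.QtWidgets import QApplication

    def fill_cells(cells, fill_one_cell, progress_bar, every=100):
        """Fill all cells, touching the progress bar only every `every` steps."""
        progress_bar.setMaximum(len(cells))
        for i, cell in enumerate(cells):
            fill_one_cell(cell)  # hypothetical per-cell work
            if i % every == 0:
                progress_bar.setValue(i)      # ~100,000 calls become ~1,000
                QApplication.processEvents()  # keep the GUI responsive
        progress_bar.setValue(len(cells))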

sapetnioc commented 11 months ago

By the way, here is how I got this result:

1) Identify a piece of code to profile. In this case, I started by looking for the GUI message `Please wait while the cells are being filled`. It allowed me to quickly find the `fill_cells_update_table` function.

2) Use the profiler to record the time spent in this function (and all sub-functions). I replaced the call in `main_windows.py` line 1704 with:

            from cProfile import Profile

            # Profile the call and dump the stats to a file for later analysis
            with Profile() as profile:
                self.data_browser.table_data.fill_cells_update_table()
                profile.dump_stats('/tmp/fill_cells_update_table.profile')

3) Analyse the result. I just installed SnakeViz and used:

    snakeviz /tmp/fill_cells_update_table.profile

servoz commented 11 months ago

By the way, here is how I got this result: [...]

Thanks for the tip! This is indeed very useful.

servoz commented 11 months ago

I have been able to test and to start profiling. My very first test was to profile the fill_cells_update_table function, which is called when going back to the data management tab. A first look (see attached image) shows that most of the time is spent in two builtin functions, setValue() and processEvents(). These are Qt functions, and they are called almost 100,000 times each. It should be easy to reduce this amount.

92.75% of the time for these two Qt functions... I think you just found a big part of the problem. It seems that the issue doesn't come from populse_db but rather from the way we fill in the table cells...

sapetnioc commented 11 months ago

I finally identified the first problem and found a workaround. Adding widgets to the grid (the buttons) makes Qt slow to recompute the layout of the whole grid. Because of the progress bar, processEvents() is called many times, and each time a complete updated layout of the grid is computed. To avoid this grid update while keeping the progress bar, I simply hide the grid during the update, calling self.setVisible(False) at the beginning of fill_cells_update_table() and self.setVisible(True) at the end. This is a bit strange visually for the user, but much faster. Now we have a more acceptable profiling picture: populse_db now accounts for 26.86% of the update time.

[attached image: updated SnakeViz profile]
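
A minimal sketch of the workaround, assuming a QTableWidget-based grid (illustrative names, not the actual populse_mia code):

    from PyQt5.QtWidgets import QApplication, QTableWidget, QTableWidgetItem

    def fill_cells_update_table(table, rows):
        # Hiding the widget stops Qt from recomputing the full grid layout
        # on every insertion triggered by processEvents().
        table.setVisible(False)
        try:
            for row, values in enumerate(rows):
                for col, value in enumerate(values):
                    table.setItem(row, col, QTableWidgetItem(str(value)))
                QApplication.processEvents()  # keeps the progress bar alive
        finally:
            table.setVisible(True)  # a single full relayout at the end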

servoz commented 11 months ago

This is a bit strange visually for the user

I'm curious to see how it looks :-) If I understand correctly, your fix goes from 20.7 s to 1.69 s with the same data??? Wahoo, that's great...

sapetnioc commented 11 months ago

I'll let you decide if the ticket is ready to be closed. On my laptop, I get more than a 90% speed-up.

sapetnioc commented 11 months ago

For the data management tab, I think it is OK visually. You see white space instead of the grid; wait a little after the progress bar is gone, and the grid appears.

For the filter button in the pipeline view, it is a bit stranger. For me, the window was first displayed with a wrong layout (probably because the grid was empty) and then grew to its final size with the filled grid.

sapetnioc commented 11 months ago

And for your information, during my tests it was much faster when no widget (i.e. no button) was added to the grid. So it may be possible to save more time by using a simpler grid with some kind of clickable text instead of a button; see the sketch below.
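
As a minimal illustration of the clickable-text idea (plain QTableWidget, hypothetical example, not existing populse_mia code): cheap text items plus a single cellClicked connection replace one QPushButton per cell.

    from PyQt5.QtWidgets import QApplication, QTableWidget, QTableWidgetItem

    app = QApplication([])
    table = QTableWidget(3, 1)
    for row in range(3):
        # Plain items are cheap; no per-cell widget is created.
        table.setItem(row, 0, QTableWidgetItem("open scan %d" % row))

    # One signal connection replaces thousands of buttons.
    table.cellClicked.connect(lambda row, col: print("clicked", row, col))
    table.show()
    app.exec_()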

For very big datasets, it could be interesting not to put them all in the GUI, using pagination for instance; see the sketch below.
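
A purely illustrative sketch of the pagination idea (hypothetical helper, not existing populse_mia or populse_db code): the GUI only ever receives one fixed-size page of scans.

    PAGE_SIZE = 200

    def scans_for_page(all_scans, page_index, page_size=PAGE_SIZE):
        """Return only the slice of scans that one page should display."""
        start = page_index * page_size
        return all_scans[start:start + page_size]

    # The grid is filled with at most 200 rows, however big the project is.
    first_page = scans_for_page(list(range(20000)), page_index=0)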

servoz commented 11 months ago

Oh, that's great, I've just tested it and it has indeed accelerated dramatically! I can't see the strange display because I don't have a large enough database at hand... with your fix it now goes too fast and I don't have time to see anything... :-)))

Thank you very much @sapetnioc for this work.

I'll let @manuegrx decide if we can close this ticket (I don't see the strange display with my projects...).

I think that as far as acceleration is concerned, the objective has been achieved for now.

servoz commented 11 months ago

For very big datasets, it could be interesting not to put them all in the GUI. Using pagination for instance.

Good idea to keep in mind!

manuegrx commented 11 months ago

I've tested the solution and it's much faster than before! Thanks! I observed the same issue as you for the data management tab and for the filter button in the pipeline view (wrong layout for the window at first). It is a little weird, but I think we can live with it!

File selection (at least 200 files in the project used for the test) is still quite slow. Do you think it can be improved, or is it a totally different problem?

sapetnioc commented 11 months ago

It is a different problem, but I think it can still be improved. From what I saw, I believe we cannot gain much on the 25% of time spent in populse_db. However, the remaining 75% is probably GUI only, so it should be possible to accelerate further, but it could come at the cost of user experience. For instance, using pagination can drastically limit what is displayed (populse_db allows pagination, if my memory is not too bad). But, as a user, I am not fond of pagination and would rather wait than page through something I will rarely use. As I said, not using a button in the grid may be faster (to be tested) but less sexy.

manuegrx commented 11 months ago

As the main issue is solved, I propose to close this ticket; we will see with future users whether the processing and scan selection need to be sped up further!

(Feel free to reopen this ticket if you want to work on the GUI side.)