populse / populse_mia

Multiparametric Image Analysis

Opening an existing project is (overly) time consuming #49

Closed servoz closed 3 years ago

servoz commented 5 years ago

Not yet investigated, but opening a project with little data (15 .nii files), generated from 2 input .nii files and a not very complex pipeline (just 1 segment + 1 realign + 2 normalise + 1 coregister + 1 smooth; all SPM; the irmage_spacial_preprocessing_1 pipeline), takes 16 s!!! I do not think this is related to the pipeline, but this is just to give enough details for when this issue is investigated!

servoz commented 5 years ago

I am reopening this issue because it seems it was closed 2 days ago as a side effect of the history rewrite I made in order to delete from the history the large files now tracked by git-lfs. My first investigation seems to indicate that following this operation some issues were closed (which I will reopen after some checks). The integrity of the populse_mia project does not seem to have changed (I think the SHA-1 hashes of the commits were changed during the operation, which could explain the observed side effect!)

denisri commented 4 years ago

I had a look (well, more than a look actually: I spent the full morning on it). In populse_mia and populse_db there are many, many requests for very, very small elements in the database. That is not how a database is meant to be used: in the data browser, when building the data table, each field of each data row is retrieved via database.session.get_value(). This function queries the database for every tag of every data row. More precisely, it performs the following:
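Roughly, the pattern boils down to this sketch (an approximation of the access pattern, not the exact MIA code; the get_value(collection, document, field) signature is the populse_db v1-style accessor, and the table API is made up for the example):

```python
def fill_table_cell_by_cell(session, collection, documents, tags, table):
    """Anti-pattern described above: one database request per table cell."""
    for row, document in enumerate(documents):   # one row per data file
        for col, tag in enumerate(tags):         # one column per tag
            # Every call below is a separate round trip to the database,
            # so ~15 documents x ~150 tags already means thousands of queries.
            value = session.get_value(collection, document, tag)
            table.set_cell(row, col, value)
```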

I could reduce the requests in the data browser (in fill_cells_update_table()) to cut about 1100 requests in this function alone, and largely improve the speed of this step, but there are many other such spots around this function. To do so, we perform a single (large) request, but then we have to reach into many populse_db low-level internals and even SQLAlchemy objects, because the API simply does not provide methods for non-elementary requests. Either I have not understood it, or the populse_db API only allows requesting single values, with heavy (inefficient) wrapping around them. I think this is not the correct use of a database: we must be able to get full tables from requests, otherwise I dare say populse_db is useless. I am speaking about populse_db v1 at least, but v2 is used the same way right now anyway, and since it does not provide a cache it may be even slower in MIA.

sapetnioc commented 4 years ago

In populse_db v2 there is no more document caching because, in my opinion, there is no generic way to build a cache without recreating a database system. However, there are some internal "caches" for the collections and fields (for instance, only Python dicts are involved in retrieving information about fields). This can be done because this metadata is never large. It speeds up some often-used methods of the API, and may speed up MIA by avoiding some SQL calls.
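A minimal sketch of the idea (illustrative only, not the actual populse_db v2 code; the get_fields() accessor and its return type are assumptions):

```python
class FieldMetadataCache:
    """Field definitions are small and rarely change, so they can be
    loaded once and served from a plain Python dict afterwards,
    avoiding any further SQL for metadata lookups."""

    def __init__(self, session, collection):
        # Single query at initialization time...
        self._fields = {f.field_name: f
                        for f in session.get_fields(collection)}

    def field_type(self, name):
        # ...then pure dict lookups, no SQL involved.
        return self._fields[name].field_type

    def field_names(self):
        return list(self._fields)
```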

If necessary, it is possible to reintroduce a caching system for values, but probably in a derived Database class, and with a well-defined use case for which to optimize the caching system.

There is no major API change in populse_db v2. It is possible to work either on a single field value or on a single document, but this is not the way to go in order to work on many documents. Put simply, there is only one method to retrieve several documents: filter_documents(). All other methods that iterate over many documents or values (e.g. get_documents(), get_document_names(), etc.) go through the same selection method. This method is based on a single SQL SELECT query whose rows are iterated with a single for loop, and each row is returned as a document. A document is an instance of a collection-specific class containing the field names and their corresponding indices (cached), constructed from the row data. This allows values to be accessed either via an integer index or via a field name. Python generators are used to return each document, so there is only a little overhead per document: instantiate a predefined class, passing a list to the constructor. There are a few places where optimization could be done, but filter_documents() should already be efficient. A sketch of this mechanism follows.
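As a rough illustration of that mechanism (a sketch over a plain DB-API cursor, not the actual populse_db v2 source):

```python
def make_document_class(field_names):
    """Build a collection-specific row class whose field -> index
    mapping is computed once and shared by all its instances."""
    indices = {name: i for i, name in enumerate(field_names)}

    class Document:
        __slots__ = ("_row",)

        def __init__(self, row):
            self._row = row  # raw values of one SQL result row

        def __getitem__(self, key):
            # Access by integer index or by field name.
            return self._row[key if isinstance(key, int) else indices[key]]

    return Document


def filter_documents(cursor, select_sql):
    """One SELECT, one for loop, one lightweight object per row."""
    cursor.execute(select_sql)           # single query for the whole result
    doc_class = make_document_class([c[0] for c in cursor.description])
    for row in cursor:                   # generator: minimal per-row overhead
        yield doc_class(row)
```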

In my opinion, filling tables with possibly a lot of values should be done via a single populse_db call iterating over the selected documents, as in the sketch below. It may be necessary to add features to populse_db in order to cover every use case in a single call: for instance, the populse_db API supports neither pagination nor sorting on selected fields.
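For instance, filling a whole table could look like this (the None "match everything" filter and the set_cell table API are assumptions for illustration; the real populse_db filter syntax may differ):

```python
def fill_table(session, collection, field_names, table):
    """Single filter_documents() call feeding the whole table, instead
    of one get_value() request per cell."""
    for row, doc in enumerate(session.filter_documents(collection, None)):
        for col, field in enumerate(field_names):
            table.set_cell(row, col, doc[field])   # name-based access
```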

sapetnioc commented 4 years ago

I had a quick look at the code. If necessary, it is not difficult to add an order_by=[field_name, ...] parameter to filter_documents() to sort the result according to the values of one or several fields. And if we have a very big database, I could also add pagination support: for instance, a count_documents() method that would only return the number of documents matching a query (using SQL COUNT(*)), plus two new parameters to filter_documents() (sketched below):
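Something along these lines (the skip/limit parameter names are hypothetical stand-ins, since the two names actually proposed did not survive in this thread; the idea is standard SQL OFFSET/LIMIT pagination):

```python
PAGE_SIZE = 100

def show_page(session, collection, query, page, table):
    # count_documents() is the method proposed above (SQL COUNT(*)).
    total = session.count_documents(collection, query)
    # skip/limit are hypothetical parameter names mapping to the usual
    # SQL OFFSET/LIMIT pagination of the underlying SELECT.
    docs = session.filter_documents(collection, query,
                                    skip=page * PAGE_SIZE,
                                    limit=PAGE_SIZE)
    for row, doc in enumerate(docs):
        table.set_row(row, doc)
    return total
```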

With this API a GUI could set a threshold on the number of documents. Above this threshold, not all documents would be displayed but pagination would be used to navigate through the results.

denisri commented 4 years ago

I could simplify my code using filter_documents() and still reduce the number of requests (150 instead of 2250 on my database). There are still request performance issues, especially in the use of CapsulEngine (in the capsulengine branch for now), which also performs lots of requests: actually more than 4000 are performed when opening MIA, before the user GUI is ready, using the capsulengine branch, compared to only 38 on master. The populse_db API doesn't allow making requests on selected columns, right? It always retrieves full documents (I'm not sure this is a major issue by now).

sapetnioc commented 4 years ago

To date, the selected columns in the query are taken from an internal cache containing all the fields. But internally there is an explicit list of columns, so it would not be a big deal to add a fields=[field_name, ...] parameter to filter_documents() if necessary.
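Hypothetical usage, once such a parameter exists (the filter string and field names are made up for the example; neither the parameter nor this call exists at the time of this comment):

```python
# Restrict the SELECT to the two columns the view actually needs,
# instead of retrieving full documents.
for doc in session.filter_documents(collection, my_filter,
                                    fields=["FileName", "PatientName"]):
    print(doc["FileName"], doc["PatientName"])
```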

servoz commented 3 years ago

This ticket seems to have been resolved.