fepegar opened this issue 4 years ago
This is a table (GitHub doesn't accept CSV attachments) with the benchmark results. Related to #2 and #7.
Are these numbers in seconds? I'll have to look into this, but I think once I do the manual checks and pickle the DataFrame (instead of preprocessing the Excel file every time a query is made), this issue should be resolved. I can't see it being related to e.g. long lists in the semiology dictionary, as others such as epigastric and autonomous-vegetative also have long lists but don't take as long.
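To make sure we mean the same thing, here is a minimal sketch of the pickling idea (the file paths and the preprocessing step are placeholders, not the actual module code):

```python
from pathlib import Path

import pandas as pd

# Paths are assumptions for illustration, not the actual repository layout.
EXCEL_PATH = Path("resources/Semio2Brain_Database.xlsx")
PICKLE_PATH = EXCEL_PATH.with_suffix(".pkl")


def load_dataframe() -> pd.DataFrame:
    """Return the cached DataFrame, rebuilding it from the Excel file only if needed."""
    if PICKLE_PATH.is_file():
        return pd.read_pickle(PICKLE_PATH)
    df = pd.read_excel(EXCEL_PATH)  # slow: done once, then cached on disk
    # ...manual checks / preprocessing would go here...
    df.to_pickle(PICKLE_PATH)
    return df
```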
As I said in #2, ideally the Excel file should be read only once and the resulting data frame cached. Then the information can be extracted from that data frame without reading and parsing the Excel file many times per query.
Also, I think it would be best to have a CSV, not an Excel file. Reading the Excel file with pandas takes 140 ms, which is a lot. When I save that DataFrame as a CSV, loading it takes 23 ms. Still quite a lot, hence the need to cache.
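Something along these lines would be enough for the in-memory caching, I think (a rough sketch; the CSV path is an assumption):

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=1)
def get_dataframe(csv_path: str = "resources/semio2brain.csv") -> pd.DataFrame:
    # Parsed once per process (~23 ms); every later call returns the cached object.
    return pd.read_csv(csv_path)
```

One caveat: the cached DataFrame is shared, so callers should treat it as read-only (or work on a `.copy()`) to avoid polluting later queries.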
Have you seen the DataFrames? Check the branch, under resources.
Which data frames? What branch do you mean?
Here, in SVT: there is only one other branch now besides master.
Why is there a second branch anyway? Are you planning to merge it into master?
The other branch has been merged already. I'm now keeping it for verbose printouts, to let me manually inspect the outputs and keep my notebooks working for now.
Reading the data frame once didn't help. I think the Excel file is being read many other times in the code. Actually, things are slower now for some reason. "Hypermotor" takes more than 5 minutes!
I think the problem is that the Excel file is read many times by big_map and gifs_lat in QUERY_LATERALISATION.
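One fix (a sketch only; the real signatures of QUERY_LATERALISATION, big_map and gifs_lat may differ) would be to read the file once at the top level and pass the DataFrame down, instead of letting each helper open the Excel file again:

```python
import pandas as pd


def query_lateralisation(df: pd.DataFrame, semiology: str):
    # big_map / gifs_lat would take `df` as an argument instead of re-reading the file
    ...


df = pd.read_excel("resources/Semio2Brain_Database.xlsx")  # read once
for semiology in ["Hypermotor", "Epigastric"]:
    query_lateralisation(df, semiology)
```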
I see you updated these; let me know what has happened to the execution times.
So I don't see those crazy times like 5 minutes any more, but they're back to 0-15 seconds. It feels like the code loops for longer on some semiologies than on others. We need to find the bottleneck by debugging and/or profiling. I've never used a profiler, but I think it would be very useful, especially with a GUI.
I'd have to look up profiling too.
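For reference, the standard-library profiler is enough to get started; `run_query` below is a placeholder for whichever call we want to time, and snakeviz (pip-installable) can render the resulting .prof file as an interactive GUI:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_query("Hypermotor")  # placeholder for the actual query call
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time
stats.dump_stats("hypermotor.prof")  # open with `snakeviz hypermotor.prof` for a GUI
```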
New benchmark.
There doesn't seem to be a relationship between the length of the semiology and the query time, so we will have to profile later.
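For the next round, a small timing loop like this (the query function and the semiology list are placeholders) would make the benchmark easy to rerun and compare:

```python
import time


def benchmark(query_fn, semiologies):
    """Time each query and return a {semiology: seconds} mapping."""
    timings = {}
    for semiology in semiologies:
        start = time.perf_counter()
        query_fn(semiology)
        timings[semiology] = round(time.perf_counter() - start, 3)
    return timings
```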
What are your thoughts on running the entire database once, now that the database is complete, and adding the resulting GIF structures and values as a dictionary under resources, with a version number? Then, when we actually use the 3D Slicer SVT, tick a semiology and click on Update visualisation, instead of running query_lateralisation or query_semiology etc., it simply reads off the dictionary.
Concretely: whenever the SemioDict (YAML file), mega_analysis (module) and/or the Semio2Brain Database are updated, we run the entire thing again and update the GIF outputs (and bump the version). To do this we can use the following to obtain the GIF structures (a minimal sketch of the precomputation step follows the checklist):
[x] debugging_multi_postictals_neutral_also.py
[x] debugging_multi_postictals_neutral_only.py
[x] debugging_multi_terms_neutral_also.py
[x] debugging_multi_terms_neutral_only.py
[ ] an equivalent script to above reading off the remaining semiologies in semiologies_lateralised_only_default_list.txt
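As a sketch of what the precomputation step could look like (the output file name, result structure and helper names are assumptions, not the current code):

```python
import json
from pathlib import Path

VERSION = "1.0.0"  # bump whenever the SemioDict, mega_analysis or Semio2Brain change


def precompute(semiologies, query_fn):
    """Run every semiology once and keep the resulting GIF structures/values."""
    results = {"version": VERSION}
    for semiology in semiologies:
        results[semiology] = query_fn(semiology)  # e.g. {GIF label: value}
    return results


def save(results, path=Path("resources/precomputed_gif_results.json")):
    path.write_text(json.dumps(results, indent=2))
```

The Slicer module would then just load this file when the user ticks a semiology and clicks Update visualisation, instead of re-running the queries.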
Profiling in #208
Related to #2.