Closed yuanzhou closed 7 months ago
@yuanzhou profiles.zip
Attached is the profiler data of my latest successful run for both entity api and search api.
Entity API's profile is for individual entity-api calls. The Search API profile is a full capture of the entire indexing process.
To visualize these .prof files, I recommend using snakeviz for an effective icicle chart:
pip install snakeviz
followed by
snakeviz <prof file name>
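For anyone who wants to reproduce this kind of capture or inspect a .prof file without a browser, here is a minimal sketch using only the standard library's cProfile and pstats. The `slow_sum` workload and the helper names are hypothetical stand-ins, not code from entity-api or search-api.

```python
import cProfile
import pstats
from pathlib import Path


def slow_sum(n: int) -> int:
    """Hypothetical stand-in workload; replace with the call to profile."""
    return sum(i * i for i in range(n))


def profile_to_file(prof_path: Path, n: int = 100_000) -> int:
    """Run slow_sum under cProfile and dump the stats to prof_path."""
    profiler = cProfile.Profile()
    result = profiler.runcall(slow_sum, n)
    profiler.dump_stats(str(prof_path))
    return result


def top_functions(prof_path: Path, limit: int = 5) -> list[tuple[str, float]]:
    """Return (function name, cumulative seconds), slowest first."""
    stats = pstats.Stats(str(prof_path))
    rows = [
        # key is (file, line, name); value[3] is the cumulative time
        (func[2], timings[3])
        for func, timings in stats.stats.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

The resulting file can be opened with snakeviz in the same way as the attached profiles.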
The run took just under 16 hours to complete, with no threading or caching enabled.
I have noticed that within each individual entity-api call, about half of the total time is spent on calls to uuid-api (for get_hubmap_ids). My plan is to host a local instance of uuid-api as well and do some profiling there. I'm also going to try to construct a tree of all the different nodes that are reindexed from a given starting entity.
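One way to build the reindex tree is a breadth-first walk that records each node the first time it is reached. This is only a sketch: the `neighbors` mapping (entity id to the ids that get reindexed because of it) is a hypothetical input that would have to be collected from the actual indexing code or from Neo4j.

```python
from collections import deque


def build_reindex_tree(start: str, neighbors: dict[str, list[str]]) -> dict[str, list[str]]:
    """Breadth-first walk from start; maps each node to the children first reached from it."""
    tree: dict[str, list[str]] = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in tree:  # skip nodes already reached via another path
                tree[nxt] = []
                tree[node].append(nxt)
                queue.append(nxt)
    return tree


def render(tree: dict[str, list[str]], root: str, depth: int = 0) -> list[str]:
    """Return the tree as indented text lines, two spaces per level."""
    lines = ["  " * depth + root]
    for child in tree[root]:
        lines.extend(render(tree, child, depth + 1))
    return lines
```

Because each node is attached only to the first parent that reaches it, the output shows one path per node rather than every relationship. Counting how often a node is skipped as already seen would show how much reindexing work is repeated.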
@shirey @DerekFurstPitt I looked at the results (very useful) and was able to confirm the places that can be optimized to improve efficiency, which are in line with the established tasks from each milestone:
In addition, I identified a few other interesting places in entity-api, outside the indexing scope, that can definitely be optimized.
We will focus on the above tasks and leave the uuid-api profiling for a future sprint.
Currently we use a workaround where a Collection node is attached to the Publication via a [:USES_DATA] relationship, which in turn points to those Datasets.
Publication 3c7273660cdf9ab91a7901533b2cd9a5 is shown below as an example:
Similar handling also applies to 77ab35880329b5932380104aa58795a4 and 72cbeb8ff605fd5017cb2666cd19dfb7.
First, create a situation in a local Neo4j instance where a Publication node is generated from a large number of Datasets (over 200, for instance) as direct ancestors, without using a Collection node as a workaround.
Then do some entity-api and Neo4j profiling to identify the bottleneck when indexing the Publication into Elasticsearch (we can create some testing indices on DEV) with such a large number of direct ancestors and ancestors. Measure the total time and pinpoint where the performance issue comes from.
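The first step could be scripted along these lines. This is a rough sketch under assumptions: the node labels (`Publication`, `Activity`, `Dataset`), the `uuid` property, and the `ACTIVITY_INPUT` / `ACTIVITY_OUTPUT` relationship types are my guess at the provenance layout and should be checked against the real schema before use. The function only builds a Cypher statement and its parameters; it does not connect to Neo4j.

```python
import uuid


def build_test_publication(num_datasets: int = 200) -> tuple[str, dict]:
    """Build a Cypher query and parameters for one Publication with many direct Dataset ancestors."""
    if num_datasets < 1:
        raise ValueError("num_datasets must be at least 1")
    params = {
        "publication_uuid": uuid.uuid4().hex,
        "activity_uuid": uuid.uuid4().hex,
        "dataset_uuids": [uuid.uuid4().hex for _ in range(num_datasets)],
    }
    # Labels and relationship types below are assumptions, not confirmed schema.
    query = (
        "CREATE (p:Publication {uuid: $publication_uuid})\n"
        "CREATE (a:Activity {uuid: $activity_uuid})\n"
        "CREATE (a)-[:ACTIVITY_OUTPUT]->(p)\n"
        "WITH a\n"
        "UNWIND $dataset_uuids AS dataset_uuid\n"
        "CREATE (d:Dataset {uuid: dataset_uuid})\n"
        "CREATE (d)-[:ACTIVITY_INPUT]->(a)"
    )
    return query, params
```

The query and parameters can then be run against the local instance with whatever Neo4j client is already in use. Real entities carry many more properties than `uuid`, so indexing will likely need those filled in as well.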