Index procedure profiling for a stand-alone use case

yuanzhou commented 8 months ago

Currently we use a workaround where a Collection node is attached to the Publication via a [:USES_DATA] relationship, which in turn points to those Datasets.

Publication 3c7273660cdf9ab91a7901533b2cd9a5 shown below as an example:

The same handling applies to 77ab35880329b5932380104aa58795a4 and 72cbeb8ff605fd5017cb2666cd19dfb7.

First, create a situation in a local Neo4j instance where a Publication node is generated from a large number of Datasets (over 200, for instance) as direct ancestors, without using a Collection node as a workaround.

Then profile entity-api and Neo4j to identify the bottleneck when indexing the Publication into Elasticsearch (we can create some test indices on DEV) with such a large number of direct ancestors and ancestors overall. Measure the total time and determine which part is causing the performance issue.

Are the entity-api triggers taking too much time?
Are some of the Neo4j queries slowing down the indexing?
What improvements can be made, either in search-api or entity-api?
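As a starting point for the profiling described above, here is a minimal sketch using Python's built-in `cProfile` to capture both the total wall-clock time and a `.prof` file for later inspection. The helper `profile_call`, the file name `index_run.prof`, and the stand-in function `fake_index_publication` are all hypothetical and are not part of entity-api or search-api; the real indexing call would be substituted in.

```python
import cProfile
import time


def profile_call(func, *args, prof_path="index_run.prof", **kwargs):
    """Run func under cProfile, write stats to prof_path, return (result, seconds)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always stop profiling, even if the profiled call raises.
        profiler.disable()
    elapsed = time.perf_counter() - start
    # The dumped file can be opened with pstats or snakeviz.
    profiler.dump_stats(prof_path)
    return result, elapsed


def fake_index_publication(ancestor_count):
    """Hypothetical stand-in for the real indexing call (assumption, not project code)."""
    return sum(len(str(i)) for i in range(ancestor_count))


if __name__ == "__main__":
    _, seconds = profile_call(fake_index_publication, 200)
    print(f"indexing stand-in took {seconds:.4f}s")
```

Wrapping a single call this way keeps the measured time and the per-function breakdown tied to the same run, which makes it easier to compare the Collection workaround against the direct-ancestor case.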

DerekFurstPitt commented 8 months ago

@yuanzhou profiles.zip

Attached is the profiler data from my latest successful run for both entity-api and search-api.

Entity API's profile is for individual entity-api calls. The Search API profile is a full capture of the entire indexing process.

To visualize these .prof files, I recommend using snakeviz for an effective icicle chart.

pip install snakeviz

followed by snakeviz <prof file name>
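If a browser is not available, the same `.prof` files can also be summarized as text with the standard-library `pstats` module. This is only a sketch of an alternative to snakeviz; the helper name `top_functions` is hypothetical.

```python
import io
import pstats


def top_functions(prof_path, limit=10):
    """Return a text report of the most expensive functions by cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buffer)
    # strip_dirs() shortens file paths; "cumulative" includes time spent in callees.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Calling `print(top_functions("<prof file name>"))` prints the sorted table to the terminal.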

This took just under 16 hours to run to completion with no threading or caching enabled.

I have noticed that within each individual entity-api call, about half of the total time is dedicated to calls to uuid-api (for the get_hubmap_ids). My plan is to host a local instance of uuid-api as well and do some profiling there. I'm also going to try to construct a tree of all the different nodes that are reindexed from a given starting entity.
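One possible way to build the tree of reindexed nodes is a breadth-first walk from the starting entity. The sketch below is an illustration only: `build_reindex_tree` and the `get_related` callback are hypothetical names, and in practice the callback would have to be backed by whatever lookup search-api actually uses to find the entities it reindexes.

```python
from collections import deque


def build_reindex_tree(start, get_related):
    """Breadth-first walk from start.

    Returns {node: [children first reached from that node]}, so each
    entity appears once even if several paths lead to it.
    """
    tree = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in get_related(node):
            if neighbor not in tree:
                tree[neighbor] = []
                tree[node].append(neighbor)
                queue.append(neighbor)
    return tree


if __name__ == "__main__":
    # Hypothetical toy graph: one publication, two datasets sharing a sample.
    graph = {"pub": ["ds1", "ds2"], "ds1": ["sample"], "ds2": ["sample"], "sample": []}
    print(build_reindex_tree("pub", lambda n: graph.get(n, [])))
```

Recording each entity only under the first parent that reaches it keeps the result a tree; comparing its size with the number of reindex calls seen in the profile would show how often the same entity is revisited.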

yuanzhou commented 8 months ago

@shirey @DerekFurstPitt I looked at the results (very useful) and was able to confirm the places that can be optimized to improve the efficiency, which are in line with the established tasks from each milestone:

In addition, I also identified a few other interesting places that definitely can be optimized outside the index scope in entity-api.

We will focus on the above tasks and leave the uuid-api profiling for the future sprint.

hubmapconsortium / search-api

Index procedure profiling for a stand-alone use case #748