odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

Search performance using external repository proxy #3344

Closed lpalashevski closed 3 years ago

lpalashevski commented 4 years ago

Purpose

Open Lineage services rely on asset search functionality as an entry point. While doing integration testing we noticed poor search performance, and we would like to share the results so we can discuss optimisation or potential improvement options.

Testing environment and configuration

Server A - OMAG platform hosting one server with the following subsystems enabled:

- Local Repository (embedded JanusGraph with a Cassandra backend store)
- Access Services: AssetCatalog (default search types enabled)

Server B - OMAG platform hosting one server configured as an external repository proxy (using the IBM IGC connector)

Server C - UI platform hosting Egeria UI server (configured to access AssetCatalog on Server A)

Server A and Server B use the same Kafka event bus cluster, are members of the same cohort, and exchanged registration events successfully.

The servers used are RHEL virtual machines with 6 GB memory, 2 CPUs, and JVM options -Xms3G -Xmx3G.

External repository content

| IGC Type | OMRS Type | Items total |
| --- | --- | --- |
| data_file | DataFile | 3366 |
| data_file_field | TabularColumn | 61115 |
| database_column | RelationalColumn | 80784 |
| database_table | RelationalTable | 3456 |
| dsjob | Process (?) | 10360 |
| terms | GlossaryTerm | 2864 |

Note: the numbers are from a test instance of one of the enterprise repositories.

Details on what was measured

To capture internal metrics, the quickest way was to add extra logging on Server A, recording execution times at the OMRS level. We looked at

org.odpi.openmetadata.repositoryservices.enterprise.repositoryconnector.control.ParallelFederationControl

and captured execution times for calls to the method below:

executor.issueRequestToRepository(metadataCollectionId, metadataCollection);

This way we get an idea of how the original requests coming from the AssetCatalog OMAS service are spread between the repositories (local and remote).
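
For reference, a minimal sketch of the kind of timing instrumentation we added around that call (the logger and variable names here are illustrative, not the actual Egeria code):

```java
// Illustrative timing wrapper around the federated repository call inside
// ParallelFederationControl; the logger and elapsed-time handling are assumed.
long start = System.nanoTime();
executor.issueRequestToRepository(metadataCollectionId, metadataCollection);
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
log.debug("Repository {} responded in {} ms", metadataCollectionId, elapsedMs);
```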

Results for specific search use-cases will follow.

marius-patrascu commented 4 years ago

Below are the results obtained during measurements.

| Queried String | Requested Type | Local repository (ms) | Remote repository (ms) | Notes |
| --- | --- | --- | --- | --- |
| NAME | RelationalColumn | 1444 | 69320 | |
| NAME | RelationalColumn | 1649 | 69104 | |
| NA* | RelationalColumn | 101 | 13661 | empty result returned |
| N | RelationalColumn | 3369 | 300026 | timeout |
| NAME | TabularColumn | 867 | 73346 | |
| NAME | RelationalTable | 182 | 953 | |
| NAME | DataFile | 1010 | 499 | empty result returned |
| NAME | Process | 366 | 16 | empty result returned |
| NAME | GlossaryTerm | 613 | 3829 | |

cmgrote commented 4 years ago

I think we talked about this on a previous call, but one quick thing to try would be to change the page sizes used against the repositories. I think by default they'll use a page size of 500-1000 (ie. if no pageSize is provided, they'll use the default maximum which is one of these). The pageSize is a parameter to every find... method, so should be fairly simple to limit it that way?
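
As a rough sketch of what capping the page size could look like on the caller side (the constant, helper method, and default value below are illustrative assumptions, not existing Egeria code):

```java
// Hypothetical guard applied before issuing a federated find... call: if the
// caller supplies no pageSize (0) or something larger than our cap, fall back
// to a small default instead of the repository maximum.
private static final int SEARCH_PAGE_SIZE_CAP = 10;

private int effectivePageSize(int requestedPageSize)
{
    if (requestedPageSize <= 0 || requestedPageSize > SEARCH_PAGE_SIZE_CAP)
    {
        return SEARCH_PAGE_SIZE_CAP;
    }
    return requestedPageSize;
}
```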

Doing some minimal testing on a tiny environment of my own (probably not even the minimum hardware spec for IGC, with 1350 terms in it), I can see that running the same search with a page size of 500 takes 7-10 seconds vs. a page size of 10 consistently taking 0.5 seconds (this is just the IGC timing for the response, not translating all of that into Egeria objects, but I believe the IGC response time is the bulk of the overall response time). For IGC alone the payload difference is 110KB for the 500 results vs 3.5KB for the 10 results. (After translating into Egeria objects I'd estimate that the payloads will likely be about 10x these sizes: so > 1MB for the 500 results vs 30KB for the 10 results.)

I'd be in favour of dropping down the page size significantly (taking a page from Google's book: only the first 10 results?), but also adding a property to the search results bean that gives an indication of the approximate total number of results -- that would allow us to give an indication to users that they might want to narrow their search if what they're looking for is not on the first page of results and there are 100's-1000's more (?) (This total number of results would of course be per-repository, and thus significant risk that reference copies end up being double-counted vs. master copies, but as an "indication" might still be "good enough"?)

These would hopefully be fairly (very?) lightweight changes, but could have a significant effect in the near-term...
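
For what it's worth, a minimal sketch of what such a results bean could look like (the class and field names are hypothetical, not an existing Egeria bean):

```java
import java.util.List;

// Hypothetical search results bean carrying an indicative total alongside the
// current page, so the UI can suggest narrowing the search when it is large.
public class AssetSearchResults<T>
{
    private List<T> pageOfResults;      // the requested page (e.g. first 10 results)
    private long    approximateTotal;   // per-repository count; reference copies may be double-counted

    public List<T> getPageOfResults() { return pageOfResults; }
    public void setPageOfResults(List<T> pageOfResults) { this.pageOfResults = pageOfResults; }

    public long getApproximateTotal() { return approximateTotal; }
    public void setApproximateTotal(long approximateTotal) { this.approximateTotal = approximateTotal; }
}
```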

lpalashevski commented 4 years ago

We did similar tests by playing with the pageSize parameter in direct REST calls to IGC. Indeed, there is a significant difference in IGC search execution time - unfortunately we did not save the numbers, but they were quite similar to what Chris already provided.

I also support the idea of decreasing the default pageSize (having it configurable in the connector is probably good practice) and I agree with this as a general principle. Also, returning the total number of results is helpful and can be used to optimise the interaction on the user side. (I think all IGC paged responses have it as a property, so propagating it in Egeria will be a win for future optimisations as well.)

The first 10 results should matter most in most of the cases where relevance criteria are included, but this is not the case now. This raises again the discussion of relevance vs. completeness, or search vs. query, especially when thinking about more specific context-driven search for an access service (as it is for finding lineage-worthy entities).

From what I know now, we do not have an access-service-level paging mechanism (at least not in AssetCatalog OMAS) that works for federated search results. We can re-open this discussion, but as far as I remember it is a rather complicated area and currently there is no solution in place.

As a last remark, from a UI user perspective we do have the requirement to introduce paging. The question is whether it should apply to the existing wide cohort query based on ALL property values for the types, or whether we should start thinking about a new approach where we can search on relevance (then think about limiting to the list of properties valid for the context) and then rely on indexing and/or full-text search capabilities.

cmgrote commented 4 years ago

Another quick update I've just made to the IGC connector (latest master) is to provide a simple caching mechanism that should remove (some) duplicate work on the IGC side for calculating search results.

Previously, any search for a reasonably "interesting" object (ie. one with classifications) would involve finding not just the main object in IGC, but then needing to run subsequent API calls (internally) to IGC to determine the classifications applied to each result. For example: searching for terms would require running one (big) query to find the terms that match (which might take 500+ms itself), but then also running a second query (per result) to determine each term's Confidentiality classification -- even when the Confidentiality classification of many of the terms may be the same. While each of these secondary queries would only typically take 50-60ms, that naturally adds up when there are many results (eg. even for just 10 results, with the same classification this would equate to 500-600ms of the overall response time).

With this new simple caching mechanism, I only need to retrieve the details of the classification once (50-60ms) and then can simply re-use these, avoiding 450-550ms of duplicate work (and therefore delay): potentially halving the response time on even a fairly small page of results, and possibly removing full seconds or even a minute of work on a large page of results.

The "cache" itself is local to and only exists for the duration of a single Egeria API call -- it's simply caching any of the internal multiple IGC calls that may be necessary to serve that single Egeria API call -- so there should be minimal (if any) issues related to "refreshing" the cache or having it become excessively large. The only potential issue would be in regards to the size of the JVM heap, if there are a huge number of concurrent queries happening against the connector that cause many such caches to be created and exist in parallel (but I suspect even this is a fairly minimal risk).

Classification was the one I had in mind where duplicate work was clearly being done, but there may be other areas where similar gains could be quickly achieved; however, I don't have any others top of mind at this point. So any profiling you can do, and results you can share, on the performance within your particular environment would still help greatly in picking off any other low-hanging fruit on the IGC connector side.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.