New features oriented to improve index procedure efficiency

yuanzhou commented 8 months ago

During indexing, all we need are the node properties read directly from Neo4j AND a few properties generated by on_read_trigger. But the current GET /entities/<id> returns ALL trigger-generated properties, which is inefficient and unnecessary. The index procedure then has to remove the ones that are not specified in the mapping json with lots of for loops. This repetitive work wastes time and further reduces efficiency.

To address the above issues in the entity-api,

Introduce a new trigger type: on_index_trigger, which has the same trigger method as regular on_read_trigger.
Add a specialized schema_manager method get_complete_document_result() which is very similar to get_complete_entity_result() (for now do NOT modify this method), but uses on_index_trigger instead.
For caching, use a different prefix on those index data, cache_key = f'{_memcached_prefix}_complete_index_{entity_uuid}'
Introduce a new flag in schema yaml, indexed: false (default is true), for indexing purposes. This allows us to remove the use of json mapping in search-api. When a field has exposed: false there's no need to check indexed: false, since that field won't be exposed by entity-api anyway. When a field has only indexed: false, it still gets returned by the regular GET call, but we won't index it.
Add a specialized schema_manager method normalize_document_result_for_response() based on the current normalize_entity_result_for_response() and integrate with the indexed flag.
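
A minimal sketch of how the indexed flag and the separate cache prefix could interact, assuming a simplified in-memory schema. The property names in SCHEMA_PROPERTIES and the helpers normalize_for_response(), normalize_for_index() and index_cache_key() are hypothetical stand-ins, not the actual schema_manager code:

```python
# Hypothetical, heavily simplified stand-in for the properties section
# of the schema yaml. Both flags are assumed to default to True.
SCHEMA_PROPERTIES = {
    "uuid": {},
    "title": {},
    "local_directory_rel_path": {"indexed": False},
    "internal_note": {"exposed": False},
}


def is_exposed(prop_def):
    return prop_def.get("exposed", True)


def is_indexed(prop_def):
    # exposed: false wins, so indexed does not need to be checked for it
    return is_exposed(prop_def) and prop_def.get("indexed", True)


def normalize_for_response(entity, properties=SCHEMA_PROPERTIES):
    """Regular GET: keep every known property that is exposed."""
    return {k: v for k, v in entity.items()
            if k in properties and is_exposed(properties[k])}


def normalize_for_index(entity, properties=SCHEMA_PROPERTIES):
    """Index document: additionally drop properties with indexed: false."""
    return {k: v for k, v in entity.items()
            if k in properties and is_indexed(properties[k])}


def index_cache_key(memcached_prefix, entity_uuid):
    # Same pattern as the cache_key proposed above, kept apart from
    # the cache entries of the regular entity result
    return f"{memcached_prefix}_complete_index_{entity_uuid}"


entity = {
    "uuid": "abc123",
    "title": "Example",
    "local_directory_rel_path": "/x",
    "internal_note": "n",
}
response_doc = normalize_for_response(entity)
index_doc = normalize_for_index(entity)
```

With these assumptions, a field marked only indexed: false still shows up in response_doc but not in index_doc, and a field marked exposed: false shows up in neither.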

Afterwards, create a specialized endpoint (based on the current GET /entities/<id> but with no property filtering needed): GET /documents/<id>, which returns a subset of the regular entity json to be used for indexing, without including the following generated fields (not used by the index process):

Donor: None, since no on_read_trigger
Sample: direct_ancestor
Dataset/Publication (including revisions): collections, upload, direct_ancestors, local_directory_rel_path
Publication-specific: ~~associated_collection~~ (still need this field to be indexed)

AFTER the new features get tested, switch to use this new endpoint in search-api: https://github.com/hubmapconsortium/search-api/issues/756

kburke commented 7 months ago

As I understand directions in the Description, I believe local_directory_rel_path should not be a part of the OpenSearch document.

However, it is a part of the Neo4j entity data returned by query_target_entity(). The on_read_trigger for this property modifies it to add the "scope directory" as a prefix, and / as a suffix. i.e. /Stanford TMC/86c5f68ae891b72357791a0de0a3308a becomes public/Stanford TMC/86c5f68ae891b72357791a0de0a3308a/
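
To make the transformation concrete, here is a rough sketch of what the on_read_trigger does to the value. The function name add_scope_to_rel_path() is hypothetical, and the scope directory is passed in as a parameter because how the real trigger determines it is not covered in this thread:

```python
def add_scope_to_rel_path(rel_path, scope_dir):
    """Prefix the scope directory and append a trailing slash.

    Hypothetical helper: it only mirrors the before/after example
    in the comment above, not the real trigger method.
    """
    return f"{scope_dir.rstrip('/')}/{rel_path.strip('/')}/"


neo4j_value = "/Stanford TMC/86c5f68ae891b72357791a0de0a3308a"
read_value = add_scope_to_rel_path(neo4j_value, "public")
```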

I think this is a kind of middle ground between "don't create things just to delete them" and not putting incorrect info in the OSS document. Initially, I am going to

Not put an on_index_trigger on local_directory_rel_path: in provenance_schema.yaml
Remove what Neo4j returned within my new app.py method using the properties_to_exclude argument to schema_manager.normalize_entity_result_for_response().
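
Roughly, the two steps could look like the sketch below. SCHEMA, TRIGGERS, apply_triggers() and drop_properties() are simplified, hypothetical stand-ins for the yaml schema and the schema_manager logic, and the trigger bodies are placeholders:

```python
# Hypothetical slice of the schema: local_directory_rel_path keeps its
# existing on_read_trigger and gets NO on_index_trigger.
SCHEMA = {
    "local_directory_rel_path": {"on_read_trigger": "add_scope"},
    "title": {"on_read_trigger": "make_title",
              "on_index_trigger": "make_title"},
}

# Placeholder trigger methods keyed by the names used in SCHEMA
TRIGGERS = {
    "add_scope": lambda entity: f"public{entity['local_directory_rel_path']}/",
    "make_title": lambda entity: f"Dataset {entity['uuid']}",
}


def apply_triggers(entity, trigger_type):
    """Run only the triggers of the given type, leave other values as-is."""
    result = dict(entity)
    for name, prop_def in SCHEMA.items():
        method_name = prop_def.get(trigger_type)
        if method_name is not None:
            result[name] = TRIGGERS[method_name](entity)
    return result


def drop_properties(entity, properties_to_exclude):
    """Remove properties that should not end up in the returned json."""
    return {k: v for k, v in entity.items()
            if k not in properties_to_exclude}


neo4j_entity = {"uuid": "abc", "local_directory_rel_path": "/Stanford TMC/abc"}

# GET /entities/<id>: the existing on_read_trigger behaviour is untouched
read_result = apply_triggers(neo4j_entity, "on_read_trigger")

# GET /documents/<id>: no trigger runs on the field, so the raw Neo4j
# value is still there and has to be excluded explicitly
document = drop_properties(
    apply_triggers(neo4j_entity, "on_index_trigger"),
    ["local_directory_rel_path"],
)
```

The exclusion step is needed because skipping the trigger alone would leave the unprefixed Neo4j value in the document.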

kburke commented 7 months ago

@yuanzhou I got an email with your feedback on my comment above. I mis-typed in that comment, and have corrected it. Your direction and my implementation should align regarding the first bullet point, and I have implemented the second bullet point.

And, for some reason, your feedback is in my Inbox, but not visible to me on this issue...

@kburke let's NOT make any changes to the existing on_read_trigger for the local_directory_rel_path field, which may be in use by other consumers. We only need to skip this field for the new on_index_trigger so it doesn't get indexed via the new endpoint GET /documents/<id>.

yuanzhou commented 7 months ago

@kburke good catch on this special case!

By saying

Not put an on_read_trigger on local_directory_rel_path: in provenance_schema.yaml

Did you actually mean on_index_trigger? We do NOT want to make any changes to the existing on_read_trigger that has been used by the GET /entities/<id> endpoint.

What you proposed should only apply to the new endpoint GET /documents/<id>, which uses the new trigger type on_index_trigger. That's why I initially had this properties_to_exclude, to skip certain fields before running a given trigger on them.

kburke commented 7 months ago

@yuanzhou Yes, I did mean on_index_trigger, and have corrected my previous comment based on yours. Maybe it's April Fool's typing getting to us both...

hubmapconsortium / entity-api

New features oriented to improve index procedure efficiency #630