hubmapconsortium / entity-api

A set of web service calls to return information about HuBMAP entities
https://entity.api.hubmapconsortium.org
MIT License
3 stars 1 forks source link

New features oriented to improve index procedure efficiency #630

Closed yuanzhou closed 2 months ago

yuanzhou commented 4 months ago

During indexing, all we need are the node properties from Neo4j directly AND a few on_read_trigger generated ones. But the current GET /entitites/<id> returns ALL trigger generated properties, totally inefficient and unnecessary. And the index procedure has to remove the ones that are not specified in the mapping json with lots of for loops, such repetitive work is a total waste of time and further reduces the efficiency.

To address the above issues in the entity-api,

Afterwards, create a specialized endpoint (based on the current GET /entities/<id> but with no property filtering needed): GET /documents/<id> which returns a subset of the regular entity json to be used for index, without including the following generated fields (not used by index process):

AFTER the new features get tested, switch to use this new endpoint in search-api: https://github.com/hubmapconsortium/search-api/issues/756

kburke commented 3 months ago

As I understand directions in the Description, I believe local_directory_rel_path should not be a part of the OpenSearch document.

However, it is a part of the Neo4j entity data returned by query_target_entity(). The on_read_trigger for this property modifies it to add the "scope directory" as a prefix, and / as a suffix. i.e. /Stanford TMC/86c5f68ae891b72357791a0de0a3308a becomes public/Stanford TMC/86c5f68ae891b72357791a0de0a3308a/

I think this is a kind of middle ground between "don't create things just to delete them" and not putting incorrect info in the OSS document. Initially, I am going to

kburke commented 3 months ago

@yuanzhou I got an email for your feedback on my comment above. I mis-typed on the comment, and have corrected it. You direction and my implementation should align, regarding the first bullet point, and I have implemented the second bullet point.

And, for some reason, your feedback is in my Inbox, but not visible to me on this issue...

@kburke let's do NOT make any changes to the existing on_read_trigger for this local_directory_rel_path field, which may be in use by other consumers. We only need to skip this field for the new on_index_trigger so it doesn't get indexed via the new endpoint GET /documents/.

yuanzhou commented 3 months ago

@kburke good catch on this special case!

By saying

  • Not put an on_read_trigger on local_directory_rel_path: in provenance_schema.yaml

Did you actually mean on_index_trigger? We do NOT want to make any changes to the existing on_read_triggger that has been used by the GET /entities/<id> endpoint.

What you proposed should only apply to this new endpoint GET /documents/<id> with using this new trigger type on_index_trigger. That's why I had this properties_to_exclude initially to skip certain fields before running a given trigger on it.

kburke commented 3 months ago

@yuanzhou Yes, I did mean on_index_trigger, and have corrected my previous comment based on yours. Maybe it's April Fool's typing getting to us both...