Closed (yuanzhou closed this issue 6 months ago)
As I understand the directions in the Description, I believe local_directory_rel_path
should not be a part of the OpenSearch document.
However, it is a part of the Neo4j entity data returned by query_target_entity(). The on_read_trigger
for this property modifies it to add the "scope directory" as a prefix, and /
as a suffix.
i.e. /Stanford TMC/86c5f68ae891b72357791a0de0a3308a
becomes public/Stanford TMC/86c5f68ae891b72357791a0de0a3308a/
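A minimal sketch of the transformation described above, for illustration only. The function name and the way the scope directory is passed in are assumptions, not the actual entity-api trigger signature.

```python
def add_scope_prefix(rel_path: str, scope_dir: str) -> str:
    """Hypothetical helper: prefix the scope directory and append a trailing '/'."""
    # Strip surrounding slashes so the pieces join with exactly one '/' each
    trimmed = rel_path.strip('/')
    return f"{scope_dir}/{trimmed}/"


print(add_scope_prefix('/Stanford TMC/86c5f68ae891b72357791a0de0a3308a', 'public'))
```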
I think this is a kind of middle ground between "don't create things just to delete them" and not putting incorrect info in the OSS document. Initially, I am going to:
- Not put an on_index_trigger on local_directory_rel_path: in provenance_schema.yaml
- properties_to_exclude argument to schema_manager.normalize_entity_result_for_response()
@yuanzhou I got an email for your feedback on my comment above. I mistyped in the comment, and have corrected it. Your direction and my implementation should align regarding the first bullet point, and I have implemented the second bullet point.
And, for some reason, your feedback is in my Inbox, but not visible to me on this issue...
@kburke let's NOT make any changes to the existing on_read_trigger for this local_directory_rel_path field, which may be in use by other consumers. We only need to skip this field for the new on_index_trigger so it doesn't get indexed via the new endpoint GET /documents/.
@kburke good catch on this special case!
By saying
- Not put an on_read_trigger on local_directory_rel_path: in provenance_schema.yaml
Did you actually mean on_index_trigger? We do NOT want to make any changes to the existing on_read_trigger that has been used by the GET /entities/<id> endpoint.
What you proposed should only apply to this new endpoint GET /documents/<id>, using this new trigger type on_index_trigger. That's why I had this properties_to_exclude initially, to skip certain fields before running a given trigger on them.
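As a rough sketch of skipping fields before any trigger runs on them: the function below and its trigger table are hypothetical and only illustrate the idea, not how schema_manager actually applies properties_to_exclude. It also assumes the stored value should be dropped along with the trigger output.

```python
from typing import Any, Callable, Dict, Iterable

# Hypothetical trigger table: property name -> function that generates its value
Trigger = Callable[[Dict[str, Any]], Any]


def run_triggers(entity: Dict[str, Any],
                 triggers: Dict[str, Trigger],
                 properties_to_exclude: Iterable[str] = ()) -> Dict[str, Any]:
    excluded = set(properties_to_exclude)
    # Assumption: excluded properties are dropped even if stored on the node
    result = {k: v for k, v in entity.items() if k not in excluded}
    for prop, trigger in triggers.items():
        if prop in excluded:
            continue  # the trigger is never executed for an excluded property
        result[prop] = trigger(entity)
    return result


calls = []


def fake_path_trigger(entity: Dict[str, Any]) -> str:
    # Stand-in for a read/index trigger; records that it was invoked
    calls.append('local_directory_rel_path')
    return 'public' + entity['local_directory_rel_path'] + '/'


entity = {'uuid': 'abc', 'local_directory_rel_path': '/Stanford TMC/abc'}
doc = run_triggers(entity,
                   {'local_directory_rel_path': fake_path_trigger},
                   properties_to_exclude=['local_directory_rel_path'])
print(doc)    # {'uuid': 'abc'}
print(calls)  # []
```

Filtering before the loop means no work is spent generating a value that would be deleted afterwards.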
@yuanzhou Yes, I did mean on_index_trigger, and have corrected my previous comment based on yours. Maybe it's April Fool's typing getting to us both...
During indexing, all we need are the node properties from Neo4j directly AND a few on_read_trigger generated ones. But the current GET /entities/<id> returns ALL trigger-generated properties, which is inefficient and unnecessary. The index procedure then has to remove the ones that are not specified in the mapping json with lots of for loops; such repetitive work wastes time and further reduces efficiency.
To address the above issues in the entity-api:
- on_index_trigger, which has the same trigger method as the regular on_read_trigger.
- get_complete_document_result(), which is very similar to get_complete_entity_result() (for now do NOT modify this method), but uses on_index_trigger instead.
- cache_key = f'{_memcached_prefix}_complete_index_{entity_uuid}'
- indexed: false (default is true) for indexing purposes. This allows us to remove the use of json mapping in search-api. When exposed: false is set, there's no need to check indexed: false, since the field won't be exposed by entity-api. When there's only indexed: false, the field still gets returned by the regular GET call, but we won't index it.
- normalize_document_result_for_response(), based on the current normalize_entity_result_for_response() and integrated with the indexed flag.
Afterwards, create a specialized endpoint (based on the current GET /entities/<id> but with no property filtering needed): GET /documents/<id>, which returns a subset of the regular entity json to be used for indexing, without including the following generated fields (not used by the index process):
- direct_ancestor
- collections
- upload
- direct_ancestors
- local_directory_rel_path (still need this field to be indexed)
- associated_collection
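One way the exposed / indexed flags could interact, as a sketch under stated assumptions: the schema excerpt, property names, and function names below are hypothetical, and properties missing from the schema are assumed to be dropped. The real provenance_schema.yaml layout and normalize_* implementations may differ.

```python
from typing import Any, Dict

# Hypothetical schema excerpt; real property names and YAML layout may differ
SCHEMA: Dict[str, Dict[str, bool]] = {
    'uuid': {},                           # exposed and indexed by default
    'description': {'indexed': False},    # returned by the regular GET, not indexed
    'internal_flag': {'exposed': False},  # never returned by entity-api
}


def is_exposed(spec: Dict[str, bool]) -> bool:
    return spec.get('exposed', True)


def is_indexed(spec: Dict[str, bool]) -> bool:
    # exposed: false wins, so indexed only matters for exposed fields
    return is_exposed(spec) and spec.get('indexed', True)


def normalize_entity(entity: Dict[str, Any],
                     schema: Dict[str, Dict[str, bool]] = SCHEMA) -> Dict[str, Any]:
    # Shape of the regular GET response: every exposed property
    return {k: v for k, v in entity.items() if k in schema and is_exposed(schema[k])}


def normalize_document(entity: Dict[str, Any],
                       schema: Dict[str, Dict[str, bool]] = SCHEMA) -> Dict[str, Any]:
    # Shape of the index document: exposed properties that are also indexed
    return {k: v for k, v in entity.items() if k in schema and is_indexed(schema[k])}


entity = {'uuid': 'abc', 'description': 'demo', 'internal_flag': True}
print(normalize_entity(entity))    # {'uuid': 'abc', 'description': 'demo'}
print(normalize_document(entity))  # {'uuid': 'abc'}
```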
AFTER the new features get tested, switch to use this new endpoint in search-api: https://github.com/hubmapconsortium/search-api/issues/756