hubmapconsortium / search-api

HuBMAP search service and associated pieces to create an index
https://search.api.hubmapconsortium.org
MIT License

Remove top-level copied field `files` #805

Closed yuanzhou closed 1 month ago

yuanzhou commented 3 months ago

During index/reindex runtime, the Dataset ingest_metadata.files field gets copied to a top-level field `files`. (The Dataset field ingest_metadata gets renamed to metadata during index runtime.) Recently we've come across some datasets that contain a large number of ingest_metadata.files entries. For instance, fbf3af732f53b00f20a9ecc1ecc3c52b has a payload size of 2 MB.
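To make the duplication concrete, here is a minimal sketch of the behavior described above and of dropping the top-level copy, as the issue title proposes. It is not the actual search-api code: the function names (`transform_for_index`, `drop_top_level_files`, `payload_size`) and the sample file entries (`rel_path`, `size`) are assumptions made up for illustration.

```python
import copy
import json


def transform_for_index(entity: dict) -> dict:
    """Hypothetical stand-in for the index-time transformation."""
    doc = copy.deepcopy(entity)
    # ingest_metadata is renamed to metadata during index runtime.
    if "ingest_metadata" in doc:
        doc["metadata"] = doc.pop("ingest_metadata")
    # The files list is then copied to a top-level `files` field.
    files = doc.get("metadata", {}).get("files")
    if files is not None:
        doc["files"] = copy.deepcopy(files)
    return doc


def drop_top_level_files(doc: dict) -> dict:
    """Return a copy of the document without the top-level `files` duplicate."""
    return {key: value for key, value in doc.items() if key != "files"}


def payload_size(doc: dict) -> int:
    """Size in bytes of the JSON-serialized document."""
    return len(json.dumps(doc).encode("utf-8"))


# Made-up Dataset with many file entries (field names are illustrative only).
sample = {
    "uuid": "example-uuid",
    "entity_type": "Dataset",
    "ingest_metadata": {
        "files": [{"rel_path": f"file_{i}.tiff", "size": 1024} for i in range(1000)]
    },
}

indexed = transform_for_index(sample)
slimmed = drop_top_level_files(indexed)

# The slimmed document still carries metadata.files, only the copy is gone.
print(payload_size(indexed), payload_size(slimmed))
```

In this sketch the files list dominates the document, so removing the duplicate roughly halves the serialized payload while metadata.files stays available to consumers that read it.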


Such duplicates have caused:

We should remove the original one and only keep the copied version.

yuanzhou commented 3 months ago

@lchoy @john-conroy @NickAkhmetov @bherr2 will this change affect how any of your UIs handle these fields?

john-conroy commented 3 months ago

Having the files at the top level of the doc would break our UI and require some work in the portal-ui.

bherr2 commented 3 months ago

We read from metadata.files. Does this affect that?

PS. Here are the fields we query for / use: https://github.com/hubmapconsortium/ccf-ui/blob/main/projects/ccf-database/src/lib/xconsortia/xconsortia-data-import.ts#L17-L38

yuanzhou commented 3 months ago

@john-conroy @bherr2 does this mean the portal-ui and ccf-ui are not consuming the top-level files (copied from metadata.files) at all?

bherr2 commented 3 months ago

On ccf-ui side, that's correct.

yuanzhou commented 3 months ago

@bherr2 @john-conroy if you are sure you don't use the top-level files field, we plan to remove it. Is that fine with you?

There will be additional upcoming changes to the Dataset metadata.files and metadata.metadata in the near future. We'll discuss and come up with a plan.

bherr2 commented 3 months ago

Fine by me

john-conroy commented 3 months ago

I'll have to look through our repos before I can fully confirm.

yuanzhou commented 1 month ago

Closing this issue; we will handle this separately.