hubmapconsortium / search-api

HuBMAP search service and associated pieces to create an index
https://search.api.hubmapconsortium.org
MIT License

Test MANY-to-one dataset test case #788

Closed: shirey closed this issue 1 month ago

shirey commented 4 months ago

Create a test case where a single dataset is connected to 2500 parent (direct ancestor) datasets, then test the reindex of this single dataset.

yuanzhou commented 3 months ago
  1. Create a new Donor b835cb1865b044aac1220181ff19e2e5 (HBM344.GLSM.884) via Ingest Portal on DEV
  2. Create an organ f066e9ae118f0d3cf1fc8c77b8533fb9 (HBM873.FDBJ.735) from this Donor via Ingest Portal on DEV
  3. Create a new dataset from the organ 2500 times, using a script that calls curl
  4. Tried with 2000, 1500, 1000, and 500 parent_ids; each attempt still timed out

Finally, a request with 400 parent_ids got through uuid-api and a new ID was created, but entity-api still returned a 504. The Neo4j backend did eventually complete the write, though, and created Dataset 9dc2de0838a4fd3e960b35424788950d with 400 direct ancestors.

https://ingest.dev.hubmapconsortium.org/dataset/9dc2de0838a4fd3e960b35424788950d
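The step-down through parent counts described above could be scripted roughly as follows. This is a hypothetical sketch, not the script that was actually run: `build_dataset_payload`, `largest_passing_size`, and the `direct_ancestor_uuids` field name are assumptions for illustration, and the `attempt` callable stands in for whatever sends the request to DEV (for example a curl call).

```python
from typing import Callable, Iterable, Optional, Sequence


def build_dataset_payload(parent_ids: Sequence[str]) -> dict:
    """Build a request body for one dataset with many direct ancestors.

    The field name is an assumption, not confirmed anywhere in this thread.
    """
    return {"direct_ancestor_uuids": list(parent_ids)}


def largest_passing_size(
    parent_ids: Sequence[str],
    sizes: Iterable[int],
    attempt: Callable[[dict], bool],
) -> Optional[int]:
    """Try each size from largest to smallest and return the first that succeeds.

    `attempt` receives the payload and returns True on success,
    False on a timeout or an error response such as a 504.
    """
    for size in sorted(sizes, reverse=True):
        payload = build_dataset_payload(parent_ids[:size])
        if attempt(payload):
            return size
    return None


# Stand-in for the real request: pretend anything above 400 parents times out.
fake_parents = [f"parent-{i}" for i in range(2500)]
fake_attempt = lambda body: len(body["direct_ancestor_uuids"]) <= 400
print(largest_passing_size(fake_parents, [2500, 2000, 1500, 1000, 500, 400], fake_attempt))  # prints 400
```

Note that in the run reported here a 504 from entity-api did not mean the write failed, so a real `attempt` would also need to check whether the dataset was eventually created in Neo4j.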

Two issues will need to be addressed first:

Afterwards, we will need to test again to see how many direct ancestors uuid-api can handle without timing out.

yuanzhou commented 2 months ago

The timeout issue with the provenance call has been addressed.

yuanzhou commented 1 month ago

7/11/2024

With the uuid-api improvements that @kburke made, I was able to create a new set of IDs using the 2500 parent IDs in 7.5 seconds on DEV, although this bypassed entity-api.

{
    "uuid": "0a00ec1ded41e861ef869387fedc7106",
    "hm_uuid": "0a00ec1ded41e861ef869387fedc7106",
    "hubmap_base_id": "777RVBL484",
    "hubmap_id": "HBM777.RVBL.484"
}
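The relationship between the fields in this response can be checked mechanically. The sketch below infers the `hubmap_id` layout (`HBM` followed by the base ID split into groups of 3, 4, and 3 characters) from this single response only, so treat it as an assumption; `format_hubmap_id` is a hypothetical helper, not a uuid-api function.

```python
import json


def format_hubmap_id(base_id: str) -> str:
    """Format a 10-character base ID as HBMxxx.xxxx.xxx (layout inferred, not documented here)."""
    if len(base_id) != 10:
        raise ValueError(f"expected a 10-character base id, got {base_id!r}")
    return f"HBM{base_id[:3]}.{base_id[3:7]}.{base_id[7:]}"


# The response shown above, pasted as-is.
response = json.loads("""
{
    "uuid": "0a00ec1ded41e861ef869387fedc7106",
    "hm_uuid": "0a00ec1ded41e861ef869387fedc7106",
    "hubmap_base_id": "777RVBL484",
    "hubmap_id": "HBM777.RVBL.484"
}
""")

# Both identifier pairs in the response agree with each other.
assert response["uuid"] == response["hm_uuid"]
assert format_hubmap_id(response["hubmap_base_id"]) == response["hubmap_id"]
```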

Will tackle the next one: Improve existence checks on direct ancestors in entity-api.
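One way to make existence checks cheaper is to look up all direct ancestors in a single query and compare the result with the requested list, rather than checking each ID separately. This is a sketch under assumptions, not the change made in entity-api: the `Entity` label, the `uuid` property, and both helper names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical Cypher: one round trip returns every requested uuid that exists.
EXISTENCE_QUERY = (
    "MATCH (e:Entity) "
    "WHERE e.uuid IN $uuids "
    "RETURN e.uuid AS uuid"
)


def existence_query(requested: Iterable[str]) -> Tuple[str, dict]:
    """Return the query text and parameters for a single batched lookup."""
    # dict.fromkeys removes duplicates while keeping the request order.
    return EXISTENCE_QUERY, {"uuids": list(dict.fromkeys(requested))}


def missing_ancestors(requested: Iterable[str], found: Iterable[str]) -> List[str]:
    """Return the requested uuids that are absent from the query result, in request order."""
    found_set = set(found)
    return [u for u in dict.fromkeys(requested) if u not in found_set]
```

With this shape, validating 2500 parents costs one database round trip plus an in-memory set comparison, and an empty result from `missing_ancestors` means every parent exists.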

yuanzhou commented 1 month ago

7/17/2024, with the validation improvement made in entity-api via https://github.com/hubmapconsortium/entity-api/pull/700, @kburke was able to create a new dataset on DEV with 100 parents. It still took 24 seconds...

Adding a few more parents should easily tip it past the AWS Gateway timeout, which I believe is attributable to the Activity relationship creation loops we just discussed.
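If the cost does come from creating relationships one parent at a time, a common Cypher pattern is to pass the parent list as a parameter and `UNWIND` it, so that each query creates many relationships at once. The sketch below only builds the query text and parameters. The `Activity` and `Entity` labels, the `ACTIVITY_INPUT` relationship type, the batch size, and the function names are assumptions, not taken from the entity-api code.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical schema: (parent:Entity)-[:ACTIVITY_INPUT]->(activity:Activity)
BATCH_LINK_QUERY = (
    "MATCH (a:Activity {uuid: $activity_uuid}) "
    "UNWIND $parent_uuids AS parent_uuid "
    "MATCH (p:Entity {uuid: parent_uuid}) "
    "MERGE (p)-[:ACTIVITY_INPUT]->(a)"
)


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def batched_link_queries(
    activity_uuid: str,
    parent_uuids: Sequence[str],
    batch_size: int = 500,
) -> List[Tuple[str, dict]]:
    """Return (query, parameters) pairs, one per batch of parents.

    Each pair links a whole batch of parents to the activity in one query,
    instead of issuing one query per parent.
    """
    return [
        (BATCH_LINK_QUERY, {"activity_uuid": activity_uuid, "parent_uuids": batch})
        for batch in chunked(parent_uuids, batch_size)
    ]
```

For 2500 parents and a batch size of 500 this yields 5 queries rather than 2500. Whether that brings the request under the gateway timeout would still need to be measured on DEV.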