microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

response to a request for a data object that a curator has determined should be removed #530

Open dwinston opened 1 month ago

dwinston commented 1 month ago

Perhaps we should arrange for 410 Gone responses for the removed data, as a signal to anyone trying to resolve the bad data that their supplied ID was correct, but that NMDC found an issue with the data?

Context: This use case was just brought up at the weekly sync.

/cc @aclum @shreddd

aclum commented 1 month ago

Can you use the nmdc_deleted database to do this? I've been trying to make sure deletions go through queries:run.

PeopleMakeCulture commented 1 month ago

> Can you use the nmdc_deleted database to do this? I've been trying to make sure deletions go through queries:run.

@aclum Correct, data objects that are deleted through queries:run will currently go into the nmdc_deleted database.

There are a couple of issues with this:

  1. This could lead to referential integrity issues if other documents reference the deleted object's id.
  2. The current approach does not differentiate between a record that cannot be found because the id does not exist and a record that does exist but was removed.
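To make the second point concrete, here is a minimal sketch of a lookup that tells "never existed" apart from "existed but was removed", using the 410 Gone idea from the top of this thread. The two dicts are hypothetical in-memory stand-ins for the live collection and the nmdc_deleted database, and `resolve_object` and the example ids are made-up names, not anything in nmdc-runtime today.

```python
from http import HTTPStatus

# Hypothetical stand-ins for the live collection and the nmdc_deleted database.
LIVE = {"nmdc:dobj-1": {"id": "nmdc:dobj-1", "name": "reads.fastq.gz"}}
DELETED = {"nmdc:dobj-2": {"id": "nmdc:dobj-2", "name": "bad_reads.fastq.gz"}}


def resolve_object(object_id, live=LIVE, deleted=DELETED):
    """Return (status, body) for a lookup by id."""
    if object_id in live:
        return HTTPStatus.OK, live[object_id]
    if object_id in deleted:
        # The supplied id was correct, but a curator removed the record.
        return HTTPStatus.GONE, {"detail": f"{object_id} was removed by NMDC"}
    # The id is unknown to both stores.
    return HTTPStatus.NOT_FOUND, {"detail": f"{object_id} does not exist"}
```

With this shape, a client resolving a removed id gets a 410 rather than a 404, so it can tell a typo from a curated removal.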

Here are a few ways to fix the ref_integrity issue:

  1. Leave deleted objects in the original database but suppress them with a flag (e.g., _removed). This would require updating all endpoints that get objects by id (e.g., GET /objects/{object_id}) to check for a _removed flag.

  2. Continue to move deleted objects to the nmdc_deleted database, but leave a marker in every document that references the deleted object's id, so that the ref_integrity checker knows not to look for that object. For example, we could add an internal _deleted_ids field to each document: whenever an object is deleted, we would find all other documents that reference its id and append that id to each document's _deleted_ids list.
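A rough sketch of option 2, assuming plain dicts in place of the real databases and assuming references live in list-valued fields such as has_input and has_output (a simplification; the real checker would need to walk every reference slot). `delete_object`, `dangling_references`, and `REF_FIELDS` are hypothetical names for illustration only.

```python
# Assumed, simplified set of fields that hold references to other documents.
REF_FIELDS = ("has_input", "has_output")


def referenced_ids(doc):
    """Collect every id that doc points at through the assumed reference fields."""
    return [ref for field in REF_FIELDS for ref in doc.get(field, [])]


def delete_object(object_id, live, deleted):
    """Move the object to the deleted store and mark documents that reference it."""
    deleted[object_id] = live.pop(object_id)
    for doc in live.values():
        if object_id in referenced_ids(doc):
            marked = doc.setdefault("_deleted_ids", [])
            if object_id not in marked:
                marked.append(object_id)


def dangling_references(doc, live):
    """Referenced ids that are neither live nor recorded in _deleted_ids."""
    ignored = set(doc.get("_deleted_ids", []))
    return [ref for ref in referenced_ids(doc) if ref not in live and ref not in ignored]
```

The checker only reports ids that are missing and not listed in _deleted_ids, so a curated deletion no longer shows up as an integrity failure while a genuinely broken reference still does.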

With either approach, we would want to have some message for the user to know that the record was suppressed.
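For option 1, the flag check and the user-facing message could sit in one shared helper that the id-based endpoints call. This is only a sketch: `fetch_by_id`, the message wording, and the dict-backed collection are assumptions, and stripping underscore-prefixed fields from the response is one possible choice rather than existing behavior.

```python
from http import HTTPStatus

REMOVED_MESSAGE = "This record was removed by NMDC curators."  # hypothetical wording


def fetch_by_id(object_id, collection):
    """Shared lookup that id-based endpoints could call instead of reading the collection directly."""
    doc = collection.get(object_id)
    if doc is None:
        return HTTPStatus.NOT_FOUND, {"detail": f"No record with id {object_id}."}
    if doc.get("_removed", False):
        # The record exists but is suppressed; tell the user why nothing is returned.
        return HTTPStatus.GONE, {"detail": REMOVED_MESSAGE}
    # Strip internal underscore-prefixed fields before returning the record.
    return HTTPStatus.OK, {k: v for k, v in doc.items() if not k.startswith("_")}
```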

aclum commented 1 month ago

We could/should also look at alternative identifier slots; this would help redirect users who have a legacy or outdated ID. We could use the schema hierarchy to do this. https://microbiomedata.github.io/nmdc-schema/alternative_identifiers/
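As a sketch of what the redirect could look like, assuming documents carry their legacy ids in a list-valued alternative_identifiers slot: `ALT_ID_SLOTS`, `build_alias_index`, and `resolve_id` are hypothetical names, and the tuple of slots could be extended with whatever sub-slots the schema hierarchy defines.

```python
# Slots to scan for legacy ids; extend with schema sub-slots as needed (assumption).
ALT_ID_SLOTS = ("alternative_identifiers",)


def build_alias_index(docs, slots=ALT_ID_SLOTS):
    """Map each alternative identifier to the current id of the document that lists it."""
    index = {}
    for doc in docs:
        for slot in slots:
            for alt_id in doc.get(slot, []):
                index[alt_id] = doc["id"]
    return index


def resolve_id(requested_id, docs_by_id, alias_index):
    """Return the current id for requested_id, or None if it is unknown."""
    if requested_id in docs_by_id:
        return requested_id
    return alias_index.get(requested_id)
```

A caller that gets back a different id than it asked for could answer with a redirect to the current record instead of a 404.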

PeopleMakeCulture commented 1 month ago

> Leave deleted objects in the original database but suppress them with a flag (e.g., _removed). This would require updating all endpoints that get objects by id (e.g., GET /objects/{object_id}) to check for a _removed flag.

The qc_status slot on OmicsProcessing and WorkflowExecutionActivity objects should serve this function. Documents where qc_status == 'failed' should be suppressed by default from search results in portal and API search.
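A minimal sketch of that default suppression, assuming search results are plain dicts and that a missing qc_status means the record is fine to show. `visible_results` and the `include_failed` switch are hypothetical names, not existing portal or API parameters.

```python
def visible_results(docs, include_failed=False):
    """Drop documents whose qc_status is 'failed' unless the caller opts in."""
    if include_failed:
        return list(docs)
    return [doc for doc in docs if doc.get("qc_status") != "failed"]
```

Keeping an explicit opt-in would let curators still find failed records when they need to review them.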