nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

No validation that submitted content matches the --dcc-id passed with cfde-submit #158

Closed ACharbonneau closed 2 years ago

ACharbonneau commented 3 years ago

I submitted the GTEx data as GTEx, then submitted exactly the same data as LINCS. The question I was trying to answer is “What breaks if someone who is a submitter at two DCCs submits with the wrong DCC ID?” The answer is that basically nothing breaks, except that the titles don’t match in the UI. If you open the two submissions side by side, you can see it’s exactly the same data; only the titles differ, depending on where you look at the title:

https://user-images.githubusercontent.com/1719360/109216482-94c00c80-7782-11eb-8cc4-13b79a1bb9bd.mov

From @karlcz: To summarize some discussion started in the automation channel, the core UX question here concerns the different expectations one might have:

  1. If we had more content validation rules, we might have rejected this GTEx data when it was submitted under the auspices of LINCS (the UX stops with an error before getting this far).
  2. The current dcc-review page gets its title from the review catalog content (the primary dcc contact page table). Browsing into the submission would show the same info, available as project and primary dcc contact info.
  3. The submission system knows a different submitting_dcc, which was declared via the submission tool but isn't used in the page title.

karlcz commented 3 years ago

If we do want to make this a strict enforcement requirement during ingest, I think we need to add the project_id_namespace and project_id fields to the onboarding DCC table and then require that the single primary_dcc_contact in a submission match these values at the time of submission.

This raises interesting lifecycle behaviors if we allow the value to be edited in the registry, e.g. it would be enforced at time of submission and the currently registered value only affects future submissions. Not sure if we'd want to allow any DCC staff to edit this value via self-service or require CFDE-CC staff involvement in each change?
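
A minimal sketch of that check, purely for illustration and not taken from cfde-deriva. The `RegisteredDcc` class, the `check_primary_dcc_contact` function, the registry layout, and the example namespace and project values are all hypothetical; the column names follow the wording of this comment and may differ from the real C2M2 schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredDcc:
    """Hypothetical onboarding-registry row for one DCC."""
    project_id_namespace: str
    project_id: str


def check_primary_dcc_contact(submitting_dcc, registry, contact_rows):
    """Return a list of error strings; an empty list means the check passed.

    registry maps a DCC ID to its RegisteredDcc row (assumed layout).
    contact_rows are the primary_dcc_contact rows of the submission, as dicts.
    """
    registered = registry.get(submitting_dcc)
    if registered is None:
        return [f"unknown submitting DCC {submitting_dcc!r}"]
    if len(contact_rows) != 1:
        return [
            "expected exactly 1 primary_dcc_contact row, "
            f"found {len(contact_rows)}"
        ]
    row = contact_rows[0]
    found = (row.get("project_id_namespace"), row.get("project_id"))
    expected = (registered.project_id_namespace, registered.project_id)
    if found != expected:
        return [
            f"primary_dcc_contact project {found} does not match "
            f"{expected} registered for {submitting_dcc!r}"
        ]
    return []


# Made-up registry content and submission rows, for illustration only.
registry = {
    "GTEx": RegisteredDcc("example:gtex", "gtex_root"),
    "LINCS": RegisteredDcc("example:lincs", "lincs_root"),
}
gtex_rows = [{"project_id_namespace": "example:gtex", "project_id": "gtex_root"}]

ok = check_primary_dcc_contact("GTEx", registry, gtex_rows)       # no errors
wrong = check_primary_dcc_contact("LINCS", registry, gtex_rows)   # one error
```

Because the function reads the registry only when it is called, an edit to the registered values would affect future submissions and leave earlier ones untouched, which is the lifecycle behavior described above.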

karlcz commented 3 years ago

Revisiting this, I think we should also add something like cfde_dcc_id to the C2M2 primary_dcc_contact table definition and require that this match the submitting_dcc during ingest.

karlcz commented 3 years ago

With the evolving plans for including a stable DCC ID in the portal catalogs, I think the best approach here would be:

  1. Add a dcc table to C2M2 as a replacement for primary_dcc_contact and include the CFDE-issued DCC ID as the id primary key, while still having the contact info and the foreign key to a project representing the DCC
  2. Have a (temporary) transitional mechanism to populate this during ingest of legacy C2M2 submissions
  3. Add a validation check that the submission populated only 1 dcc entry and that its id matches the submitting DCC

karlcz commented 3 years ago

This is included in pending changes, but it will not detect mistakes in legacy submissions that still use the primary_dcc_contact table: for those, the ingest process injects the submitting DCC id itself, so the check will always match.
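
A rough sketch of why the legacy path cannot fail the check, under assumed names. `validate_dcc_rows` and `inject_legacy_dcc` are hypothetical helpers, not functions from cfde-deriva, and the row contents are made up.

```python
def validate_dcc_rows(submitting_dcc, dcc_rows):
    """Return error strings; an empty list means the submission passes.

    Requires exactly one dcc row whose id equals the submitting DCC.
    """
    if len(dcc_rows) != 1:
        return [f"expected exactly 1 dcc row, found {len(dcc_rows)}"]
    found = dcc_rows[0].get("id")
    if found != submitting_dcc:
        return [f"dcc id {found!r} does not match submitting DCC {submitting_dcc!r}"]
    return []


def inject_legacy_dcc(submitting_dcc, primary_dcc_contact_rows):
    """Assumed transitional step: turn legacy primary_dcc_contact rows into
    dcc rows by copying each row and setting id to the submitting DCC."""
    return [dict(row, id=submitting_dcc) for row in primary_dcc_contact_rows]


# A submission that declares its own dcc id is caught when sent as LINCS.
new_style_rows = [{"id": "GTEx", "contact_name": "example contact"}]
new_style_errors = validate_dcc_rows("LINCS", new_style_rows)

# A legacy submission carries no id, so ingest injects the submitting DCC
# and the same mistake passes validation.
legacy_rows = [{"contact_name": "example contact"}]
legacy_errors = validate_dcc_rows("LINCS", inject_legacy_dcc("LINCS", legacy_rows))
```

Here `new_style_errors` holds one mismatch message while `legacy_errors` is empty, even though both submissions contain GTEx data sent under the LINCS ID.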