The UX thinking on the `dbgap_study_id` field has evolved a little. Along with showing the value and using it for visual feedback of user-specific access rights to files in dbGaP, we have added a search facet.
The C2M2 field only carries a raw ID of the form `phs000...`, but the dbGaP website has more information available for these IDs. If we add a vocabulary-like table to the portal model, we could import additional information and present this column more consistently with other controlled vocabularies, i.e. track `id`, `name`, and `description` fields and give a richer value picker.
However, to avoid making the system overly dependent on the dbGaP website during catalog builds, we should settle on an asynchronous/caching model that discovers and accumulates new study information during the submission and review process and stores it within the portal system, so that release catalogs can be built reliably.
- [x] Add a `dbgap_study_id` table to the portal and registry models as a sibling to other vocabularies
- [x] Digest the submission's `file.dbgap_study_id` column as a source of relevant `dbgap_study_id.id` keys
- [x] Refactor the portal's `file.dbgap_study_id` field as a foreign key to the `dbgap_study_id` "term" table
- ~~During the ingest process, interrogate the dbGaP web resources to learn corresponding name/description info for new keys~~
- [x] Include a `dbgap_study_id.tsv` vocab in the ingest process, similar to other built-in ontology info
- [x] Store the results into the registry ~~for reuse~~
- ~~Consume the stored values when building release catalogs for more deterministic behavior~~
- ~~(optional) Add an admin CLI to poll the dbGaP web resources and refresh info for terms we have already learned?~~
- [x] (optional) Refactor the `dbgap_study_id` column and search mechanisms into `core_fact` like other file metadata fields?
- [x] (optional) Add UX feedback for "failed" dbGaP study IDs? I.e. where the dbGaP website doesn't know about a study ID?
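The first three checklist items can be sketched as a small relational layout. This is only an illustration using SQLite; the actual portal model is a deriva/ERMrest catalog, and the exact column set there may differ. The `unknown` placeholder name follows the fallback encoding described later in this thread.

```python
import sqlite3

# Vocabulary-style layout for dbGaP study IDs, sketched in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- dbgap_study_id acts as a sibling to other vocabulary "term" tables,
-- tracking id, name, and description for each study.
CREATE TABLE dbgap_study_id (
    id TEXT PRIMARY KEY,     -- raw accession, e.g. 'phs000123'
    name TEXT NOT NULL,      -- study title, or 'unknown' placeholder
    description TEXT         -- NULL until extra info is obtained
);

-- file.dbgap_study_id becomes a foreign key to the term table.
CREATE TABLE file (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    dbgap_study_id TEXT REFERENCES dbgap_study_id(id)
);
""")

# A submitted ID without canonical info gets the degraded placeholder row.
conn.execute("INSERT INTO dbgap_study_id VALUES ('phs000123', 'unknown', NULL)")
conn.execute("INSERT INTO file (filename, dbgap_study_id) VALUES ('a.bam', 'phs000123')")
row = conn.execute("""
    SELECT f.filename, t.id, t.name
    FROM file f JOIN dbgap_study_id t ON f.dbgap_study_id = t.id
""").fetchone()
print(row)  # ('a.bam', 'phs000123', 'unknown')
```

The foreign key is what lets the search facet treat study IDs like any other controlled vocabulary while still tolerating IDs that lack enrichment.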
The portal and registry `dbgap_study_id` tables should represent asynchronous states where we have discovered a study ID but have not yet obtained extra information. If retrieval fails, we might show these values in a degraded state to allow review catalogs to be built with the best available information.
~~The optional CLI might be useful to run periodically (e.g. prior to releases) to gather any updated study titles or descriptions for study IDs we have interrogated in the past. Meanwhile, the core discovery process during ingest can use the registry as a cache to reduce potentially frequent/bursty interrogation of the same IDs during DCC resubmission actions.~~
The optional refactoring with `core_fact` is a portal optimization, whose details and/or benefit would depend on the statistical correlation of these study IDs with other controlled vocabulary usage.
The optional feedback for failed study IDs could be important during the submission review process, alerting DCC submitters and reviewers that there may be a problem with their `file` table content.
The design above has been tweaked to follow the same general approach as with other vocabularies in the portal.
- An offline process is used to extract canonical study information from dbGaP and include it in the cfde-deriva repo as a TSV file.
- The registry tracks usage of study IDs by individual submissions.
- When a submission uses a study ID not defined by the canonical TSV, it is encoded with a static study name `unknown`.
- The UX displays both the study ID and the study name, so that known values are enriched but unknown values can still be seen or selected in the filtering facet.
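The TSV-based resolution with the `unknown` fallback can be sketched as follows. This assumes the vocabulary TSV carries `id`, `name`, and `description` columns; `load_vocab` and `resolve_study` are hypothetical helper names for illustration, not actual cfde-deriva functions, and the sample study title is invented.

```python
import csv
import io

def load_vocab(tsv_text):
    """Parse a dbgap_study_id.tsv-style vocabulary keyed by study ID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["id"]: row for row in reader}

def resolve_study(study_id, vocab):
    """Return the canonical term if known, else a degraded 'unknown' entry."""
    return vocab.get(study_id, {"id": study_id, "name": "unknown", "description": ""})

# Hypothetical canonical TSV content, as extracted offline from dbGaP.
canonical = load_vocab(
    "id\tname\tdescription\n"
    "phs000123\tExample Study\tIllustrative study record\n"
)
print(resolve_study("phs000123", canonical)["name"])  # Example Study
print(resolve_study("phs999999", canonical)["name"])  # unknown
```

Because unresolved IDs still produce a well-formed term, catalog builds never block on dbGaP availability, and the facet can show unknown studies alongside enriched ones.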