nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

Refactor dbGaP Study ID UX as a pseudo vocabulary #360

Closed by karlcz 1 year ago

karlcz commented 2 years ago

The UX thinking on the dbgap_study_id field has evolved a little. Along with showing the value and using it for visual feedback of user-specific access rights to files in dbGaP, we have added a search facet.

The C2M2 field only carries a raw ID of the form phs000..., but the dbGaP website has more information available for these IDs. If we add a vocabulary-like table to the portal model, we could import that additional information and present this column more consistently with other controlled vocabularies: track id, name, and description fields, and offer a richer value picker.

However, to avoid making the system overly dependent on the dbGaP website during catalog builds, we should probably determine an asynchronous/caching model to discover and accumulate new study information during the submission and review process and store it within the portal system so that release catalogs can be built reliably...

The portal and registry dbgap_study table should represent asynchronous states where we have discovered a study ID but have not yet obtained extra information. If retrieval fails, we might show these values in a degraded state in order to allow review catalogs to be built with best available information.
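
To make the states above concrete, here is a minimal sketch of how a dbgap_study row might carry its retrieval state and degrade gracefully in the UX. The names `StudyState`, `DbgapStudy`, and `display_label`, the three state values, and the ID `phs000000` are all hypothetical placeholders for illustration, not the actual portal or registry schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StudyState(Enum):
    """Hypothetical retrieval states for a study ID."""
    DISCOVERED = "discovered"  # ID seen in a submission, not yet looked up
    RETRIEVED = "retrieved"    # extra information obtained and stored
    FAILED = "failed"          # lookup attempted but did not succeed


@dataclass
class DbgapStudy:
    """Hypothetical row shape: raw ID plus optional enrichment."""
    id: str
    state: StudyState = StudyState.DISCOVERED
    name: Optional[str] = None
    description: Optional[str] = None


def display_label(study: DbgapStudy) -> str:
    """Return picker text, falling back to the raw ID when not enriched."""
    if study.state is StudyState.RETRIEVED and study.name:
        return f"{study.id}: {study.name}"
    # Degraded state: show the best available information, the raw ID.
    return study.id
```

With this shape, a review catalog can still be built when a lookup is pending or has failed, because every row has at least its ID to display.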

~The optional CLI might be useful to run periodically (e.g. prior to releases) and gather any updated study titles or descriptions for study IDs we have interrogated in the past. Meanwhile, the core discovery process during ingest can use the registry as a cache to reduce potentially frequent/bursty interrogation of the same IDs during DCC resubmission actions.~

The optional refactoring with core_fact is a portal optimization, the details and/or benefit of which would depend on the statistical correlation of these study IDs with other controlled vocabulary usage.

The optional feedback for failed study IDs might be important during a submission review process, to alert DCC submitters and reviewers that there might be a problem with their file table content.

karlcz commented 2 years ago

The design above has been tweaked to follow the same general approach as with other vocabularies in the portal.

  1. An offline process is used to extract canonical study information from dbGaP and include it in the cfde-deriva repo as a TSV file.
  2. The registry tracks usage of study IDs by individual submissions.
  3. When a submission uses a study ID not defined by the canonical TSV, it is encoded with the static study name "unknown".
  4. The UX displays both the study ID and the study name, so that known values are enriched but unknown values can still be seen or selected in the filtering facet.
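
The lookup and fallback in steps 1, 3, and 4 can be sketched roughly as follows. This is only an illustration under assumptions: the TSV column names `id` and `name`, the helper names, the label format, and the example IDs are hypothetical, and the real TSV layout in cfde-deriva may differ.

```python
import csv
import io

# Static name used for study IDs missing from the canonical TSV (step 3).
UNKNOWN_NAME = "unknown"


def load_canonical_studies(tsv_text):
    """Parse TSV text into an {id: name} mapping (assumed column names)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["id"]: row["name"] for row in reader}


def study_name(study_id, canonical):
    """Return the canonical name, or the static fallback if undefined."""
    return canonical.get(study_id, UNKNOWN_NAME)


def facet_label(study_id, canonical):
    """Show both ID and name so unknown values stay selectable (step 4)."""
    return f"{study_id} ({study_name(study_id, canonical)})"


# Made-up example data; not real dbGaP content.
EXAMPLE_TSV = "id\tname\nphs000000\tExample study\n"
canonical = load_canonical_studies(EXAMPLE_TSV)

print(facet_label("phs000000", canonical))  # phs000000 (Example study)
print(facet_label("phs000009", canonical))  # phs000009 (unknown)
```

Because the ID is always part of the label, two different unknown studies remain distinguishable in the facet even though they share the same static name.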

karlcz commented 2 years ago

A preview of this is available in the app-dev catalog "1".

karlcz commented 1 year ago

This was released already.