kids-first / kf-api-dataservice

:file_cabinet: Primary API for interacting with the Kids First data
http://kf-api-dataservice.kidsfirstdrc.org
Apache License 2.0
5 stars 3 forks source link

✨ Add visibility reason and comment columns #604

Closed znatty22 closed 1 year ago

znatty22 commented 1 year ago

Motivation

The visibility column on each table is a boolean field which indicates whether that row should be included or not in the study release process. The problem is we often don't know why something was set to visible or hidden because we don't track this information anywhere.

Approach

Hey everyone, I've collected all of the responses on the dataservice visibility PR and summarized the resulting changes that people seem to want. @allison if you or anyone has any other changes/suggestions, please let me know. I'll wait to hear back before I implement anything

Add new visibility_reason column

Restrict values to enumeration

Add new visibility_comment column

Examples:

Peddy issue due to relationship mismatch

visibility = false
visibility_reason = Sample Issue
visibility_comment = Peddy issue - relationship mismatch

Sample issue due to NGSCheckmate mismatch

visibility = false
visibility_reason = Sample Issue
visibility_comment = NGSCheckmate mismatch

Data is in embargo period

visibility = false
visibility_reason = Pre Release
visibility_comment = In embargo hold

Sequencing data file has low coverage

visibility = false
visibility_reason = Sequencing Quality Issue
visibility_comment = Low coverage data does not meet quality standards
znatty22 commented 1 year ago

From last week's discussion, change ready_for_release to staged_for_release

aw3334 commented 1 year ago

Maybe something like embargo_hold or pending_release: visibility is set to false due to 6 month embargo before data is released on portal

Christina-J-Diaz commented 1 year ago

Can we add enums for these 3 reasons:

thoughts?

youngnm commented 1 year ago

How about- nautilus_issue / missing_from_nautilus / or ? This is relevant for the weekly CBTN refresh. Samples that are no longer available in Nautilus will change from visible to hidden.

liberaliscomputing commented 1 year ago

This is what we have needed and thanks for working on this, @znatty22! I have a few thoughts:

  1. Enums casing: for other our custom enums, we capitalize the first character of every word (See, for example, https://github.com/kids-first/kf-api-dataservice/blob/master/dataservice/api/biospecimen/schemas.py#L19-L29) while this PR brings in snake casing. I wonder if we want to use uniform casing.
  2. FHIR resources have a property called meta.security where the following Value Set is bound: https://www.hl7.org/fhir/valueset-security-labels.html. We are migrating our data over to FHIR, and if we can map the above values to the existing ones in the Value Set resource, that seems ideal. If they are not completely mapped, we can extend this Value Set since the binding strength is Extensible (See https://www.hl7.org/fhir/resource.html#Meta).
chris-s-friedman commented 1 year ago

Maybe something like embargo_hold or pending_release: visibility is set to false due to 6 month embargo before data is released on portal

I like the pending_release value. This makes it more general that for this "thing" to be set visible, we are waiting for regulatory approval (e.g. dbgap approval)

baileyckelly commented 1 year ago
  1. Re "embargo" vs "pending release" : these are 2 different values imo. We can have files that are pending release (like new genomic file versions) that are not in embargo vs an entire sample/study/subject that is in embargo. I think it would be good to know the difference between these two things.
  2. Re "low-coverage": I think we could come up with a few file-based quality issues. Do we want to separate those out or do we want to bucket them all into "sequencing-quality-issue" or something? Idc - I guess i'm just thinking if we want to be specific by having "low coverage" then we should have other specific file quality issues OR call them all general "file-quality-issue"
  3. Re "peddy issue": Do we also want to call out ngs-check-mate issues or tumor-normal-mismatch? I also wonder if we could get more specific than "peddy issue": we know if its a gender mismatch and/or a family relationship issue vs are we hiding the whole family even though the proband just has a gender issue.
  4. Re "unknown": Should we also add "other"? Theres a difference between "unknown" (when we don't know why) and "other" where we know why but there is just not an enum.
  5. Please don't kill me for this suggestion: What do you think about adding another field that allows for a free text explanation? I can see that being helpful as I type out the above. Example 1: visibility-reason: peddy-issue visibility-comment: Father/Proband family relationship issue, hid entire family.

Example 2: Visibility-reason: in-embargo visibility-comment: expected release date 12/13/23

Example 3: visibility reason: file-quality-issue visibility comment: RNA below 50M read cutoff

znatty22 commented 1 year ago

What do you think about adding another field that allows for a free text explanation? I can see that being helpful as I type out the above.

I think this is a great idea @baileyckelly.

Keep in mind the following. If you think you will want to run queries to search for entities that have been hidden for specific reasons like ngs-check-mate then having a specific enum will make that a lot easier. If we have a more general enum value for these types of issues like peddy-issue, then you'd have to search for peddy-issue entities and then somehow, maybe manually, browse through the comments to figure out which are related to the ngs-check-mate issue. Although, I suppose you can always use a standard comment text for ngs-check-mate too though so maybe it won't be an issue

Christina-J-Diaz commented 1 year ago

Keep in mind the following. If you think you will want to run queries to search for entities that have been hidden for specific reasons like ngs-check-mate then having a specific enum will make that a lot easier. If we have a more general enum value for these types of issues like peddy-issue, then you'd have to search for peddy-issue entities and then somehow, maybe manually, browse through the comments to figure out which are related to the ngs-check-mate issue. Although, I suppose you can always use a standard comment text for ngs-check-mate too though so maybe it won't be an issue

Yea I think there's value in having a general enum for the reason and then a more specific free text column for an explanation that maybe we build common constants for, like @znatty22 suggested. Think this would help make querying easier and also be helpful for any reason that is marked as "Other".

In terms of querying, for example, this would be useful in cases where we would want to find all peddy issues -- regardless of the specific reason, or all data that is pending a release and then use the comment section to further filter if needed.

I would imagine something like: general enum: peddy_issue specific comment:gender mismatch or relationship mismatch

Also to the point of releases -- @baileyckelly how do you feel about the following? general enum: pre_release specific comment: embargo,pending new versions..etc.