ncihtan / htan-portal

The HTAN Data Portal
https://humantumoratlas.org

CHOP cases are not showing up #204

Closed: inodb closed 3 years ago

inodb commented 3 years ago

Cases tab is empty:

[Screenshot: empty Cases tab, 2021-04-01]
adamabeshouse commented 3 years ago

Seems like it's probably a data issue: there's no diagnosis showing up for these files.

alisman commented 3 years ago

@inodb @adamabeshouse Maybe we should throw some errors when we find broken data. This would be very easy to do as part of the "lineage" routine that we run on load. It would be nice to catch this in the validation stage, but someone would probably have to write much the same logic as we've already written to catch it.

adamabeshouse commented 3 years ago

@alisman So every file should have at least one diagnosis associated with it? What other data validation rules can we think of?

Personally, I think it would be better to catch it in validation so it can be fixed before it is pushed and goes live.

inodb commented 3 years ago

Yeah, agreed that we should probably try to catch this in validation. I got in touch with them to update the diagnosis data.

inodb commented 3 years ago

Note that data portal requirements (such as needing at least diagnosis clinical data) are sort of another level of validation specific to the data portal, but it's still probably better to keep them where the other validation is done as well.

adamabeshouse commented 3 years ago

@alisman further thought - if we need to write the same logic in validation as we would in frontend to catch this, maybe we should actually just go ahead and do that data processing in the validation/import stage.

inodb commented 3 years ago

There are two levels of validation at the moment:

  1. On the Sage Bionetworks side, data is validated on submission (https://github.com/Sage-Bionetworks/schematic), so people submitting the data immediately see an error if, e.g., a field is malformed. Currently there is no higher-level validation there that looks at combinations of multiple supplied files (e.g. where we need the entire tree of biospecimen dependencies).
  2. In hdash from Ethan: http://htan_dashboard.surge.sh/ (https://github.com/ncihtan/hdash). Here we can do validation on combinations of files, e.g. checking that all the biospecimen dependencies of a file exist (a rough sketch of that kind of check follows below).

Then there are also two normalization/transformation steps (continuing the numbering, since the references below run 1-4):

  3. In the get_syn_data.py script, where we pull the data from Synapse and massage it into the JSON that the frontend uses.
  4. In the frontend, where we sort of flatten the biospecimen dependency tree so we can assign things like "cancer type" at the file level.

So it is use-case specific where the validation/normalization makes the most sense. In this case, because it's about missing diagnosis data and not some missing field, we could flag it in e.g. (2). Alternatively we could put it in (3), since it's data-portal specific, and with (1)-(3) all being coded in Python it wouldn't be too complex to move around anyway. That would make (3) more of a combined validation/normalization step.

adamabeshouse commented 3 years ago

Thanks for the thorough rundown. In my opinion, for this case we should do it in (3), because that way we can port the relevant logic from the frontend and attach those computed fields (i.e. diagnosis, biospecimen, primaryParents) at the same time as we validate.

inodb commented 3 years ago

FYI, this is fixed in their latest metadata.