nih-cfde / the-fair-cookbook

NIH CFDE project for creating a collection of FAIRification recipes
https://cfde-published-documentation.readthedocs-hosted.com/en/latest/

Findings from KF ETL Walk-through - C2M2 Level 1 Schema Mismatch #43

Closed dcfcjcheadle closed 4 years ago

dcfcjcheadle commented 4 years ago

Hi @ACharbonneau and @rpwagner,

I've been tasked with looking through some of the available public documentation from the perspective of a new DCC, and making any comments or suggestions for edits to the content and scripts. Specifically, I'm following the Experience from KidsFirst (KF) ETL walk-through. Over the course of this walk-through, I've found that some of the KF data does not match the C2M2 Level 1 schema. I've summarized these findings in a PowerPoint (attached) and have a zip file of the data + script, which I'm happy to share. We wanted to bring this to your attention to:

  1. See if this necessitates a change to the C2M2 data model
  2. If no change is necessary, update the documentation/script to handle this and explain to users what was performed

Please let me know, and let me know if I can clarify anything. Thanks!

John

KF ETL Walkthrough Findings.pptx

dcfcjcheadle commented 4 years ago

FWIW @abhijna and @marisalim, I just saw you both assign yourselves. I can put proposed edits to the documentation and R script into a PR whenever, but I wanted to clarify some of the issues I found before doing that. I'll defer to you on when/where to share these things. Thanks!

ACharbonneau commented 4 years ago

Hey @dcfcjcheadle. Abhijna and Marisa can check whether the data they built imported into the model incorrectly; however, if the model reflects what they built, then everything is fine.

All of these datasets (except MoTrPAC, HMP, and LINCS) were built by the coordination center rather than the DCCs, so we know they're not necessarily 'right'. They are not, and were never meant to be, the data that will actually be displayed in the model at public launch. They will all be fixed/rebuilt by the DCCs once they have funding.

So, we are not going to change the scripts or how we built the data that went into the model: it's the record of what we had for the demo, and something the DCCs can use as a basis for their own builds. However, if you think there is a mismatch between what Abhijna submitted and what is displayed in the model, that could be useful. (Depending on the specific mismatch: @abradyIGS did some hacking of Abhijna's submission in post, so there are a couple of specific things we know don't match.)

abhijna commented 4 years ago

@dcfcjcheadle thanks for bringing this to my attention. I can go back over the R script early next week. It's possible there were some last-minute changes to the code that didn't get pushed to GitHub. Feel free to make a PR now or after you've heard from me next week :)

dcfcjcheadle commented 4 years ago

@ACharbonneau and @abhijna thanks for the responses and the context! I'll defer to you for interest in any proposed edits.

abhijna commented 4 years ago

Hi @dcfcjcheadle ,

I finally had a chance to go over your slides and my code. Both of the problems you described stem, to some extent, from the fact that some cells in the KF data spreadsheet have multiple entries. In many cases, those multiple entries are repeats. E.g.:

[Screenshot: a spreadsheet cell containing repeated entries]

While preparing C2M2 data tables for ingest, we decided to drop all but the first entry. Maybe there is a better way to do it, but KF needs to make a decision about which entries they want to keep and/or how they want to restructure their raw data to make wrangling for ingest easier.
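The keep-only-the-first-entry decision can be sketched as follows. This is a hypothetical Python illustration (the actual ETL script is in R), assuming comma-separated entries within a cell:

```python
def first_entry(cell: str) -> str:
    """Keep only the first comma-separated entry of a cell,
    stripped of surrounding whitespace (illustrative sketch)."""
    return cell.split(",")[0].strip()

# Repeated entries collapse to a single value; single entries pass through.
print(first_entry("WGS, WGS"))  # WGS
print(first_entry("WXS"))       # WXS
```

As noted above, this is a stopgap: KF would still need to decide which entry is actually the right one to keep.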

Here are some other thoughts (and questions) on the two problems you pointed out with the KF ingest:

1) Multiple assay types per file (2 unique files): Again, this is a KF raw data problem. In their data spreadsheet, they have two different experimental strategies (separated by a comma) in the same cell (see screenshot).

[Screenshot: file_id vs. assay_type, showing two experimental strategies in one cell]

When my code splits those cells into individual rows, the resulting rows are not exact duplicates (one has WGS and the other has RNA-seq for the same file id), so both are retained in the final dataset. KF will need to decide which "strategies" are best applicable to those file ids.
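The split-then-deduplicate behavior described above can be sketched as follows. This is a hypothetical Python version (the real script is in R), with made-up file ids; the point is that only exact duplicate rows are dropped, so two distinct strategies for the same file both survive:

```python
def explode_and_dedupe(rows):
    """Split multi-entry strategy cells into one row per entry,
    then drop exact duplicate rows while preserving order (sketch)."""
    out = []
    seen = set()
    for file_id, strategy_cell in rows:
        for strategy in (s.strip() for s in strategy_cell.split(",")):
            row = (file_id, strategy)
            if row not in seen:  # only exact repeats are dropped
                seen.add(row)
                out.append(row)
    return out

rows = [("GF_001", "WGS, RNA-Seq"),  # two distinct strategies: both kept
        ("GF_002", "WGS, WGS")]      # exact repeat: collapses to one row
print(explode_and_dedupe(rows))
```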

2) Multiple assay types per biosample (90 unique biosamples): I can't seem to replicate this error. In the biosample.tsv input table generated by running my R script, I don't see any of the duplicated biospecimen IDs you have in slide 10. After you downloaded the KF data, did you do this part of the pre-processing described in the cookbook recipe?

Find and replace KF variables in raw data TSVs with C2M2's controlled vocabulary (CV):

- file_format: replace with corresponding EDAM terms from this link
- data_type: replace with corresponding data terms from this link (look for the "data:" tag)
- anatomy: replace with corresponding UBERON ID from this link
- assay_type: replace with corresponding terms from this link. This was done programmatically (in R) for the first pass because there are only 3 possible values.
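The find-and-replace step above is essentially a lookup from raw KF values to CV terms. A minimal Python sketch, assuming a small hand-built mapping (the CV values below are placeholders, not the real EDAM/UBERON/OBI identifiers, and the actual first-pass replacement was done in R):

```python
# Placeholder mapping: substitute the real CV terms from the linked tables.
ASSAY_TYPE_CV = {
    "WGS": "CV_TERM_FOR_WGS",
    "WXS": "CV_TERM_FOR_WXS",
    "RNA-Seq": "CV_TERM_FOR_RNASEQ",
}

def to_cv(raw_value: str, mapping: dict) -> str:
    """Replace a raw KF value with its controlled-vocabulary term.
    Raising on unknown values surfaces unmapped entries early."""
    try:
        return mapping[raw_value]
    except KeyError:
        raise ValueError(f"No CV mapping for raw value: {raw_value!r}")

print(to_cv("WGS", ASSAY_TYPE_CV))
```

Failing loudly on unmapped values is a deliberate choice here: silently passing raw strings through is exactly how un-normalized entries like "Linked-Read WGS (10x Chromium)" would slip into the ingest tables.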

Based on the table you have on slide 11 (below), it looks like you're working with raw data.

Biospecimen.ID | Composition | Experiment.Strategy
-- | -- | --
BS_QXZZY4SJ | Solid Tissue | WXS
BS_QXZZY4SJ | Solid Tissue | WGS
BS_NV00B2Y5 | Solid Tissue | WGS
BS_NV00B2Y5 | Solid Tissue | WXS
BS_SXVVN52M | Solid Tissue | WGS
BS_SXVVN52M | Solid Tissue | WXS
BS_69KVWAWK | Solid Tissue | WGS
BS_69KVWAWK | Solid Tissue | WXS
BS_P8X3GAEP | Solid Tissue | WGS
BS_P8X3GAEP | Solid Tissue | WXS
BS_BBK329W5 | Solid Tissue | WGS
BS_BBK329W5 | Solid Tissue | WXS
BS_8JTH2X5Z | Solid Tissue | WGS
BS_8JTH2X5Z | Solid Tissue | WXS
BS_XX68G6KE | Solid Tissue | WGS
BS_XX68G6KE | Solid Tissue | WXS
BS_W05MC7QG | Solid Tissue | WGS
BS_W05MC7QG | Solid Tissue | WXS
BS_D7442ACV | Solid Tissue | WGS
BS_D7442ACV | Solid Tissue | Linked-Read WGS (10x Chromium)
BS_WCEHT7D0 | Solid Tissue | WGS
BS_WCEHT7D0 | Solid Tissue | WXS
BS_KE5D12KG | Solid Tissue | WGS
BS_KE5D12KG | Solid Tissue | WXS
BS_4GHFYCWZ | Solid Tissue | WGS
BS_4GHFYCWZ | Solid Tissue | WXS
BS_KW1B1Y8P | Solid Tissue | WXS
BS_KW1B1Y8P | Solid Tissue | WGS
BS_ABZC437A | Solid Tissue | WGS
BS_ABZC437A | Solid Tissue | WXS
BS_QEKBZRMD | Solid Tissue | WGS
BS_QEKBZRMD | Solid Tissue | WXS
BS_P6FPBJM8 | Solid Tissue | WGS
BS_P6FPBJM8 | Solid Tissue | WXS
BS_J8YAY1HM | Solid Tissue | WGS
BS_J8YAY1HM | Solid Tissue | WXS
BS_NK1FJNRG | Solid Tissue | WGS
BS_NK1FJNRG | Solid Tissue | WXS
BS_25EXB29C | Solid Tissue | WGS
BS_BTRGQHM4 | Solid Tissue | WGS
BS_BTRGQHM4 | Solid Tissue | WXS
…. | …. | ….

To answer your question about whether the model needs to change: I built this dataset by pulling KF data from their public portal, for a demo. Once FOA funding is complete, the DCCs will build their own datasets, and there is no reason to believe they will do so from the files they make publicly accessible in end-user formats. They will build from their own internal data models, which likely have entirely different structures, so nothing about our demo data is informative about the 'real world' case of Kids First metadata fitting the model.

dcfcjcheadle commented 4 years ago

Hi @abhijna,

Thank you for walking through that again. I didn't fully appreciate the context behind these walk-throughs so that was very helpful for me.

What I brought up in this issue has been brought up with the tech team in nih-cfde/cfde-deriva#92, so I'm going to close this issue as a duplicate.