knowledgesystems / pipelines-scrum

Repository for tracking uncategorizable issues related to backend pipelines work
0 stars 0 forks source link

Shifted SV Columns #939

Closed averyniceday closed 2 years ago

averyniceday commented 2 years ago

Done Condition (What do we need? Why do we need it? Keep this is small as possible!)

Hongxin and Anusha noticed SV_Class being shifted off in the mskimpact cohort - figure out how it happened and fix data and bug

Technical Description (How are we going to achieve the above)

Potential Issues

Dependencies

Technical Requirements

Outside People/Teams

Changes

sheridancbio commented 2 years ago

Commands and filenames we used (Manda and I): git checkout c4c0da1e4c2142aa705245b76265dcc0fa0ce750 cp -a data_SV.txt ~/svrollout/mskimpact/data_SV_unmodified_june30.txt ... (all 4 cohorts) git checkout master cp data_sv.txt ~/svrollout/mskimpact/data_sv.txt ... (all 4 cohorts) ../repopulate_class/apply_data_sv_updates_modified_for_old_sv_schema.py data_SV_unmodified_june30.txt > data_SV_with_eventinfo_corrections.txt 2> data_SV_with_eventinfo_corrections.log ... (all 4 cohorts) python ../repopulate_class/backfill_class.py data_sv.txt data_SV_with_eventinfo_corrections.txt > data_sv_with_backfilled_class_field.txt ... (all 4 cohorts) cut -f 1,23 data_sv_with_backfilled_class_field.txt | sort | grep -v TRANSLOCATION | grep -v INVERSION | grep -v DELETION | grep -v INTRAGENIC_LOSS | grep -v DUPLICATION | cut -f 2 | sort | uniq -c # this checks whether all filled in values have been excluded from the "missing value" set cut -f 1,23 data_sv_with_backfilled_class_field.txt | sort | grep -v TRANSLOCATION | grep -v INVERSION | grep -v DELETION | grep -v INTRAGENIC_LOSS | grep -v DUPLICATION | cut -f 1 | sort | uniq > samples_missing_class_for_some_sv_event.ids ... (all 4 cohorts)

sheridancbio commented 2 years ago

we discovered today that mskarcher_unfiltered fetches (which come from cvr) do not (and apparently never did) contain the field for sv_class. We verified this with today's cvr_data.json for archer using this command: grep sv_class_name cvr_data.json

mskimpact fetches show regular population of this field.

the mskarcher_unfiltered fetch for today did have minimal populated sv events reported, such as this one: "sv-variants" : [ { "annotation" : "POSITIVE FOR GENE FUSIONS IN THE INVESTIGATIONAL PANEL:\n\nP2RY8-CRLF2 fusion.\nNote: The rearrangement is an in-frame fusion between genes P2RY8 undefined (NM_178129) and CRLF2 Exon1 (NM_022148).", "comments" : "Lab Notes\nRun Number: ###################4\nMacro-dissection: Unknown\n", "site1_gene" : "P2RY8", "site2_gene" : "CRLF2", "gene1" : "P2RY8", "gene2" : "CRLF2", "exon1" : "Exon1", "exon2" : "Exon1", "sample_comment" : "Lab Notes\nRun Number: XXXXXXXXXXXXXX4\nMacro-dissection: Unknown\n", "ct" : "21.23", "sample_notes" : null } ],

sheridancbio commented 2 years ago

while parallelizing the mskimpact run through cvr, we left the batchsize at 250 and when three simultaneous processes had queued things, we went above the 300 sample limit on the cvrqueue. We need to either do a linear run or smaller batches which stay under the 300 limit.

sheridancbio commented 2 years ago

in the msk heme study some events are actually coming from archer we need to figure out how to map a heme sample id to archer sample ids and fetch from archer the linked samples. Possible solution : fetch the related samples from archer, construct a key for the event but substitute the heme sample id into the key so that it can be matched to the heme study keys (for missing class sv events)

One example heme sample with a "merged in" archer event is P-0024884-T01-IH3

sheridancbio commented 2 years ago

another problem we found was that the cvr fetches we are receiving now often contain empty event_info fields, such as : "event_info": "-", Our fetchers have been updated to rewrite this field as "Gene1-Gene2 Fusion" and so the latest data_sv.txt files have non-empty event_info fields ... which means that the event key (which contains event_info as an element) does not match. We can correct this by applying the same logic to "fill in" empty event_info fields in the cvr fetch output as we use in the cvr_fetcher.jar.

sheridancbio commented 2 years ago

Rob : Don't forget that tomorrow is a new day ... and there will be brand new data_sv.txt files to start the analysis with.

sheridancbio commented 2 years ago

a note above says that we need to find a way to match heme to archer sample ids, but currently there is no point doing that ... there are no values for sv_class in the archer cohort so these events cannot be filled in.

sheridancbio commented 2 years ago

The fetches for mskimpact today were based on an older version of samples_missing_class_for_some_sv_event.ids (from a few days ago). However, when reconstituting the list from today's data_sv.txt, the resulting list was identical to the previous one, except that today's list would have excluded sample P-0034184-T01-IM6. So the fetch covers all samples which are still needed today. (this is sort of an indication that newly fetched samples are getting the sv_class field populated as expected.

sheridancbio commented 2 years ago

Because we ran overlapping batch fetches, the same sample came down in more than one concurrently run batch. So we need to extract sv_class on a batch-by-batch basis and then merge results (dropping duplicates between batches)

sheridancbio commented 2 years ago

I have checked whether or not event_info matters in the key for fetched events or events missing class. It does not. Cutting the key fields minus event_info out of the latest data_sv like this: cut -f 1,3,4,13,14,17,18 data_sv_with_backfilled_class_field.txt | sort | uniq -c | sort -n shows that there is not a single case where duplicate events are present based only on the sample_id, genes, chromosomes, and positions

sheridancbio commented 2 years ago

After extracting all the events (with keys) from the requeued and fetched 6523 mskimpact samples which had at least one sv event missing a Class value, only 260 of the events were able to be filled in based on the a newly fetched copy of the sample from the lynx server. This is surprisingly low. I'm going to look into why.

Some numbers: 6909 Total events missing class after the backfill from old timepoint 4918 events where some field ends in "- Archer" But that still leaves about 2000 events which might have been replaced -- and only 260 were replaced (based on a new sample fetch). Setting Archer aside and probing some random cases, it seems that some genes have changed in our fetcher output. for example, our data_sv.txt file has an event for P-0013114-T03-IM6 showing genes {LOC101929504 and VTCN1} but when re-fetched there is only an event with genes {VTCN1, VTCN1}. So something has changed here to come down with a different pair of genes and that is why we cannot match up the new event with the old event. Other examples of this: P-0021501-T01-IM6 {NOTCH4, PRICKLE4} apparently became {NOTCH4, NOTCH4} P-0024734-T01-IM6 {TNPO2, RECQL4} apparently became {RECQL4, RECQL4} P-0022364-T01-IM6 {UHMK1, PARP1} apparently became {PARP1, PARP1}

These patterns remind me that we have rewritten our fetch code so that we drop gene2 if gene1 == gene2. However, in the data_sv.txt file (excluding the Archer events and only looking at samples which had a missing class value), there are no events which do not have an actual value filled in for both gene1 and for gene2 (and they seem to always be different from each other).

Here is an informative case: P-0024518-T01-IM6 has a record in data_sv.txt involving genes {NF1, HCN1} ... when it was requeued and refetched to try to determine the class for this sample, we got this: "site1_gene": "NF1", "site2_gene": "NF1", "sv_class_name": "TRANSLOCATION", and no mention of HCN1 except in the site2 description field: "site2_desc": "IGR: 1Mb before HCN1(-)", It appears to me that something was simply updated upstream for this sample. To get this new information we basically need to requeue this sample. It is not feasible to match up changes like this programmatically.

Taking together the samples involving events (excluding - Archer) which could not be populated by requeue/fetch, and the samples which came down with non-present (i.e. missing) sv events when looking in the data_sv.txt file,

This raises the question of whether other events (which did have sv_class set to a value) have also been changed upstream and could/should be updated through requeue as well.

sheridancbio commented 2 years ago

We also saw events which came down on refetch which seemed to be changed even though they had a sv_class value. Sample P-0033176-T01-IM6 was in the data_sv file with genes {RBM20, FGFR3} and class TRANSLOCATION. On requeue and fetch, we received just one sv-variants record for this sample with: "site1_gene": "FGFR3" "site2_gene": "RBM20" "sv_class_name": "TRANSLOCATION",

So in this case, the order of the two genes in our data_sv file was reversed from the order in which they came down from cvr. I know that the order of genes is something we want to correct, but this is not a something I think we should be trying to massage at this point. I think we want our data_sv file to reflect the gene order as presented by cvr/lynx. So cases like this (which seem few) will also be requeued .. but this case was only found because there was a separate "- Archer" event without class information. So there are probably many more cases like this where we do not know that we could fix this discrepancy.

sheridancbio commented 2 years ago

It is clear that it is not possible to simply "fetch the current classes" and plug in missing values for many of the samples missing class.

I think to wrap up this effort we should:

  1. notify the dmp that the archer fetches are not coming fully populated and always are missing the "sv_class"."
  2. manually check in the changes for the events where we could backfill class from old versions of data_sv, and those we could populate based on a matching up a new fetch to an old event missing class. This covers only about 15% of the non-archer "classless" events
  3. take the import pipelines offline over the weekend and repeatedly fetch batches of 250? samples to work through the 2796 samples detected above (which had sv events that came down and did not match something currently in data_sv.txt. Then we let a long import run to annotate all of those ~3000 fetched samples.

Afterwards, we probably have many samples which will still be discrepant from the current sv-variants events in the cvr system. We should consider rebuilding the entire msk_impact_solid heme from scratch (starting with an empty study and queueing/fetching every sample into it using the latest fetching code.)

sheridancbio commented 2 years ago

Processing access:

sheridancbio commented 2 years ago

Processing heme:

in the case of heme, there two situations of being unable to match samples are not subsets of each other. In the union of the 33 and 8 samples, there are 35 distinct samples which should be requeued/re-imported after this backfill effort. see files:

sheridancbio commented 2 years ago

This card is now considered complete except for requeueing and doing full fetch processing of ~3000 samples where discrepancies were noticed between the current data_sv.txt files and a recent fetch of affected samples from lynx. We plan to do this processing over the Sep3/4 weekend.

sheridancbio commented 2 years ago

The samples to be requeued over the weekend are those which showed discrepancies between events in data_sv and the recent fetch. I originally thought some samples simply failed to requeue during our efforts, but after looking again it seems that all requested samples did indeed get fetched in the downloads for the batches.

All samples which fail to requeue during this weekend effort should be noted and then we should request the cvr team requeue the samples themselves (hopefully it will be relatively small) counts: mskimpact:

There are also related archer samples which could be requeued, but archer does not provide sv class information, so requeueing them will not help with the current effort. But there is a separate issue that events from archer unfiltered which were transferred into mskimpact and mskimpact_heme may have some surviving records with eventInfo "- Archer" ... although most of these should have been handled during the sv deployment when we ran scripts to populate eventInfo correctly. So this is not going to be done at this time.

sheridancbio commented 2 years ago

Batch 7 failed CVR fetch. [CVR fetch failed!] [CVR Germline fetch failed!] .. relaunching batch 7 without consuming

sheridancbio commented 2 years ago

Batches 0 through 9 are now complete .. apparently successfully fetched. That leaves only batch 10 of mskimpact and the additional hemepact samples and the access sample. I plan to requeue all of that and start a full run with the complete script .. except I will disable the check which cancels import if the fetch is not done by 5:30am.

sheridancbio commented 2 years ago

Remaining mskimpact batch, plus all samples from mskimpact_heme, mskarcher have been requeued. A full run is beginning tonight at 11:25. (3 and a half hours late). It will probably run through noon tomorrow. But if errors occur, we will try to make corrections for a run tomorrow night so the database is up to date again by Tuesday morning at the latest (hopefully).

sheridancbio commented 2 years ago

All work on this completed as expected. We can mark this done .. or we can do another survey of missing sv_class fields in the data_sv file

mandawilson commented 2 years ago

[cbioportal_importer@ip-10-1-141-114 mskimpact_heme]$ cut -f 23,26 data_sv.txt | awk -F" " '$2 !~ /Archer/ {print $1}' | sort | uniq -c 1 Class 305 DELETION 75 DUPLICATION 132 INVERSION 293 TRANSLOCATION