knowledgesystems / curation-scrum

Used for issue tracking of data curation efforts.

Validator updates - based on new SV format/requirements #1243

Closed rmadupuri closed 2 years ago

rmadupuri commented 2 years ago

https://docs.google.com/document/d/17hiqcLGmZCb1wLladmmj6ayUGVnoK6cD8eQvC6Fgqrc/edit

Sample_ID Site1_Hugo_Symbol Site1_Entrez_Gene_ID Site1_Chromosome Site1_Position Site1_Region_Number Site1_Ensembl_Transcript_ID Site2_Hugo_Symbol Site2_Entrez_Gene_ID Site2_Chromosome Site2_Position Site2_Region_Number Site2_Ensembl_Transcript_ID

rmadupuri commented 2 years ago

<-----------------------------------------------SV CLASS------------------------------------------------>

HEADER:

  1. Required Cols: Sample_ID, Event_Info, else error

    Q: Requires column order?

  2. Need Site1_Hugo_Symbol or Site1_Entrez_Gene_Id, else error.

    Need Site2_Hugo_Symbol or Site2_Entrez_Gene_Id, else error.

  3. Need Site1_Exon & Site2_Exon, else error.

    Need Site1_Ensembl_Transcript_Id & Site2_Ensembl_Transcript_Id, else error.
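A minimal sketch of the three header rules above, not the actual validator code. The function name `check_sv_header` and the error strings are hypothetical, and column names are matched exactly as written in the rules (case-sensitive), which is an assumption.

```python
def check_sv_header(columns):
    """Return a list of error messages for an SV file header (hypothetical helper)."""
    cols = set(columns)
    errors = []

    # Rule 1: columns that must always be present.
    for required in ("Sample_ID", "Event_Info"):
        if required not in cols:
            errors.append(f"missing required column: {required}")

    for site in ("Site1", "Site2"):
        # Rule 2: at least one gene identifier column per site.
        if not {f"{site}_Hugo_Symbol", f"{site}_Entrez_Gene_Id"} & cols:
            errors.append(f"need {site}_Hugo_Symbol or {site}_Entrez_Gene_Id")

        # Rule 3: exon and transcript columns for each site.
        for suffix in ("Exon", "Ensembl_Transcript_Id"):
            if f"{site}_{suffix}" not in cols:
                errors.append(f"missing required column: {site}_{suffix}")

    return errors
```

An empty list means the header passes. No column order is enforced here, since that question is still open above.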

DATA:

  1. If NCBI_Build

  2. Gene identification for Site1 and Site2: currently treated as a normal genomic profile. Special case for SV? Current behavior (error): Entrez gene id and gene symbol are both missing (None).

    Needs a valid Site1_Hugo_Symbol or Site1_Entrez_Gene_Id, else error.

    Needs a valid Site2_Hugo_Symbol or Site2_Entrez_Gene_Id, else error.

    Q: What about intragenic or deletion cases where there are no valid symbols? In those cases the validator should still pass even if gene identification for Site1 or Site2 is invalid.

    TODO: If profile is SV, symbol-Entrez pair can resolve to None. Both Site1 & Site2 can resolve to None? Or at least one is needed?

  3. If Event_Info == Fusion:

    The values for Site1, Site1 transcripts & exons are needed (for breakpoint visualization). Else Error

    Q: Event_Info is free text, and it is not always exactly equal to Fusion.

    TODO: Check for substring 'fusion' in Event_Info and apply the same conditions? Or is this test even needed?

  4. Check transcripts and exons against Genome Nexus: the values in the data file should correspond to what is in Genome Nexus, else error.

    Each transcript contains known exons in Genome Nexus. Checks the correctness of Transcript-Exon pair.

    TODO: Is this functionality needed?
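Data checks 3 and 4 above could be sketched roughly as follows. All names here are hypothetical. The sketch assumes the case-insensitive substring match proposed in the TODO, assumes that transcripts and exons are required for both sites, and uses a plain dict as a stand-in for whatever Genome Nexus returns (no real Genome Nexus API call is shown).

```python
FUSION_FIELDS = [
    "Site1_Ensembl_Transcript_Id", "Site1_Exon",
    "Site2_Ensembl_Transcript_Id", "Site2_Exon",
]


def is_fusion(event_info):
    """Case-insensitive substring match, since Event_Info is free text."""
    return "fusion" in (event_info or "").lower()


def check_fusion_row(row):
    """Return error messages for one data row (dict of column name -> string)."""
    if not is_fusion(row.get("Event_Info")):
        return []
    return [f"fusion row is missing {field}"
            for field in FUSION_FIELDS
            if not (row.get(field) or "").strip()]


def check_transcript_exon(transcript_id, exon, known_exons):
    """Check a transcript-exon pair against a lookup table.

    known_exons is a hypothetical dict mapping transcript ID -> set of exon
    numbers, standing in for data fetched from Genome Nexus.
    Returns an error message, or None if the pair is consistent.
    """
    exons = known_exons.get(transcript_id)
    if exons is None:
        return f"unknown transcript: {transcript_id}"
    if int(exon) not in exons:
        return f"exon {exon} is not in transcript {transcript_id}"
    return None
```

Non-fusion rows return no errors from `check_fusion_row`, so intragenic or deletion events are not penalized for missing transcript or exon values.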

NEW TESTS:

  1. Validate uniqueness? Ex: Site1, Site2 (Hugo_Symbol, Entrez_Gene_Id), Sample_ID and Event_Info? Possibility of duplicates with values differing in other cols?
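One possible shape for the proposed uniqueness test, as a sketch only. The key columns are taken from the example above; whether this is the right key is still an open question, and `find_duplicate_keys` is a hypothetical name.

```python
from collections import Counter

# Assumed uniqueness key, taken from the example in the note above.
KEY_COLUMNS = (
    "Sample_ID",
    "Site1_Hugo_Symbol", "Site1_Entrez_Gene_Id",
    "Site2_Hugo_Symbol", "Site2_Entrez_Gene_Id",
    "Event_Info",
)


def find_duplicate_keys(rows):
    """Return key tuples that occur in more than one row.

    rows is an iterable of dicts (column name -> value). Columns outside
    KEY_COLUMNS are ignored, so rows that differ only in other columns
    are still reported as duplicates.
    """
    counts = Counter(
        tuple(row.get(col, "") for col in KEY_COLUMNS) for row in rows
    )
    return [key for key, n in counts.items() if n > 1]
```

Reporting the duplicated key rather than failing on the first hit makes it easier to see whether the duplicates differ in other columns.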

<-----------------------------------------------FUSION CLASS------------------------------------------------>

HEADER:

  1. Required Cols: Hugo_Symbol, Entrez_Gene_Id, Center, Tumor_Sample_Barcode, Fusion, DNA_support, RNA_support, Method, Frame

    Requires column order - True
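For comparison with the SV header rules, an ordered header check might look like the sketch below. It assumes that "requires column order" means the required columns must be the first columns of the file, in the listed order; that interpretation, and the name `check_fusion_header`, are assumptions.

```python
FUSION_COLUMNS = [
    "Hugo_Symbol", "Entrez_Gene_Id", "Center", "Tumor_Sample_Barcode",
    "Fusion", "DNA_support", "RNA_support", "Method", "Frame",
]


def check_fusion_header(columns):
    """Return error messages for a fusion file header (hypothetical helper)."""
    columns = list(columns)
    errors = [f"missing required column: {col}"
              for col in FUSION_COLUMNS if col not in columns]
    # Only check the order once every required column is present.
    if not errors and columns[:len(FUSION_COLUMNS)] != FUSION_COLUMNS:
        errors.append("required columns are not in the expected order")
    return errors
```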

DATA:

  1. Gene identification - requires valid symbol-entrez pair.

  2. Validates Uniqueness based on Hugo_Symbol, Entrez_Gene_Id, Sample_ID and Fusion cols.

<------------------------------------------GENE PANEL MATRIX CLASS-------------------------------------------->

  1. Allowed values : Targeted panels in DB, NA. Allow WGS, WXS
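The allowed-value rule could be sketched like this. `panels_in_db` stands for the set of targeted panel IDs known to the database and is supplied by the caller; the function name and the choice to treat NA, WGS and WXS as fixed extra values are assumptions.

```python
def is_allowed_panel_value(value, panels_in_db, extra_allowed=("NA", "WGS", "WXS")):
    """Return True if a gene panel matrix cell holds an accepted value.

    panels_in_db: collection of targeted panel IDs present in the database
    (hypothetical input). extra_allowed: non-panel values accepted as-is.
    """
    return value in panels_in_db or value in extra_allowed
```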