Closed lydiayliu closed 1 year ago
Was the reditools gvf file generated by a older version of mpg? The reference nucleotide of the variant records in the gvf file doesn't match with the reference.
It's old but not sure how old XD I'll do it again
The version in the GVF file is 0.2.0. I remember that we fixed some bugs in parseREDItools but can't be sure what that was exactly.
Still getting the same error
Command wrapper:
[ 2023-01-05 17:39:50 ] moPepGen callVariant started
[ 2023-01-05 17:43:33 ] Reference indices loaded.
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462_ijc5_sjc5.s_4.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462.s_2.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462_candidates.rmsk.GRCh38_annotated.txt.mincd30_3.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462.gencode.tsv.s_5.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462.gencode.tsv.s_6.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462.gencode.tsv.s_7.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462_circularRNA_known.txt.1.s_1.gvf
[ 2023-01-05 17:45:06 ] Using GVF file: CPCG0462.gencode.tsv.s_8.gvf
[ 2023-01-05 17:45:06 ] Variants sorted
[ 2023-01-05 17:46:01 ] [ !!! moPepGen WARNING !!! ] Hypermutated region detected from graph: 'ENST00000377263.6'. The argument max_variants_per_node = 7 and additional_variants_per_misc = 2 was used to reduce complexity.
[ 2023-01-05 17:46:21 ] 1000 transcripts processed.
[ 2023-01-05 17:46:53 ] Exception raised from CIRC-ENST00000430580.6-E25-E26-E27-E28
Traceback (most recent call last):
File "/usr/local/bin/moPepGen", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 89, in main
args.func(args)
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 353, in call_variant_peptide
results = [wrapper_local(dispatches[0])]
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 252, in wrapper_local
return call_variant_peptides_wrapper(*dispatch)
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 232, in call_variant_peptides_wrapper
_peptides = call_peptide_circ_rna(
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 489, in call_peptide_circ_rna
cgraph.fit_into_codons()
File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/ThreeFrameTVG.py", line 1593, in fit_into_codons
end_nodes = self.expand_alignments(cur)
File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/ThreeFrameTVG.py", line 1523, in expand_alignments
ref_node = start.get_reference_next()
File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/TVGNode.py", line 309, in get_reference_next
raise ValueError('No reference edge was found.')
ValueError: No reference edge was found.
Work dir:
work/66/880e2118ba78cdb722b9154ee6cefa
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Work dir:
/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/work_pipe/85/666ac6234ca909a3f1384d0d0ab406
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
The REDITools GVF has been updated
yiyangliu@ip-0A125215:/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/log$ head /hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/REDItools/CPCG0462_candidates.rmsk.GRCh38_annotated.txt.mincd30.gvf
##fileformat=VCFv4.2
##mopepgen_version=0.10.1
##parser=parseREDItools
##reference_index=/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/index
##genome_fasta=
##annotation_gtf=
##source=reditools
##CHROM=<Description="Gene ID">
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=String,Description="Transcript ID">
##INFO=<ID=GENE_SYMBOL,Number=1,Type=String,Description="Gene Symbol">
Here are all the GVFs
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/CIRCexplorer2/CPCG0462_circularRNA_known.txt.1.s.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/Fusion/star-fusion-1.9.1/CPCG0462.s.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/REDItools/CPCG0462_candidates.rmsk.GRCh38_annotated.txt.mincd30.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/RMATS/CPCG0462_ijc5_sjc5.s.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/VEP/gencode/gindel/CPCG0462.gencode.tsv.s.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/VEP/gencode/gsnp/CPCG0462.gencode.tsv.s.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/VEP/gencode/pindel/CPCG0462.gencode.tsv.s.gvf
/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/VEP/gencode/somaticsniper/CPCG0462.gencode.tsv.s.gvf
This is the old REDITools GVF on this transcript
yiyangliu@ip-0A125215:/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/REDItools/v0.2.0$ grep ENST00000430580.6 CPCG0462*.s.gvf
ENSG00000219481.10 27973 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16585590;STRAND=-1
ENSG00000219481.10 27974 RES-A-G A G . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16585589;STRAND=-1
ENSG00000219481.10 28029 RES-C-G C G . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16585534;STRAND=-1
ENSG00000219481.10 30827 RES-A-T A T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16582736;STRAND=-1
ENSG00000219481.10 36176 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16577387;STRAND=-1
ENSG00000219481.10 46367 RES-C-A C A . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16567196;STRAND=-1
ENSG00000219481.10 46372 RES-G-C G C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16567191;STRAND=-1
ENSG00000219481.10 46379 RES-A-G A G . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16567184;STRAND=-1
ENSG00000219481.10 47140 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16566423;STRAND=-1
ENSG00000219481.10 47782 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16565781;STRAND=-1
ENSG00000219481.10 47783 RES-C-T C T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16565780;STRAND=-1
ENSG00000219481.10 47784 RES-A-T A T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16565779;STRAND=-1
ENSG00000219481.10 49384 RES-C-T C T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16564179;STRAND=-1
ENSG00000219481.10 50275 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563288;STRAND=-1
ENSG00000219481.10 50276 RES-T-A T A . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563287;STRAND=-1
ENSG00000219481.10 50290 RES-G-A G A . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563273;STRAND=-1
ENSG00000219481.10 50412 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563151;STRAND=-1
ENSG00000219481.10 50416 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563147;STRAND=-1
ENSG00000219481.10 50419 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563144;STRAND=-1
ENSG00000219481.10 50434 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563129;STRAND=-1ENSG00000219481.10 50462 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563101;STRAND=-1
ENSG00000219481.10 50492 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563071;STRAND=-1
ENSG00000219481.10 50545 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563018;STRAND=-1ENSG00000219481.10 50555 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563008;STRAND=-1ENSG00000219481.10 50556 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563007;STRAND=-1
ENSG00000219481.10 50592 RES-A-C A C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16562971;STRAND=-1
ENSG00000219481.10 50644 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16562919;STRAND=-1ENSG00000219481.10 50645 RES-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16562918;STRAND=-1
This is the new GVF (I went for the more filtered version)
yiyangliu@ip-0A125215:/hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Parser/REDItools$ grep ENST00000430580.6 CPCG0462*.gvf
ENSG00000219481.10 27973 RES-27973-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16585590;STRAND=-1
ENSG00000219481.10 27974 RES-27974-A-G A G . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16585589;STRAND=-1
ENSG00000219481.10 28029 RES-28029-C-G C G . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16585534;STRAND=-1
ENSG00000219481.10 30827 RES-30827-A-T A T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16582736;STRAND=-1
ENSG00000219481.10 46367 RES-46367-C-A C A . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16567196;STRAND=-1
ENSG00000219481.10 46372 RES-46372-G-C G C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16567191;STRAND=-1
ENSG00000219481.10 46379 RES-46379-A-G A G . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16567184;STRAND=-1
ENSG00000219481.10 47140 RES-47140-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16566423;STRAND=-1
ENSG00000219481.10 47782 RES-47782-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16565781;STRAND=-1
ENSG00000219481.10 47783 RES-47783-C-T C T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16565780;STRAND=-1
ENSG00000219481.10 47784 RES-47784-A-T A T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16565779;STRAND=-1
ENSG00000219481.10 49384 RES-49384-C-T C T . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16564179;STRAND=-1
ENSG00000219481.10 50275 RES-50275-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563288;STRAND=-1
ENSG00000219481.10 50276 RES-50276-T-A T A . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563287;STRAND=-1
ENSG00000219481.10 50290 RES-50290-G-A G A . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563273;STRAND=-1
ENSG00000219481.10 50412 RES-50412-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563151;STRAND=-1
ENSG00000219481.10 50416 RES-50416-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563147;STRAND=-1
ENSG00000219481.10 50419 RES-50419-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563144;STRAND=-1
ENSG00000219481.10 50434 RES-50434-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563129;STRAND=-1
ENSG00000219481.10 50462 RES-50462-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563101;STRAND=-1
ENSG00000219481.10 50492 RES-50492-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563071;STRAND=-1
ENSG00000219481.10 50545 RES-50545-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563018;STRAND=-1
ENSG00000219481.10 50555 RES-50555-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563008;STRAND=-1
ENSG00000219481.10 50556 RES-50556-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16563007;STRAND=-1
ENSG00000219481.10 50592 RES-50592-A-C A C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16562971;STRAND=-1
ENSG00000219481.10 50644 RES-50644-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16562919;STRAND=-1
ENSG00000219481.10 50645 RES-50645-T-C T C . . TRANSCRIPT_ID=ENST00000430580.6;GENOMIC_POSITION=chr1:16562918;STRAND=-1
Can you show me the path to the original reditools output?
It is in this folder: /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/rnaedit/annotated
The whole situation is a bit complicated because the reditools calls were originally on GRCh37 and were lifted over, and then you had to run some annotations on them, but this is the "final" product that I use to make the GVFs
Ah, I remember that now. Did I do this? It could be that the position of the rna editing somehow got mismatched. Will look into it.
I thought I did it. I think the code is here: https://github.com/uclahs-cds/project-MissingPeptides-Method/pull/21#pullrequestreview-879625630. But then you did the double checking and annotations: https://github.com/uclahs-cds/project-MissingPeptides-Method/issues/12#issuecomment-1031846155
Btw these same 0.2.0 GVFs ran before with CPCG, so we are finding a previously un-addressed problem?
The issue is because the coordinate of the RNA editings somehow got messed up, so for a variant for example, RES-27973-T-C, the reference nucleotide at position 27973 is in fact C, so both the reference node and the variant node has the same nucleotide (C). And in certain cases for example, if this is the stop codon, then the variant node should be kept. This then results none of the variant in the bubble is reference node. It's hard to tell way it did work before. Maybe it was because node collapsing was not introduced back then, but the RNA editing files are still problematic.
Interesting, so can these entries be removed during parsing?
Let me find out whether it was liftover or parseREDITools that causes the positions being mismatched.
It turns out that the REDItools output is stranded. The reference nucleotide it has in its column No. 3 is the one in the gene. I thought it was the nucleotide on reference genome (unstranded). That's why I'm seeing half of the variant records parsed from the reditools output don't match.. 🤮
Not sure if this is new after #632, but this sample was not run before