Closed by lydiayliu 3 years ago
For example, can't really tell why
CPCG0465.gencode.aa.tsv.gvf:ENSG00000112659.14 5638 INDEL-5638-C-CGA C CGA . . TRANSCRIPT_ID=ENST00000372647.6;GENOMIC_POSITION=chr6:43187821-43187822;GENE_SYMBOL=CUL9
no longer produces variants.
The problem does seem to be concentrated in a very limited set of 26 transcripts (and the number of affected genes is probably even smaller, since several of these are isoforms):
1 ENST00000373563.9
1 ENST00000409196.7
1 ENST00000409451.7
1 ENST00000409480.5
1 ENST00000409547.5
1 ENST00000629305.2
2 ENST00000509479.6
2 ENST00000644946.1
5 ENST00000319555.8
7 ENST00000430170.6
7 ENST00000445164.6
7 ENST00000524993.6
7 ENST00000526090.1
7 ENST00000528626.5
9 ENST00000262494.12
9 ENST00000376767.7
13 ENST00000567390.6
30 ENST00000252050.9
30 ENST00000372647.6
37 ENST00000374223.5
41 ENST00000357089.8
41 ENST00000374217.6
41 ENST00000374221.7
41 ENST00000374222.5
56 ENST00000314675.11
192 ENST00000360004.5
Maybe a reverse approach can be taken?
ENST00000360004.5 is HLA lol. At least we didn't ignore HLA entirely, though it is not really useful either way...
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0339.gencode.aa.tsv.gvf
ENSG00000196126.11 5493 INDEL-5493-G-GTAT G GTAT . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584355-32584356;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5494 INDEL-5494-AGGG-A AGGG A . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584352-32584354;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5526 INDEL-5526-GGTGCGG-G GGTGCGG G . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584317-32584322;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5664 INDEL-5664-G-GGA G GGA . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584184-32584185;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5665 INDEL-5665-C-CGG C CGG . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584183-32584184;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5670 INDEL-5670-GC-G GC G . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584178;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5670 INDEL-5670-GCG-G GCG G . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584177-32584178;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5677 INDEL-5677-GC-G GC G . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584171;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 8223 INDEL-8223-C-CA C CA . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32581625-32581626;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 8226 INDEL-8226-AG-A AG A . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32581622;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 8280 INDEL-8280-A-ATG A ATG . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32581568-32581569;GENE_SYMBOL=HLA-DRB1
Everything except the first two indels is ignored. Is that what we decided?
ENSG00000196126.11 5493 INDEL-5493-G-GTAT G GTAT . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584355-32584356;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11 5494 INDEL-5494-AGGG-A AGGG A . . TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584352-32584354;GENE_SYMBOL=HLA-DRB1
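The GVF records quoted above all share one layout, so they can be pulled apart programmatically when staring at them gets tedious. Below is a minimal sketch, assuming the layout seen in these lines (gene ID, position, variant ID, REF, ALT, two placeholder columns, then `KEY=VALUE` attributes separated by `;`). The function name `parse_gvf_line` is made up for illustration and is not part of moPepGen.

```python
def parse_gvf_line(line):
    """Parse one GVF record as quoted in this thread.

    Assumes whitespace- or tab-separated columns with the attribute
    string (KEY=VALUE pairs joined by ';') in the last column.
    """
    fields = line.split()
    gene_id, position, variant_id, ref, alt = fields[:5]
    # Attributes such as TRANSCRIPT_ID, GENOMIC_POSITION, GENE_SYMBOL
    attrs = dict(item.split("=", 1) for item in fields[-1].split(";"))
    return {
        "gene_id": gene_id,
        "position": int(position),
        "variant_id": variant_id,
        "ref": ref,
        "alt": alt,
        # A longer ALT than REF is an insertion, otherwise a deletion
        "indel_type": "insertion" if len(alt) > len(ref) else "deletion",
        **attrs,
    }


record = parse_gvf_line(
    "ENSG00000196126.11 5493 INDEL-5493-G-GTAT G GTAT . . "
    "TRANSCRIPT_ID=ENST00000360004.5;"
    "GENOMIC_POSITION=chr6:32584355-32584356;GENE_SYMBOL=HLA-DRB1"
)
print(record["TRANSCRIPT_ID"], record["position"], record["indel_type"])
```

Splitting on any whitespace keeps the sketch usable for both the pasted lines here and tab-delimited files on disk.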
CPCG0465.gencode.aa.tsv.gvf:ENSG00000112659.14 5638 INDEL-5638-C-CGA C CGA . . TRANSCRIPT_ID=ENST00000372647.6;GENOMIC_POSITION=chr6:43187821-43187822;GENE_SYMBOL=CUL9
I just ran it with the downsampled reference and used this single variant, and it actually produces 24 peptides. Below are those peptides. Would it be possible that we now only report the first INDEL, so it is omitted from the FASTA header?
>ENST00000372647.6|INDEL-5638-C-CGA|28
KRLSPSK
>ENST00000372647.6|INDEL-5638-C-CGA|42
VILSCLTSFW
>ENST00000372647.6|INDEL-5638-C-CGA|37
QRPSPPR
>ENST00000372647.6|INDEL-5638-C-CGA|22
SPSRRPASQRK
>ENST00000372647.6|INDEL-5638-C-CGA|21
SPSRRPASQR
>ENST00000372647.6|INDEL-5638-C-CGA|4
FEGSTLNDLR
>ENST00000372647.6|INDEL-5638-C-CGA|30
RLSPSKQRPRPLR
>ENST00000372647.6|INDEL-5638-C-CGA|33
LSPSKQRPRPLRQRPSPPR
>ENST00000372647.6|INDEL-5638-C-CGA|41
QGPRPPWHRVILSCLTSFW
>ENST00000372647.6|INDEL-5638-C-CGA|10
STPSMGCCLMNQAAR
>ENST00000372647.6|INDEL-5638-C-CGA|40
QGPRPPWHR
>ENST00000372647.6|INDEL-5638-C-CGA|19
RSPSRRPASQR
>ENST00000372647.6|INDEL-5638-C-CGA|35
QRPRPLRQRPSPPR
>ENST00000372647.6|INDEL-5638-C-CGA|25
RPASQRKR
>ENST00000372647.6|INDEL-5638-C-CGA|8
STPRSTPSMGCCLMNQAAR
>ENST00000372647.6|INDEL-5638-C-CGA|24
RPASQRK
>ENST00000372647.6|INDEL-5638-C-CGA|5
FEGSTLNDLRSTPR
>ENST00000372647.6|INDEL-5638-C-CGA|34
QRPRPLR
>ENST00000372647.6|INDEL-5638-C-CGA|13
LLHEITPVPQIQK
>ENST00000372647.6|INDEL-5638-C-CGA|14
LLHEITPVPQIQKR
>ENST00000372647.6|INDEL-5638-C-CGA|32
LSPSKQRPRPLR
>ENST00000372647.6|INDEL-5638-C-CGA|15
LLHEITPVPQIQKRSPSR
>ENST00000372647.6|INDEL-5638-C-CGA|38
QRPSPPRQGPRPPWHR
>ENST00000372647.6|INDEL-5638-C-CGA|36
QRPRPLRQRPSPPRQGPRPPWHR
Everything here should be recorded because peptides are missing, I don't care about the header as much... In my file, ENST00000372647.6|INDEL-5636-T-TACG
only produced 3 peptides, as opposed to the many that the previous algorithm produced.
CPCG0465.gencode.aa.tsv.gvf.3f.fasta:>ENST00000252050.9|INDEL-5633-ACC-A|3 ENST00000372647.6|INDEL-5633-ACC-A|3
CPCG0465.gencode.aa.tsv.gvf.3f.fasta-FEGSTLNDAQLPDLHQVWAAV
--
CPCG0465.gencode.aa.tsv.gvf.3f.fasta:>ENST00000252050.9|INDEL-5636-T-TACG|4 ENST00000372647.6|INDEL-5636-T-TACG|4
CPCG0465.gencode.aa.tsv.gvf.3f.fasta-FEGSTLNDLR
--
CPCG0465.gencode.aa.tsv.gvf.3f.fasta:>ENST00000252050.9|INDEL-5636-T-TACG|5 ENST00000372647.6|INDEL-5636-T-TACG|5
CPCG0465.gencode.aa.tsv.gvf.3f.fasta-FEGSTLNDLRLNSQIYTK
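To pin down exactly which peptides differ between two runs, the sequences can be compared as sets, ignoring the headers. This is only a sketch: `read_peptides` and `compare_runs` are illustrative names, and the inputs are assumed to be plain FASTA text with one sequence line per record, as in the output pasted above.

```python
def read_peptides(fasta_text):
    """Collect the set of peptide sequences, ignoring header lines."""
    return {
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    }


def compare_runs(run_a, run_b):
    """Report peptides unique to each run and those shared by both."""
    return {
        "only_a": run_a - run_b,
        "only_b": run_b - run_a,
        "shared": run_a & run_b,
    }


run_a = read_peptides(
    ">ENST00000372647.6|INDEL-5638-C-CGA|4\nFEGSTLNDLR\n"
    ">ENST00000372647.6|INDEL-5638-C-CGA|28\nKRLSPSK\n"
)
run_b = read_peptides(
    ">ENST00000372647.6|INDEL-5636-T-TACG|4\nFEGSTLNDLR\n"
    ">ENST00000372647.6|INDEL-5636-T-TACG|5\nFEGSTLNDLRLNSQIYTK\n"
)
for key, peptides in compare_runs(run_a, run_b).items():
    print(key, sorted(peptides))
```

Comparing on sequence alone sidesteps the fact that the two runs label the same indel differently in the header.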
Lemme rerun the GVF and get back to this.
Running commit ced6b01
of main on
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0465.gencode.aa.tsv.gvf
still gave me the same result, with the three peptides only. The rest of the peptides that you listed are not produced in the fasta... Can you please double check this XD
I updated the format of the table. Would something like this
CPCG0465 ENST00000252050.9 INDEL-5638-C-CGA 24
CPCG0465 ENST00000252050.9 INDEL-5638-C-CGA|INDEL-5641-AACT-A 3
CPCG0465 ENST00000252050.9 INDEL-5641-AACT-A 3
CPCG0396 ENST00000262494.12 INDEL-152341-C-CGA 4
CPCG0396 ENST00000262494.12 INDEL-152345-CAA-C 5
CPCG0396 ENST00000314675.11 INDEL-35965-GTCCCGGTCCCGGCCCCAGTCCCGGTCCCGGTCCCGGCCCCAGTCCCGGTCCCGGTCCCGGCCCCAGTCCCTGTCCTGG-G 3
Be easier to investigate?
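A table in this shape could in principle be derived straight from the FASTA headers. The sketch below assumes the header layout seen in this thread: one or more space-separated entries per header, each of the form `TRANSCRIPT|VARIANT[|VARIANT...]|INDEX`. The helper names are hypothetical.

```python
from collections import Counter


def parse_header(header):
    """Split a FASTA header into (transcript, variant label, index) tuples."""
    entries = []
    for entry in header.lstrip(">").split():
        # First field is the transcript, last is the peptide index,
        # everything in between is one or more variant IDs.
        transcript, *variants, index = entry.split("|")
        entries.append((transcript, "|".join(variants), int(index)))
    return entries


def count_peptides(headers, sample):
    """Count peptides per (sample, transcript, variant combination)."""
    counts = Counter()
    for header in headers:
        for transcript, variants, _ in parse_header(header):
            counts[(sample, transcript, variants)] += 1
    return counts


headers = [
    ">ENST00000252050.9|INDEL-5633-ACC-A|3 ENST00000372647.6|INDEL-5633-ACC-A|3",
    ">ENST00000252050.9|INDEL-5636-T-TACG|4 ENST00000372647.6|INDEL-5636-T-TACG|4",
    ">ENST00000252050.9|INDEL-5636-T-TACG|5 ENST00000372647.6|INDEL-5636-T-TACG|5",
]
for (sample, transcript, variants), n in sorted(count_peptides(headers, "CPCG0465").items()):
    print(sample, transcript, variants, n)
```

Running the same count on the old and new FASTA files and diffing the two tables would show which transcript and variant combinations lost peptides.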
ENST00000360004.5 is HLA lol. At least we didn't ignore HLA entirely, though it is not really useful either way...
This has been fixed in 22b4061 . Some variants were not incorporated into the TVG at all in some cases. This was a big one..
Will look at others shortly.
Seems to be fixed! Could you verify?
Both of these example cases were fixed!!! Moving on to the rest.
For both cases, the new three-frame algorithm actually produces 2 extra peptides compared to the old algorithm XD The old algo seems to have had a bug that missed miscleavages XD
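For anyone unsure what the miscleavage peptides look like, here is a rough sketch of enumerating them. It assumes the common trypsin rule (cut after K or R unless the next residue is P) and is not moPepGen's actual cleavage code; `tryptic_sites` and `digest` are illustrative names.

```python
def tryptic_sites(seq):
    """Positions after K or R where trypsin would cut (not before P)."""
    return [
        i + 1
        for i, aa in enumerate(seq[:-1])
        if aa in "KR" and seq[i + 1] != "P"
    ]


def digest(seq, missed_cleavages=2):
    """Enumerate peptides allowing up to `missed_cleavages` skipped sites."""
    bounds = [0] + tryptic_sites(seq) + [len(seq)]
    last = len(bounds) - 1
    peptides = set()
    for i in range(last):
        # Each extra fragment joined onto the first one is one missed cleavage
        for j in range(i + 1, min(i + 1 + missed_cleavages, last) + 1):
            peptides.add(seq[bounds[i]:bounds[j]])
    return peptides


seq = "LSPSKQRPRPLRQRPSPPR"
print(sorted(digest(seq, missed_cleavages=0)))
print(sorted(digest(seq, missed_cleavages=2)))
```

With no missed cleavages this sequence yields LSPSK, QRPRPLR and QRPSPPR. Allowing up to two also yields LSPSKQRPRPLR, QRPRPLRQRPSPPR and the full LSPSKQRPRPLRQRPSPPR, all three of which appear in the 24-peptide list earlier in the thread.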
Good news, out of the 13 samples that had this missing variant problem, 5 are fully resolved. I'll post the 8 remaining cases here for case by case staring lol.
Case 1:
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0233.gencode.aa.tsv.gvf
ENST00000314675.11
ENST00000357089.8
ENST00000374217.6
ENST00000374221.7
ENST00000374222.5
ENST00000374223.5
In these transcripts for this sample, the new three-frame algorithm produces peptides that do not overlap at all with those from the old algorithm...
These transcripts are essentially all isoforms of the same gene, and the following variants don't produce anything in the three frame algorithm (unique peptides were produced by these variants in the old algorithm).
INDEL-35971-GTCCCGG-G|INDEL-35987-CGG-C
INDEL-35971-GTCCCGG-G|INDEL-35987-CGGTCCCGG-C
INDEL-35987-CGG-C
INDEL-35987-CGG-C|INDEL-35971-GTCCCGG-G
INDEL-35987-CGGTCCCGG-C
INDEL-35987-CGGTCCCGG-C|INDEL-35971-GTCCCGG-G
The variant below is the only variant that produces peptides in the new algorithm, but for some reason the 4 peptides it produces are completely different from the peptides that the old algorithm used to produce for this variant lmao.
INDEL-35958-TGTCCCGGTCCCGGTCCCGGCCCCAGTCC-T
similarly
ENST00000567390.6
is missing these variants
INDEL-5298-GGAGCTACGGGATCAGGAG-G
INDEL-5303-TAC-T
INDEL-5556-GAAGCAGGAGGAGCAGATGGGG-G
but produces different peptides from this variant
INDEL-5286-ACGGGATCAGGAGGAGC-A
Case 2:
Almost identical to above
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0235.gencode.aa.tsv.gvf
ENST00000314675.11
ENST00000357089.8
ENST00000374217.6
ENST00000374221.7
ENST00000374222.5
ENST00000374223.5
Missing these variants
INDEL-35971-GTCCCGG-G|INDEL-35987-CGG-C
INDEL-35971-GTCCCGG-G|INDEL-35987-CGGTCCCGG-C
INDEL-35987-CGG-C|INDEL-35971-GTCCCGG-G
INDEL-35987-CGGTCCCGG-C|INDEL-35971-GTCCCGG-G
ENST00000567390.6
missing INDEL-5805-GGGGGAGCAGATG-G
Case 3:
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0267.gencode.aa.tsv.gvf
Also almost identical
Case 4:
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0333.gencode.aa.tsv.gvf
Involves the same transcripts
Case 5:
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0396.gencode.aa.tsv.gvf
Involves the same transcripts
Case 6 and 7:
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0336.gencode.aa.tsv.gvf
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0346.gencode.aa.tsv.gvf
Involves transcript ENST00000376767.7
variants
INDEL-284589-GAATGA-G
INDEL-284592-T-TG
INDEL-284594-A-AATGG
INDEL-284594-A-AATGG|INDEL-284592-T-TG
Only INDEL-284579-AAATGGAATGGAATGAAATGG-A
produces a peptide in this transcript in both samples; the peptide produced is the same as the one produced in the old algo.
Case 8:
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0366.gencode.aa.tsv.gvf
the following set of isoforms:
ENST00000430170.6
ENST00000445164.6
ENST00000524993.6
ENST00000526090.1
ENST00000528626.5
Missing the following set of variants
INDEL-10847-CAGGCATCTCCAGCCC-C
INDEL-10855-TCC-T
This INDEL-10841-CCAGCCCAGGCATC-C
produces peptides in ENST00000430170.6
in the new algorithm. But the 4 peptides produced are different from the old algo on this variant.
Case 1 fixed in 83588ee . Will test others tomorrow.
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0267.gencode.aa.tsv.gvf
not fixed yet due to #167
Otherwise all other cases are resolved!!! Closing this issue to defer things to #167
Not really sure how to approach this but the full table is here
/hot/users/yiyangliu/MoPepGen/Variant/VEP/gencode.aa/gindel/2021-10-16_3f_missing_variants.txt
Basically these are transcript + variant combinations that no longer produce a peptide in the three-frame algorithm. I expect quite a few to be stop-lost or in the first 3 bases of the start codon. But some of the variants that used to produce peptides in combinations of indels might be worth looking into?
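One way to narrow the table down to the indel combinations is to split rows on whether the variant label contains more than one variant. The sketch below assumes each row has at least three whitespace-separated columns (sample, transcript, variant label with `|` joining combined variants). That layout is a guess based on the tables pasted in this thread, not a confirmed description of the file, and the example rows are assembled from Case 1 purely for illustration.

```python
from collections import Counter


def split_missing_rows(lines):
    """Separate single-variant rows from multi-indel combination rows."""
    single, combined = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        sample, transcript, label = fields[:3]
        variants = tuple(label.split("|"))
        row = (sample, transcript, variants)
        (combined if len(variants) > 1 else single).append(row)
    return single, combined


def combos_per_transcript(combined):
    """Tally how many missing combinations each transcript has."""
    return Counter(transcript for _, transcript, _ in combined)


rows = [
    "CPCG0233 ENST00000314675.11 INDEL-35987-CGG-C",
    "CPCG0233 ENST00000314675.11 INDEL-35971-GTCCCGG-G|INDEL-35987-CGG-C",
]
single, combined = split_missing_rows(rows)
print(len(single), len(combined), combos_per_transcript(combined))
```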