uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0

transcript + variant combination missing from three frame algorithm (coding germline indel) #165

Closed lydiayliu closed 3 years ago

lydiayliu commented 3 years ago

Not really sure how to approach this but the full table is here

/hot/users/yiyangliu/MoPepGen/Variant/VEP/gencode.aa/gindel/2021-10-16_3f_missing_variants.txt

Basically these are transcript + variant combinations that no longer produce a peptide in the three frame algorithm. I expect quite a few to be stop-lost variants or to fall in the first 3 bases of the start codon. But some of the variants that used to result in combinations of indels might be worth looking into?
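A minimal sketch of how such a list of missing combinations could be derived, assuming both algorithms write FASTA headers of the form `>TRANSCRIPT|VARIANT[|VARIANT...]|INDEX`, with several entries separated by whitespace on one header line. The function name and the pairing of the example headers into "old" and "new" runs are illustrative, not taken from moPepGen:

```python
def header_keys(fasta_lines):
    """Collect (transcript, variant_label) pairs from FASTA header lines.

    Assumed header layout: '>TRANSCRIPT|VARIANT[|VARIANT...]|INDEX',
    possibly with several such entries separated by whitespace.
    """
    keys = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        for entry in line[1:].split():
            fields = entry.split("|")
            if len(fields) < 3:
                continue  # does not match the assumed layout
            keys.add((fields[0], "|".join(fields[1:-1])))
    return keys


# Illustrative inputs only: which header came from which run is made up.
old = [">ENST00000372647.6|INDEL-5638-C-CGA|28", "KRLSPSK",
       ">ENST00000372647.6|INDEL-5636-T-TACG|4", "FEGSTLNDLR"]
new = [">ENST00000252050.9|INDEL-5636-T-TACG|4 "
       "ENST00000372647.6|INDEL-5636-T-TACG|4", "FEGSTLNDLR"]

# Combinations present in the old output but absent from the new one.
missing = header_keys(old) - header_keys(new)
print(sorted(missing))
```

A set difference only says that a combination disappeared, not why; stop-lost and start-codon variants would still need to be filtered out separately.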

lydiayliu commented 3 years ago

For example, can't really tell why

CPCG0465.gencode.aa.tsv.gvf:ENSG00000112659.14 5638 INDEL-5638-C-CGA C CGA . . TRANSCRIPT_ID=ENST00000372647.6;GENOMIC_POSITION=chr6:43187821-43187822;GENE_SYMBOL=CUL9

no longer produces variants.
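For reference when staring at records like the one above, here is a rough sketch of splitting a GVF data line into named fields. The column names for the two `.` columns (VCF-style QUAL and FILTER) are an assumption, as is the helper name `parse_gvf_record`; the record passed in is the one quoted above without the grep filename prefix:

```python
def parse_gvf_record(line):
    """Split one whitespace-separated GVF data line into named fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    # Columns 6 and 7 are '.' in the thread; QUAL/FILTER is an assumption.
    gene_id, position, variant_id, ref, alt, _qual, _filter, info = fields
    attrs = dict(item.split("=", 1) for item in info.split(";"))
    length_change = len(alt) - len(ref)
    return {
        "gene_id": gene_id,
        "position": int(position),
        "variant_id": variant_id,
        "ref": ref,
        "alt": alt,
        "transcript_id": attrs["TRANSCRIPT_ID"],
        "gene_symbol": attrs.get("GENE_SYMBOL"),
        "length_change": length_change,
        # A net change that is not a multiple of 3 shifts the reading frame.
        "frameshift": length_change % 3 != 0,
    }


record = parse_gvf_record(
    "ENSG00000112659.14 5638 INDEL-5638-C-CGA C CGA . . "
    "TRANSCRIPT_ID=ENST00000372647.6;"
    "GENOMIC_POSITION=chr6:43187821-43187822;GENE_SYMBOL=CUL9"
)
print(record["variant_id"], record["length_change"], record["frameshift"])
```

For this record the insertion adds 2 bases, so it is a frameshift, which is why a long run of downstream peptides would be expected from it.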

The problem does seem to concentrate on a very limited number of 26 transcripts (and is probably even smaller considering the isoforms):

      1 ENST00000373563.9
      1 ENST00000409196.7
      1 ENST00000409451.7
      1 ENST00000409480.5
      1 ENST00000409547.5
      1 ENST00000629305.2
      2 ENST00000509479.6
      2 ENST00000644946.1
      5 ENST00000319555.8
      7 ENST00000430170.6
      7 ENST00000445164.6
      7 ENST00000524993.6
      7 ENST00000526090.1
      7 ENST00000528626.5
      9 ENST00000262494.12
      9 ENST00000376767.7
     13 ENST00000567390.6
     30 ENST00000252050.9
     30 ENST00000372647.6
     37 ENST00000374223.5
     41 ENST00000357089.8
     41 ENST00000374217.6
     41 ENST00000374221.7
     41 ENST00000374222.5
     56 ENST00000314675.11
    192 ENST00000360004.5
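
A tally like the one above, plus a per-gene collapse for the isoform question, could be sketched as follows. The `missing` list is a toy subset chosen for illustration and `gene_of` is a hypothetical lookup that would be filled from the `TRANSCRIPT_ID` and `GENE_SYMBOL` attributes in the GVF:

```python
from collections import Counter

# Toy subset of (transcript, variant) pairs; not the full table.
missing = [
    ("ENST00000360004.5", "INDEL-5526-GGTGCGG-G"),
    ("ENST00000360004.5", "INDEL-5664-G-GGA"),
    ("ENST00000372647.6", "INDEL-5638-C-CGA"),
]
# Hypothetical transcript -> gene symbol lookup built from the GVF attributes.
gene_of = {"ENST00000360004.5": "HLA-DRB1", "ENST00000372647.6": "CUL9"}

per_transcript = Counter(transcript for transcript, _ in missing)
per_gene = Counter(gene_of[transcript] for transcript, _ in missing)

# Same layout as `sort | uniq -c`, ordered by ascending count.
for transcript, n in sorted(per_transcript.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{n:7d} {transcript}")
for gene, n in sorted(per_gene.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{n:7d} {gene}")
```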

Maybe a reverse approach can be taken?

lydiayliu commented 3 years ago

ENST00000360004.5 is HLA lol. At least we didn't ignore HLA entirely, though it is not really useful either way...

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0339.gencode.aa.tsv.gvf
ENSG00000196126.11      5493    INDEL-5493-G-GTAT       G       GTAT    .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584355-32584356;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5494    INDEL-5494-AGGG-A       AGGG    A       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584352-32584354;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5526    INDEL-5526-GGTGCGG-G    GGTGCGG G       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584317-32584322;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5664    INDEL-5664-G-GGA        G       GGA     .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584184-32584185;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5665    INDEL-5665-C-CGG        C       CGG     .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584183-32584184;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5670    INDEL-5670-GC-G GC      G       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584178;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5670    INDEL-5670-GCG-G        GCG     G       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584177-32584178;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5677    INDEL-5677-GC-G GC      G       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584171;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      8223    INDEL-8223-C-CA C       CA      .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32581625-32581626;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      8226    INDEL-8226-AG-A AG      A       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32581622;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      8280    INDEL-8280-A-ATG        A       ATG     .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32581568-32581569;GENE_SYMBOL=HLA-DRB1

Everything except the first two indels is ignored, is that what we decided?

ENSG00000196126.11      5493    INDEL-5493-G-GTAT       G       GTAT    .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584355-32584356;GENE_SYMBOL=HLA-DRB1
ENSG00000196126.11      5494    INDEL-5494-AGGG-A       AGGG    A       .       .       TRANSCRIPT_ID=ENST00000360004.5;GENOMIC_POSITION=chr6:32584352-32584354;GENE_SYMBOL=HLA-DRB1

zhuchcn commented 3 years ago

CPCG0465.gencode.aa.tsv.gvf:ENSG00000112659.14 5638 INDEL-5638-C-CGA C CGA . . TRANSCRIPT_ID=ENST00000372647.6;GENOMIC_POSITION=chr6:43187821-43187822;GENE_SYMBOL=CUL9

I just ran it with the downsampled reference using only this single variant, and it actually produces 24 peptides, listed below. Could it be that we now only report the first INDEL, so this one is omitted from the FASTA header?

>ENST00000372647.6|INDEL-5638-C-CGA|28
KRLSPSK
>ENST00000372647.6|INDEL-5638-C-CGA|42
VILSCLTSFW
>ENST00000372647.6|INDEL-5638-C-CGA|37
QRPSPPR
>ENST00000372647.6|INDEL-5638-C-CGA|22
SPSRRPASQRK
>ENST00000372647.6|INDEL-5638-C-CGA|21
SPSRRPASQR
>ENST00000372647.6|INDEL-5638-C-CGA|4
FEGSTLNDLR
>ENST00000372647.6|INDEL-5638-C-CGA|30
RLSPSKQRPRPLR
>ENST00000372647.6|INDEL-5638-C-CGA|33
LSPSKQRPRPLRQRPSPPR
>ENST00000372647.6|INDEL-5638-C-CGA|41
QGPRPPWHRVILSCLTSFW
>ENST00000372647.6|INDEL-5638-C-CGA|10
STPSMGCCLMNQAAR
>ENST00000372647.6|INDEL-5638-C-CGA|40
QGPRPPWHR
>ENST00000372647.6|INDEL-5638-C-CGA|19
RSPSRRPASQR
>ENST00000372647.6|INDEL-5638-C-CGA|35
QRPRPLRQRPSPPR
>ENST00000372647.6|INDEL-5638-C-CGA|25
RPASQRKR
>ENST00000372647.6|INDEL-5638-C-CGA|8
STPRSTPSMGCCLMNQAAR
>ENST00000372647.6|INDEL-5638-C-CGA|24
RPASQRK
>ENST00000372647.6|INDEL-5638-C-CGA|5
FEGSTLNDLRSTPR
>ENST00000372647.6|INDEL-5638-C-CGA|34
QRPRPLR
>ENST00000372647.6|INDEL-5638-C-CGA|13
LLHEITPVPQIQK
>ENST00000372647.6|INDEL-5638-C-CGA|14
LLHEITPVPQIQKR
>ENST00000372647.6|INDEL-5638-C-CGA|32
LSPSKQRPRPLR
>ENST00000372647.6|INDEL-5638-C-CGA|15
LLHEITPVPQIQKRSPSR
>ENST00000372647.6|INDEL-5638-C-CGA|38
QRPSPPRQGPRPPWHR
>ENST00000372647.6|INDEL-5638-C-CGA|36
QRPRPLRQRPSPPRQGPRPPWHR

lydiayliu commented 3 years ago

Everything here should be recorded because peptides are missing, I don't care about the header as much... In my file, ENST00000372647.6|INDEL-5636-T-TACG only produced 3 peptides, as opposed to the many that the previous algorithm produced.

CPCG0465.gencode.aa.tsv.gvf.3f.fasta:>ENST00000252050.9|INDEL-5633-ACC-A|3 ENST00000372647.6|INDEL-5633-ACC-A|3
CPCG0465.gencode.aa.tsv.gvf.3f.fasta-FEGSTLNDAQLPDLHQVWAAV
--
CPCG0465.gencode.aa.tsv.gvf.3f.fasta:>ENST00000252050.9|INDEL-5636-T-TACG|4 ENST00000372647.6|INDEL-5636-T-TACG|4
CPCG0465.gencode.aa.tsv.gvf.3f.fasta-FEGSTLNDLR
--
CPCG0465.gencode.aa.tsv.gvf.3f.fasta:>ENST00000252050.9|INDEL-5636-T-TACG|5 ENST00000372647.6|INDEL-5636-T-TACG|5
CPCG0465.gencode.aa.tsv.gvf.3f.fasta-FEGSTLNDLRLNSQIYTK

Lemme rerun the GVF and get back to this.

lydiayliu commented 3 years ago

Running commit ced6b01 of main on

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0465.gencode.aa.tsv.gvf

still gave me the same result, with the three peptides only. The rest of the peptides that you listed are not produced in the fasta... Can you please double check this XD

lydiayliu commented 3 years ago

I updated the format of the table, would something like this

CPCG0465        ENST00000252050.9       INDEL-5638-C-CGA        24
CPCG0465        ENST00000252050.9       INDEL-5638-C-CGA|INDEL-5641-AACT-A      3
CPCG0465        ENST00000252050.9       INDEL-5641-AACT-A       3
CPCG0396        ENST00000262494.12      INDEL-152341-C-CGA      4
CPCG0396        ENST00000262494.12      INDEL-152345-CAA-C      5
CPCG0396        ENST00000314675.11      INDEL-35965-GTCCCGGTCCCGGCCCCAGTCCCGGTCCCGGTCCCGGCCCCAGTCCCGGTCCCGGTCCCGGCCCCAGTCCCTGTCCTGG-G   3

be easier to investigate?
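
A table in this shape could be produced from a FASTA file roughly as sketched below, assuming the last column is the number of peptide headers per transcript and variant label, and that headers look like `>TRANSCRIPT|VARIANT[|VARIANT...]|INDEX`. Both points are assumptions; `peptide_counts` is a made-up helper name:

```python
from collections import Counter


def peptide_counts(fasta_lines):
    """Count header entries per (transcript, variant_label)."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        for entry in line[1:].split():
            # First field is the transcript, last is the peptide index,
            # everything in between is the variant label.
            transcript, *variants, _index = entry.split("|")
            counts[(transcript, "|".join(variants))] += 1
    return counts


# Headers copied from the grep output earlier in this thread.
fasta = [
    ">ENST00000252050.9|INDEL-5633-ACC-A|3 ENST00000372647.6|INDEL-5633-ACC-A|3",
    "FEGSTLNDAQLPDLHQVWAAV",
    ">ENST00000252050.9|INDEL-5636-T-TACG|4 ENST00000372647.6|INDEL-5636-T-TACG|4",
    "FEGSTLNDLR",
    ">ENST00000252050.9|INDEL-5636-T-TACG|5 ENST00000372647.6|INDEL-5636-T-TACG|5",
    "FEGSTLNDLRLNSQIYTK",
]

counts = peptide_counts(fasta)
for (transcript, label), n in sorted(counts.items()):
    print("\t".join(["CPCG0465", transcript, label, str(n)]))
```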

zhuchcn commented 3 years ago

ENST00000360004.5 is HLA lol. At least we didn't ignore HLA entirely, though it is not really useful either way...

This has been fixed in 22b4061. Some variants were not incorporated into the TVG at all in some cases. This was a big one.

Will look at others shortly.

zhuchcn commented 3 years ago

Seems to be fixed! Could you verify?

lydiayliu commented 3 years ago

Both of these example cases were fixed!!! Moving on to the rest.

For both cases, the new three frame actually produces 2 extra peptides compared to the old algorithm XD The old algo seems to have had a bug that missed miscleavages XD
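
To illustrate what missed cleavages add, here is a simplified sketch of tryptic digestion (cleave after K or R unless the next residue is P) with a configurable number of missed cleavages. This is a textbook rule written for illustration, not moPepGen's implementation, and it applies no length or mass filters:

```python
import re


def tryptic_peptides(protein, max_missed=2):
    """Return all peptides from a simplified trypsin digest.

    Cleavage happens after K or R unless followed by P. Each peptide may
    span up to `max_missed` uncut cleavage sites.
    """
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    # Fragment boundaries: start, every internal cleavage site, end.
    bounds = [0] + [s for s in sites if s < len(protein)] + [len(protein)]
    peptides = set()
    for i in range(len(bounds) - 1):
        last = min(i + 2 + max_missed, len(bounds))
        for j in range(i + 1, last):
            peptides.add(protein[bounds[i]:bounds[j]])
    return peptides


# Longest form of one of the peptides listed earlier in this thread.
seq = "LLHEITPVPQIQKRSPSR"
for missed in (0, 1, 2):
    print(missed, sorted(tryptic_peptides(seq, max_missed=missed)))
```

With no missed cleavages only `LLHEITPVPQIQK`, `R` and `SPSR` come out; `LLHEITPVPQIQKR` needs one missed cleavage and `LLHEITPVPQIQKRSPSR` needs two, so an algorithm that drops miscleaved forms would report fewer peptides for the same variant.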

lydiayliu commented 3 years ago

Good news, out of the 13 samples that had this missing variant problem, 5 are fully resolved. I'll post the 8 remaining cases here for case by case staring lol.

Case 1:

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0233.gencode.aa.tsv.gvf
ENST00000314675.11
ENST00000357089.8
ENST00000374217.6
ENST00000374221.7
ENST00000374222.5
ENST00000374223.5

In these transcripts for this sample, the peptides produced by the new three frame algorithm do not overlap at all with those from the old algorithm...

These transcripts are essentially all isoforms of the same gene, and the following variants don't produce anything in the three frame algorithm (unique peptides were produced by these variants in the old algorithm).

INDEL-35971-GTCCCGG-G|INDEL-35987-CGG-C
INDEL-35971-GTCCCGG-G|INDEL-35987-CGGTCCCGG-C
INDEL-35987-CGG-C
INDEL-35987-CGG-C|INDEL-35971-GTCCCGG-G
INDEL-35987-CGGTCCCGG-C
INDEL-35987-CGGTCCCGG-C|INDEL-35971-GTCCCGG-G
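
Several of the labels above differ only in the order of the two indels. If the order carries no meaning (an assumption; the thread does not say), the labels could be canonicalized before comparing old and new output, for example by sorting on position. `canonical_label` is a made-up helper name:

```python
def canonical_label(label):
    """Sort the variants in a '|'-joined label by position, then by ID."""

    def position(variant_id):
        # 'INDEL-35971-GTCCCGG-G' -> 35971
        return int(variant_id.split("-")[1])

    parts = sorted(set(label.split("|")), key=lambda v: (position(v), v))
    return "|".join(parts)


labels = [
    "INDEL-35971-GTCCCGG-G|INDEL-35987-CGG-C",
    "INDEL-35971-GTCCCGG-G|INDEL-35987-CGGTCCCGG-C",
    "INDEL-35987-CGG-C",
    "INDEL-35987-CGG-C|INDEL-35971-GTCCCGG-G",
    "INDEL-35987-CGGTCCCGG-C",
    "INDEL-35987-CGGTCCCGG-C|INDEL-35971-GTCCCGG-G",
]

unique = sorted({canonical_label(label) for label in labels})
print(len(unique))
for label in unique:
    print(label)
```

Under that assumption the six labels reduce to four distinct combinations.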

The variant below is the only variant that produces peptides in the new algorithm, but for some reason the 4 peptides it produces are completely different from the peptides that the old algorithm used to produce for this variant lmao.

INDEL-35958-TGTCCCGGTCCCGGTCCCGGCCCCAGTCC-T

similarly ENST00000567390.6 is missing these variants

INDEL-5298-GGAGCTACGGGATCAGGAG-G
INDEL-5303-TAC-T
INDEL-5556-GAAGCAGGAGGAGCAGATGGGG-G

but produces different peptides from this variant

INDEL-5286-ACGGGATCAGGAGGAGC-A

lydiayliu commented 3 years ago

Case 2:

Almost identical to above

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0235.gencode.aa.tsv.gvf
ENST00000314675.11
ENST00000357089.8
ENST00000374217.6
ENST00000374221.7
ENST00000374222.5
ENST00000374223.5

Missing these variants

INDEL-35971-GTCCCGG-G|INDEL-35987-CGG-C
INDEL-35971-GTCCCGG-G|INDEL-35987-CGGTCCCGG-C
INDEL-35987-CGG-C|INDEL-35971-GTCCCGG-G
INDEL-35987-CGGTCCCGG-C|INDEL-35971-GTCCCGG-G

ENST00000567390.6 missing INDEL-5805-GGGGGAGCAGATG-G

Case 3:

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0267.gencode.aa.tsv.gvf

Also almost identical.

Case 4:

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0333.gencode.aa.tsv.gvf

Involves the same transcripts.

Case 5:

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0396.gencode.aa.tsv.gvf

Involves the same transcripts.

lydiayliu commented 3 years ago

Case 6 and 7

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0336.gencode.aa.tsv.gvf
/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0346.gencode.aa.tsv.gvf

Involves transcript ENST00000376767.7 variants

INDEL-284589-GAATGA-G
INDEL-284592-T-TG
INDEL-284594-A-AATGG
INDEL-284594-A-AATGG|INDEL-284592-T-TG

Only INDEL-284579-AAATGGAATGGAATGAAATGG-A produces a peptide in this transcript in both samples; the peptide produced is the same as the one produced in the old algo.

lydiayliu commented 3 years ago

Case 8:

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0366.gencode.aa.tsv.gvf

the following set of isoforms:

ENST00000430170.6
ENST00000445164.6
ENST00000524993.6
ENST00000526090.1
ENST00000528626.5

Missing the following set of variants

INDEL-10847-CAGGCATCTCCAGCCC-C
INDEL-10855-TCC-T

This variant, INDEL-10841-CCAGCCCAGGCATC-C, produces peptides in ENST00000430170.6 in the new algorithm. But the 4 peptides produced are different from those the old algo produced for this variant.
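
For the "same variant, different peptides" cases, a per-key comparison of peptide sets can make the discrepancy concrete. The sketch below assumes single-line sequences and headers of the form `>TRANSCRIPT|VARIANT[|VARIANT...]|INDEX`; the transcript ID, variant ID and sequences in the example are invented placeholders, not real output:

```python
def peptides_by_key(fasta_lines):
    """Map (transcript, variant_label) -> set of peptide sequences."""
    result = {}
    keys = []
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            keys = []
            for entry in line[1:].split():
                fields = entry.split("|")
                keys.append((fields[0], "|".join(fields[1:-1])))
        elif line:
            # Sequence line: attach it to every entry of the last header.
            for key in keys:
                result.setdefault(key, set()).add(line)
    return result


def compare(old, new):
    """Split each key's peptides into shared, old-only and new-only sets."""
    report = {}
    for key in sorted(old.keys() | new.keys()):
        a, b = old.get(key, set()), new.get(key, set())
        report[key] = {"shared": a & b, "old_only": a - b, "new_only": b - a}
    return report


# Invented placeholder data for illustration only.
old_fasta = [">TX1|INDEL-10-A-AT|1", "AAAK", ">TX1|INDEL-10-A-AT|2", "CCCR"]
new_fasta = [">TX1|INDEL-10-A-AT|1", "DDDK", ">TX1|INDEL-10-A-AT|2", "EEER"]

report = compare(peptides_by_key(old_fasta), peptides_by_key(new_fasta))
for key, sets in report.items():
    print(key, {name: sorted(values) for name, values in sets.items()})
```

A key with an empty `shared` set and non-empty `old_only` and `new_only` sets corresponds to the situation described here, where a variant still produces peptides but none of them match the old output.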

zhuchcn commented 3 years ago

Case 1 fixed in 83588ee. Will test others tomorrow.

lydiayliu commented 3 years ago

/hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode.aa/gindel/CPCG0267.gencode.aa.tsv.gvf not fixed yet due to #167

Otherwise all other cases are resolved!!! Closing this issue to defer things to #167