openvax / isovar

Assembly of RNA reads to determine the effect of a cancer mutation on protein sequence
Apache License 2.0
25 stars 13 forks source link

Unexpectedly short protein sequences #90

Open iskandr opened 6 years ago

iskandr commented 6 years ago

Issue reported by email from Scott Brown:

I have a question regarding the protein sequence output from Isovar. I have seen multiple cases where the protein that is output is shorter than the requested ‘--protein-sequence-length’, but when checking the locus in IGV there appears to be sufficient mutant read support covering both sides of the mutation site. Is there a parameter that is limiting the length of the protein, or is there some other reason that the returned protein is not matching the requested length?

...

It appears to be finding two possible proteins based on two transcripts, and despite having equal read support, choosing the shorter of the two (it outputs MYPTAALIVNLRPNTF, whereas manual inspection suggests to me that it should output AAGAVEWMYPTAALIVNLRPNTF).

Separately, I found another case where the somatic mutation was the very last nucleotide of an exon, and Isovar output the peptide based on two reads which went directly into the intron rather than nearly 20 mutant reads which jumped to the next exon correctly. I suspect this is because only the intronic reads were recruited since Isovar takes reads that map to the region +/-1 nucleotide of the mutation? In this case, right at the edge of the exon, it seems like this may be undesired behaviour, at least when there are more reads containing the mutation which skip the intron and read into the next exon?

Aside from those two cases, I think I have been able to explain all shorter-than-expected proteins due to insufficient numbers of high quality mapped reads.

STDOUT from isovar invocation:

$ PYTHONPATH="/projects/sbrown_prj/bin/isovar/" python2 /projects/sbrown_prj/bin/isovar/script/isovar-protein-sequences.py --vcf POG616_17_81042903_snv.pvcf --bam /projects/POG/POG_data/POG616/rna/biop1_t_P01608/reposition/hg19a_jg-e69_bwa-mem-0.7.6a/C9J57ANXX_8_CCGCAA_withJunctionsOnGenome_dupsFlagged.bam --min-alt-rna-reads 2 --protein-sequence-length 23 --output POG616_17_81042903_isovar.csv

2017-10-25 12:24:58,904 - isovar.allele_reads:200 - INFO - Gathering reads for Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37')
2017-10-25 12:24:58,918 - isovar.allele_reads:208 - INFO - Gathering variant reads for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') (chromosome = 17, gene names = [u'METRNL'])
2017-10-25 12:24:58,947 - isovar.locus_reads:314 - INFO - Found 486 reads overlapping locus 17: 81042902-81042904
2017-10-25 12:24:58,947 - isovar.variant_sequences:422 - INFO - Initial pool of 28 variant sequences (min length=39, max length=72)
2017-10-25 12:24:58,948 - isovar.assembly:140 - INFO - Collapsed 28 -> 19 sequences
2017-10-25 12:24:58,954 - isovar.variant_sequences:440 - INFO - After overlap assembly: 3 variant sequences (min length=72, max length=72)
2017-10-25 12:24:58,955 - isovar.variant_sequences:207 - INFO - Coverage: [14 14 14 14 14 14 14 14 14 14 15 15 15 16 17 19 19 19 20 20 21 21 22 23 23
 24 25 25 25 25 26 27 27 28 28 28 28 28 28 27 27 27 27 26 25 25 25 25 25 24
 23 23 23 23 22 20 19 19 18 18 17 16 16 16 16 16 16 16 16 16 15 15] (len=72)
2017-10-25 12:24:58,955 - isovar.variant_sequences:207 - INFO - Coverage: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 4 5 5 5 5 6 7 7 8 8 8 8
 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7] (len=72)
2017-10-25 12:24:58,956 - isovar.variant_sequences:207 - INFO - Coverage: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] (len=72)
2017-10-25 12:24:58,956 - isovar.variant_sequences:330 - INFO - Kept 1/3 variant sequences after read coverage trimming to >=2x
2017-10-25 12:24:58,956 - isovar.variant_sequences:456 - INFO - After coverage & length filtering: 1 variant sequences (min length=72, max length=72)
2017-10-25 12:24:59,500 - pyensembl.sequence_data:95 - INFO - Loaded sequence dictionary from /home/sbrown/.cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.cdna.all.fa.gz.pickle
2017-10-25 12:24:59,795 - pyensembl.sequence_data:95 - INFO - Loaded sequence dictionary from /home/sbrown/.cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.pep.all.fa.gz.pickle
2017-10-25 12:24:59,859 - isovar.effect_prediction:65 - INFO - Predicted total 3 effects for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37')
2017-10-25 12:24:59,859 - isovar.effect_prediction:73 - INFO - Keeping 3/3 effects which affect protein coding sequence for Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37'): <EffectCollection with 3 elements>
  -- Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-001, transcript_id=ENST00000320095, effect_description=p.G87A)
  -- Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-005, transcript_id=ENST00000570778, effect_description=p.G5A)
  -- Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-003, transcript_id=ENST00000571814, effect_description=p.G5A)
2017-10-25 12:24:59,860 - isovar.effect_prediction:84 - INFO - Keeping 3 effects with predictable AA sequences for Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37'): [Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-001, transcript_id=ENST00000320095, effect_description=p.G87A), Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-005, transcript_id=ENST00000570778, effect_description=p.G5A), Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-003, transcript_id=ENST00000571814, effect_description=p.G5A)]
2017-10-25 12:24:59,860 - isovar.reference_sequence_key:113 - INFO - Interbase offset range on METRNL-001 for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') = 384:385
2017-10-25 12:24:59,860 - isovar.reference_sequence_key:113 - INFO - Interbase offset range on METRNL-005 for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') = 120:121
2017-10-25 12:24:59,860 - isovar.reference_sequence_key:113 - INFO - Interbase offset range on METRNL-003 for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') = 954:955
2017-10-25 12:24:59,860 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED]', cdna_alt='C', cdna_suffix='[REDACTED]', reference_prefix='[REDACTED]', reference_suffix='[REDACTED]', n_trimmed=0
2017-10-25 12:24:59,861 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED][REDACTED]', offset_to_first_complete_codon=2, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED]', reference_cdna_sequence_after_variant='[REDACTED]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-10-25 12:24:59,861 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED]', cdna_alt='C', cdna_suffix='[REDACTED]', reference_prefix='[REDACTED]', reference_suffix='[REDACTED]', n_trimmed=0
2017-10-25 12:24:59,861 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED][REDACTED]', offset_to_first_complete_codon=23, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED]', reference_cdna_sequence_after_variant='[REDACTED]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-10-25 12:24:59,861 - isovar.protein_sequences:293 - INFO - TranslationKey(amino_acids='MYPTAALIVNLRPNTF', variant_aa_interval_start=4, variant_aa_interval_end=5, ends_with_stop_codon=False, frameshift=False): 29 alt reads supporting protein sequence (gene names = set([u'METRNL']))
2017-10-25 12:24:59,861 - isovar.protein_sequences:305 - INFO - TranslationKey(amino_acids='MYPTAALIVNLRPNTF', variant_aa_interval_start=4, variant_aa_interval_end=5, ends_with_stop_codon=False, frameshift=False): protein sequence = MYPTAALIVNLRPNTF
2017-10-25 12:24:59,861 - isovar.protein_sequences:293 - INFO - TranslationKey(amino_acids='AAGAVEWMYPTAALIVNLRPNTF', variant_aa_interval_start=11, variant_aa_interval_end=12, ends_with_stop_codon=False, frameshift=False): 29 alt reads supporting protein sequence (gene names = set([u'METRNL']))
2017-10-25 12:24:59,861 - isovar.protein_sequences:305 - INFO - TranslationKey(amino_acids='AAGAVEWMYPTAALIVNLRPNTF', variant_aa_interval_start=11, variant_aa_interval_end=12, ends_with_stop_codon=False, frameshift=False): protein sequence = AAGAVEWMYPTAALIVNLRPNTF

It does seem like both AAGAVEWMYPTAALIVNLRPNTF and MYPTAALIVNLRPNTF have the same number of reads, maybe there's no logic for when there's a tie in coverage?

iskandr commented 6 years ago

The second problematic case discovered is due to https://github.com/hammerlab/isovar/issues/55 -- which should be fixed when we switch to interbase coordinates for gathering locus reads.

iskandr commented 6 years ago

Possible fix: include protein sequence length as the 3rd sorting criterion in https://github.com/hammerlab/isovar/blob/master/isovar/protein_sequences.py#L164

iskandr commented 6 years ago

@scottdbrown -- do you think including protein sequence length as part of the sorting criteria would fix your issue? Or, is it altogether unexpected for one of the returned sequences to be a subsequence of another?

iskandr commented 6 years ago

@scottdbrown -- you redacted the cDNA sequence lengths for the two translation keys but do you mind posting those here?

scottdbrown commented 6 years ago

Here are the lengths of the redacted sections:

2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=2, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=23, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)

@iskandr -- Yes, I think sorting by length would fix the issue - the protein sequence that was output was the start of the shorter of the two transcripts, which was entirely contained within the longer transcript. The reads covered the region upstream of the shorter transcript (which is part of the longer transcript).

iskandr commented 6 years ago

Can you post the full cDNA seqs? I'm curious why these coding sequences didn't just get merged.

Thanks!

On Mon, Nov 27, 2017 at 4:02 PM, Scott Brown notifications@github.com wrote:

Here are the lengths of the redacted sections:

2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0 2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=2, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0) 2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0 2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=23, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)

@iskandr https://github.com/iskandr -- Yes, I think sorting by length would fix the issue - the protein sequence that was output was the start of the shorter of the two transcripts, which was entirely contained within the longer transcript. The reads covered the region upstream of the shorter transcript (which is part of the longer transcript).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hammerlab/isovar/issues/90#issuecomment-347326987, or mute the thread https://github.com/notifications/unsubscribe-auth/AAC9OUlzumRb8CjkhGdCC_Sj0c9mrqiDks5s6yN9gaJpZM4QsHa1 .

scottdbrown commented 6 years ago

Sorry, due to privacy concerns with this sample, I'm not able to share any germline sequence. Line 1 and 2 appear to be identical to line 3 and 4 (including the sequence), aside from the "offset_to_first_complete_codon" value.

scottdbrown commented 6 years ago

Is there any other information I can provide that would help? I apologize for the inconvenience of not being able to provide the actual sequence.

scottdbrown commented 6 years ago

To get around this issue, I manually created the mutation of interest in a non-protected sequence file, and have attached all relevant files for you to hopefully be able to recreate the issue.

PYTHONPATH="/projects/sbrown_prj/bin/isovar/" python2 /projects/sbrown_prj/bin/isovar/script/isovar-protein-sequences.py --vcf COLO829_17_81042903_snv.pvcf --bam COLO829_17_81042903_reads_1mut.bam --min-alt-rna-reads 2 --protein-sequence-length 23 --output COLO829_17_81042903_isovar_180111.csv > COLO829_17_81042903_isovar_stdout_180111.txt

Input and output files can be found here: https://github.com/scottdbrown/isovar_COLO829_test