Open iskandr opened 6 years ago
The second problematic case discovered is due to https://github.com/hammerlab/isovar/issues/55 -- which should be fixed when we switch to interbase coordinates for gathering locus reads.
Possible fix: include protein sequence length as the 3rd sorting criterion in https://github.com/hammerlab/isovar/blob/master/isovar/protein_sequences.py#L164
@scottdbrown -- do you think including protein sequence length as part of the sorting criteria would fix your issue? Or, is it altogether unexpected for one of the returned sequences to be a subsequence of another?
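A minimal sketch of what a length tie-break could look like. The `Candidate` record and its field names are hypothetical stand-ins, not isovar's actual attributes, and the read counts are made-up illustrative values; only the two amino acid sequences come from this issue.

```python
from collections import namedtuple

# Hypothetical record type; isovar's real protein sequence objects may use
# different field names and different primary sorting criteria.
Candidate = namedtuple(
    "Candidate", ["amino_acids", "n_supporting_reads", "n_mismatches"])


def sort_key(candidate):
    # 1st: more supporting reads first (assumed criterion)
    # 2nd: fewer mismatches first (assumed criterion)
    # 3rd: longer protein sequence first (the proposed tie-break)
    return (
        -candidate.n_supporting_reads,
        candidate.n_mismatches,
        -len(candidate.amino_acids),
    )


# Illustrative values: both candidates tie on the first two criteria.
candidates = [
    Candidate("MYPTAALIVNLRPNTF", 2, 0),
    Candidate("AAGAVEWMYPTAALIVNLRPNTF", 2, 0),
]
ranked = sorted(candidates, key=sort_key)
print(ranked[0].amino_acids)
```

With the length term in the key, a tie on read support and mismatches resolves toward the longer sequence instead of depending on input order.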
@scottdbrown -- you redacted the cDNA sequence lengths for the two translation keys but do you mind posting those here?
Here are the lengths of the redacted sections:
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=2, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=23, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
@iskandr -- Yes, I think sorting by length would fix the issue - the protein sequence that was output was the start of the shorter of the two transcripts, which was entirely contained within the longer transcript. The reads covered the region upstream of the shorter transcript (which is part of the longer transcript).
Can you post the full cDNA seqs? I'm curious why these coding sequences didn't just get merged.
Thanks!
Sorry, due to privacy concerns with this sample, I'm not able to share any germline sequence. Lines 1 and 2 appear to be identical to lines 3 and 4 (including the sequence), aside from the "offset_to_first_complete_codon" value.
Is there any other information I can provide that would help? I apologize for the inconvenience of not being able to provide the actual sequence.
To get around this issue, I manually created the mutation of interest in a non-protected sequence file, and have attached all relevant files for you to hopefully be able to recreate the issue.
PYTHONPATH="/projects/sbrown_prj/bin/isovar/" python2 /projects/sbrown_prj/bin/isovar/script/isovar-protein-sequences.py --vcf COLO829_17_81042903_snv.pvcf --bam COLO829_17_81042903_reads_1mut.bam --min-alt-rna-reads 2 --protein-sequence-length 23 --output COLO829_17_81042903_isovar_180111.csv > COLO829_17_81042903_isovar_stdout_180111.txt
Input and output files can be found here: https://github.com/scottdbrown/isovar_COLO829_test
Issue reported by email from Scott Brown:
...
STDOUT from isovar invocation:
It does seem like both AAGAVEWMYPTAALIVNLRPNTF and MYPTAALIVNLRPNTF have the same number of reads; maybe there's no logic for when there's a tie in coverage?
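One way to check whether a returned sequence is wholly contained in another, sketched under the assumption that the results are available as plain amino acid strings. `drop_contained` is a hypothetical helper, not part of isovar.

```python
def drop_contained(sequences):
    """Keep only sequences that are not substrings of a longer kept sequence."""
    kept = []
    # Visit longest sequences first so that any containing sequence is
    # already in `kept` when its substrings are examined; the secondary
    # alphabetical key just makes the order deterministic.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# The two sequences reported in this issue.
reported = ["MYPTAALIVNLRPNTF", "AAGAVEWMYPTAALIVNLRPNTF"]
print(drop_contained(reported))
```

Here the 16-residue sequence is the tail of the 23-residue one, so only the longer sequence survives the filter.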