synonymous mutations and the predicted sequence

MariusGheorghe commented 4 years ago

Hi Sebastian,

Hope all is well. As you already know, we're using Arriba in our pipeline and the more gene fusions data we gather through processing, the more questions we raise :]

We see that for gene fusions that happen in the middle of the codon, but the reading frame is preserved, Arriba reports a mutated amino-acid (in lower case), but in some cases this change could be synonymous. For instance, we came across this situation:

predicted protein sequence: dPLQDFGFSVEKCSKQLKSNINIRFGIILRE
wild-type sequence (based on GTF annotation): ...LYQSC**D**PLQDFGFSVEKCSKQLKSNINIRFGIILREDIKELF...

so, in this case we see that d is considered mutated, but in the wild-type, at that specific position we got also a D. The question is, how should we interpret this case? Is it a bug or a feature in Arriba? :] Can you please clarify a bit?

I have a few other questions, so I apologize in advance if I should've created separate issues for them, but I do not want to spam you :D

is there any limit on the length of the reported predicted sequence? Like a hardcoded value?
it is unclear to us how stop codons are shown. A * is often used, but is it because it is an early stop? Or because it is a natural stop? And why is the star missing in other cases? We would expect it to be always present.
is it possible to have the whole transcript sequence output (at nucleotide level)? this is more on the same note with having the entire predicted sequence output that we have already discussed a while ago.

Thank you very much for your time and again, sorry for the amount of questions but we want to make sure we interpret results correctly. Looking forward to your answers.

Have a good day! Marius

suhrig commented 4 years ago

Hi Marius,

Arriba reports a mutated amino-acid (in lower case), but in some cases this change could be synonymous

Synonymous mutations should not be reported in lowercase. That's how the code is written. It's a bug, if you see silent mutations being written in lowercase. There is one exception where amino acids may be written in lowercase, although the fusion peptide is identical to the wild-type sequence, namely when an amino acid is split by the fusion junction, such that at least one base originates from the 5' gene and the other bases (1 or 2) originate from the 3' gene. Such cases can be identified by the fact that the amino acid is adjacent to the pipe symbol (|) in the peptide sequence. Strictly speaking, it is not clear which gene the amino acid should be attributed to and therefore, it is always declared as mutant. My guess is that is what happened in your example, am I right?

is there any limit on the length of the reported predicted sequence? Like a hardcoded value?

No, not really. The limit is probably identical to the limit of C++ vectors, i.e. 2^32 or 2^64. I have not tested such long sequences, but there is no explicit limit in the code. Why do you ask? Do you suspect to have a limit somewhere?

it is unclear to us how stop codons are shown.

The asterisk * is used for both natural stops and premature ones. It may be absent sometimes, because the peptide sequence is translated from the transcript sequence, which in turn is assembled from the supporting reads. If the supporting reads do not reach the end of the coding sequence, the stop codon is missing the transcript sequence and thus will not be part of the translated sequence. It's the same issue as in your last point:

is it possible to have the whole transcript sequence output (at nucleotide level)?

Yes, once I have implemented your related request (full peptide sequence), Arriba will report both the whole transcript sequence as well as the whole peptide sequence. The latter builds upon the former so this has to be done anyway. Sorry, I have not yet found the time to do this. It's one of two big items still on my TODO list.

Don't worry about the number of questions. They are quick and easy to answer.

Regards, Sebastian

MariusGheorghe commented 4 years ago

Hi Sebastian,

Thank you for your quick and detailed reply/clarification, as always.

...Strictly speaking, it is not clear which gene the amino acid should be attributed to and therefore, it is always declared as mutant. My guess is that is what happened in your example, am I right?

I understand. It makes sense. I would guess that is what happened in our case. I am giving you a bit more information regarding this reported fusion, maybe you can confirm:

entire predicted gene fusion peptide sequence:
ASSLQRHMLTHT|dPLQDFGFSVEKCSKQLKSNINIRFGIILRE
transcript sequence at nucleotide level:
GGGCCAGCTCCCTACAGAGGCACATGCTCACACACACTG|ATCCATTACAAGATTTTGGCTTTTCTGTTGAAAAGTGTTCCAAGCAATTAAAATCAAATATCAACATTAGATTTGGAATTATTCTGA___GAGAGGAC

so we do have G|AT around the breakpoint which corresponds to a D in codon codes.

Do you suspect to have a limit somewhere?

Not really, it was more just to make sure. I did see a lot of short predicted sequences from the 3' transcript, so I thought you've implemented some usual arbitrary 9-mer/12-mer limit on it.

Arriba will report both the whole transcript sequence as well as the whole peptide sequence. The latter builds upon the former so this has to be done anyway. Sorry, I have not yet found the time to do this. It's one of two big items still on my TODO list.

Looking forward to it. No pressure :stuck_out_tongue: Thanks once again for your time. Really appreciate it!

Marius

suhrig commented 4 years ago

maybe you can confirm

Yes, I can confirm. It's the situation I described.

MariusGheorghe commented 4 years ago

Perfect. Thanks a bunch.

I'm closing this issue

suhrig commented 4 years ago

is it possible to have the whole transcript sequence output (at nucleotide level)?

Hi Marius, this is now implemented in the develop branch. But you need to enable it using the (repurposed) parameter -I. Otherwise, the default behavior is still to output the sequence assembled from the supporting reads, because it is more accurate. When -I is used, Arriba tries to fill gaps in the transcript sequence using information from the assembly. The information copied from the assembly is wrapped in parentheses. In most cases, the full fusion transcript sequence can be reconstructed this way. Please beware that there are some exceptional cases, where the full reconstruction is not possible. You can easily tell from the sequence string whether full reconstruction was successful or not: If it starts with a caret sign (^) full reconstruction of the 5' end was successful; if it ends on a dollar sign ($), full reconstruction of the 3' end was successful. Since the peptide sequence builds on the transcript sequence, this enhancement implicitly also enables full reconstruction of the fusion peptide sequence.

MariusGheorghe commented 4 years ago

Hi Sebastian,

Thank you very much for letting me know. That's great news! Will give it a try asap and let you know if I encounter any issues. Probably will give some feedback too :]

Have a good day and thanks again!

MariusGheorghe commented 4 years ago

Hi Sebastian,

Was trying out the new develop version and when trying to launch it with the command (the output of STAR is piped in):

    -x /dev/stdin \
    -o ${OUTPUT_FILE} \
    -O "${OUTPUT_FILE}.discarded" \
    -a ${ASSEMBLY_FA} \
    -g ${ANNOTATION_GTF} \
    -b ${BLACKLIST_TSV} \
    -I \
    -T \
    -P

I get the following error:

[2020-09-15T08:40:58] Launching Arriba 1.2.0
ERROR: invalid argument to -T

I looked in the documentation and indeed I see that the -T parameter now is not meant to populate the fusion_transcript column, so I guess having -I set replaces the previous "function" of -T. The same goes for -P, which was outputting the peptide sequence but now it is not recognized as a parameter. For the sake of completeness in this issue, I just have to remove those two flags and leave only -I correct?

Thanks again!

suhrig commented 4 years ago

I just have to remove those two flags and leave only -I correct?

Yes, correct. The transcript/peptide sequences and the read identifiers are always reported now, because in the past users forgot them too often and were confused that the respective columns were empty. As such, there is no need for the parameters -I, -T, and -P anymore. -I was repurposed to fill gaps in incomplete transcript sequences.

MariusGheorghe commented 4 years ago

Hi Sebastian,

Tested the new version of Arriba, looks like the nucleotide sequence of the transcripts is nicely put together. I've actually started writing code a short while ago to infer the transcript sequences myself. I already had 8 fusion transcripts that were "manually" assembled for validation purposes. I can confirm that 7/8 were a perfect match with what Arriba output, the other one had an SNP that differed from the transcript I've put together, but I would favor the output of Arriba as you had all the data in one place, so to say.

However, I've came across a (weird) predicted peptide sequence in the output of Arriba that I do not know yet how to interpret:

MALLAEHLLKPLPADKQIETGPFLEAVSHLPPFFDCLGSPVFTPIKADISGNITKIKAVYDTNPAKFRTLQNILEVEKEMYGAEWPKVGATLALMW
LKRGLRFIQVFLQSICDGERDENHPNLIRVNATKAYEMALKKYHGWIVQKIFQAALYAAPYKSDFLKALSKGQNVTEEECLEKIRLFLVNYTATID
VIYEMYTQMNAELNYKV*acplldtsptrghmekqanqnhcessr*tapglcppeqprvlpesagdsprflslfffffwsilkdq*imillpnll
gisrqhrltlylhfsvvktd*hf**vks*gflvlfccqrgtqlwgngvihlalkist*wlgtvahacnp|kgagsdvlisfnffffngeeqssfgnayt*

Why are there so many stop codons. In the documentation I see: " In order to determine, if the codons spanning the breakpoint are translated, one should verify that the peptide sequence does not contain a stop codon before the breakpoint" and what if it does? And several?

Can you please elaborate? Thanks a bunch again for implementing these features.

Marius

Update

Actually 8/8 manually assembled transcripts were a perfect match. The one that initially differed contained an error from my side.

suhrig commented 4 years ago

I can confirm that 7/8 were a perfect match with what Arriba output

Thanks for the feedback. That's excellent beta-testing! I could not have gotten more useful feedback.

the other one had an SNP

Was the SNP in your sequence or in Arriba's? Arriba still assembles as much as possible from the supporting reads, such that deviations from the reference (e.g., SNPs) are often still reflected in the sequence. This is useful, because the predicted sequence is closer to the true fusion sequence.

Why are there so many stop codons.

When a stop codon is found prior to the fusion junction, Arriba still reports all of the hypothetical codons until the junction for two reasons:

The fusion peptide is guaranteed to contains the pipe symbol (|) this way, which makes parsing easier.
The user can see the reason why the fusion does not give rise to a fusion protein. This is also reflected in the reading_frame column, which should contain the value stop-codon.

I have had a discussion with another user about these cases before. The user advocated that such fusion peptides should not be reported in the peptide_sequence column and that the column should rather be left empty, because most users do not care about such peptides. I defied this suggestion and introduced the new value stop-codon in the reading_frame instead. I'm curious about other opinions. What do you think would be best? Should the peptide_sequence column be empty when the reading_frame column reads stop-codon? After all, the information is somewhat redundant. Or is there still some use to having the hypothetical peptide sequence, for example, if a user questions Arriba's peptide prediction and wants to know why Arriba thinks that there is a stop codon? I'm a bit undecided. Considering that you are the second user to be confused by this, I am inclined not to report these sequences and leave curious users on their own in figuring out where the stop codon is.

MariusGheorghe commented 4 years ago

Hi Sebastian,

Glad I can help by providing feedback. That's the least I can do if I keep asking for stuff from you, no? :stuck_out_tongue:

Was the SNP in your sequence or in Arriba's? Arriba still assembles as much as possible from the supporting reads, such that deviations from the reference (e.g., SNPs) are often still reflected in the sequence. This is useful, because the predicted sequence is closer to the true fusion sequence.

I have double and triple checked and apparently for that specific transcript and more specifically for that SNP, I forgot to replace the reference nucleotide with the actual SNP for the final fused transcript sequence :expressionless: So, it was my mistake. Sorry about that. That's the beauty of human error. But to answer your question, the SNP was present in the 3' transcript's nucleotide sequence predicted by Arriba (previous develop version) which I then matched, using wildcards where SNPs occurred, to the wildtype sequence of the transcript. But putting back together the complete fused transcript I've overlooked that SNP

Regarding the multiple codons, yeah.. it's a tough call. My personal opinion is as follows: on the one hand, I tend to agree with the other user: if this peptide will not give rise to a protein, then the information is a bit misleading and probably useless. I would leave the column empty and mark it only with stop-codon or whatever is explicit enough. On the other hand, as you say, it might be informative for the user to see why there isn't any protein sequence predicted. What I would do, I would leave the peptide_sequence empty, but make sure that the reason why it's missing is detailed in the documentation. I understand also your point that it is easier to parse those entries having a | but then again, in our use case, the first filtering criteria is to drop all the entries that do not have a predicted peptide sequence. In this particular case, this entry will not be filtered out, but I will still have to parse it and then check if I got more than one stop codon in the sequence, and then discard it. So from the programmatic point of view, it produces overhead (albeit not a lot) . Then again, this is your tool, so you are entitled to do whatever you feel like with it :smirk:

I've asked also my colleagues what do they think it's best for those cases, so if their opinion differs from mine, I will post their feedback here. Hope it helps.

Marius

suhrig commented 4 years ago

I forgot to replace the reference nucleotide with the actual SNP for the final fused transcript sequence

No worries. I only wanted to know if the error was on Arriba's part, in which case I possibly could have done something about it.

What I would do, I would leave the peptide_sequence empty, but make sure that the reason why it's missing is detailed in the documentation

That settles it. No peptide sequence for fusions with a stop codon prior to the fusion junction! Thanks for your opinion.

MariusGheorghe commented 4 years ago

No worries. I only wanted to know if the error was on Arriba's part, in which case I possibly could have done something about it

Yeah, human eyes are awesome but when looking at sequences of hundreds of letters, they can let you down :grimacing:

That settles it. No peptide sequence for fusions with a stop codon prior to the fusion junction! Thanks for your opinion.

Hehehe.. No problem. I do have a colleague who expressed her opinion about letting those peptide sequences in so she can see what actually happened. And just filter out if there is stop-codon in the reading_frame column. She said she will post her comments/feedback here as she was in contact with you previously, but I do not know if she forgot or changed her mind.

This would be just to make the decision harder :stuck_out_tongue:

suhrig commented 4 years ago

If your colleague complains about this decision I will divert all the blame onto you. Hehe

MariusGheorghe commented 4 years ago

I shall take full responsibility :smirk:

suhrig / arriba

synonymous mutations and the predicted sequence #76