suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
226 stars 50 forks source link

Translation function from arriba #23

Closed BrunoFant closed 5 years ago

BrunoFant commented 5 years ago

Hello Sebatian,

I am writing to you because I have started using arriba (beautiful tool BTW) and the output of the translation feature of Arriba (called by the -P parameter) is a bit unclear to me.

As per the documentation, If the fusion transcript contains an ellipsis (...), the sequence beyond the ellipsis is trimmed before translation, because the reading frame cannot be determined reliably. A very sensible choice, and for my output it is indeed the case for the part after the fusion. However, the output of the part before the fusion confuses me. As examples, here are two events I find:

CCCTTTGGACCTTTGgCACCAGGCTGGG___AAAAAAGAGTGGATTCAACAGACAGGGTTTACTTTGTGAATCATAACA...GGAAGATCCAAGAACTCAAGG|ATAAAAGATCTGCAGCTATGGAATTCTTCTCCATGACATTTTCACAGAACATACTACTTGTGATTTATATCATGTCCTTACTCAAG___GTAAGGAACTGCAAGTGATCAATATTGC

gives

EDPRTQg|*

And

AAAGAAGACTGGGCCTACAAAGAAGAAAGTGAAAGAACTGAGAATTTTGG...ATCGCATGCCATATGAAGACATAAGAAACGTTATTCTGGAGGTTAATGAAGACATGCTGAGTGAGGCTTTAATTCAG|ATATGTTTAAAGGGTAAGGTGCAC...CACAGCCTCTCACAGACAG___TATGGAAGATTTTTATCCAAATAAAAATCATGGCCCT

gives

RMPYEDIRNVILEVNEDMLSEALIQ|ICLKGKV

I gather that in those cases (ellipsis in the first part of the fusion), arriba only translates from the last ellipsis of the first transcript part onwards, but how so ? Do you translate a full reconstruction of the CDS from the assembly and the gtf, and then only take the appropriate part (which I'm guessing is the case from the doc: Translation starts [...] when the start codon is encountered in the 5' gene. )?If so, how do you handle alternate start codons ? Does this translation method only happen when you are certain of the transcript start / transcript ID ?

Additionally, I have events looking like this:

GCGGATCTGGGGCCGTCCTCAG|CATAAGCTGTGGCCATGACTACTGAAGT...GGACTCTAGCCAGTTAGGAACAGATGCAACCAAGGAAAAACCTAAAGAAG

translated like that:

ADLGPSS|a*

Which I'm guessing is the other behavior of the translation tool described in the doc: Translation starts at the start of the assembled fusion transcript

My question here is then how do you choose which translation method to use ? Do you check whether the natural (annotated) 5' start codon is here and translate from it if you can, from the start of the transcript if not ? In addition, translation from the start of the transcript seems like it lacks some biological relevance - why not translate from the first encountered start codon ?

I'm sorry for the wall of text and the numerous questions but the translation output of arriba is the part I am most interested in, so I figured I would try to clarify my confusion.

Thanks in advance, and thanks for a great tool too !

Bruno

suhrig commented 5 years ago

Hi Bruno, I will answer your questions one by one:

arriba only translates from the last ellipsis of the first transcript part onwards

Correct. GGAAGATCCAAGAACTCAAGG translates to EDPRTQg. The part before the ellipsis is ignored.

Do you translate a full reconstruction of the CDS from the assembly and the gtf, and then only take the appropriate part

No, the sequence of the reference assembly is not used as a template for translation. Arriba uses the sequence of the actual reads supporting a fusion as a template (the observed sequence). This way, SNPs, SNVs, indels, and non-template bases are reflected in the translated sequence. Especially non-template bases are quite often relevant to the reading frame. These are a few bases (typically 1-20) that are inserted between the fusion breakpoints. Occasionally, they rescue the reading frame, when the fusion would be out-of-frame if exons were fused directly.

Similarly, the GTF file is only used as guidance to determine the transcript sequence. Arriba does NOT simply take the sequences of exons annotated in the GTF file and translate those. When intronic regions are transcribed, such an approach would lack the intronic sequence. Transcription of a few intronic bases is common phenomenon with EML4-ALK fusions, for instance, and can also rescue the reading frame.

In a nutshell, here is what happens:

  1. Arriba assembles the observed transcript sequence from the supporting reads of a fusion.
  2. It checks which GTF-annotated transcript best explains the splicing pattern of the fusion-supporting reads. This transcript defines the reading frame.
  3. Arriba looks for bases where the fusion-supporting reads overlap with a coding exon of the annotated transcript in order to determine the reading frame of the first base of the observed transcript sequence.
  4. From here on, the GTF annotation does not play a role anymore. The sequence is simply translated in triplets just like a ribosome would do, until a stop codon is encountered.

If so, how do you handle alternate start codons ?

If a gene has multiple transcripts, Arriba picks the one that best fits the splice pattern suggested by the reads supporting a fusion. If there are still multiple candidates left, it picks the longest transcript. Note that in most situations, it does not even matter, which transcript you choose, because most of them have the same reading frame. I have seen few situations, where it made a difference and by using the transcript that best matches the splice pattern Arriba does a decent job at picking the correct one.

Does this translation method only happen when you are certain of the transcript start / transcript ID ?

Translation happens, when

Arriba does not require that there be an unambiguous transcript ID. As stated above, the longest transcript is chosen if there are multiple candidates that all fit the splice pattern equally well.

Which I'm guessing is the other behavior of the translation tool described in the doc: Translation starts at the start of the assembled fusion transcript Do you check whether the natural (annotated) 5' start codon is here and translate from it if you can, from the start of the transcript if not ?

Yes, translation starts either with the first base of the observed fusion sequence or - if the observed fusion sequence extends even before the start codon of the 5' gene - with the start codon of the 5' gene.

In addition, translation from the start of the transcript seems like it lacks some biological relevance - why not translate from the first encountered start codon ?

Because the supporting reads of a fusion often do not cover the start of the 5' partner. The fraction of a gene covered by supporting reads is limited by the insert size of the library. So usually, you only see ~200 bases +/- from the breakpoint. It is reasonable to assume that transcription (and translation) starts earlier - namely at the natural start codon. I do not see why you think this lacks biological relevance. The alternative hypothesis would be that translation starts at the methionine closest to the breakpoint, which does not seem as plausible as the natural start codon IMO.

but the translation output of arriba is the part I am most interested in

FYI: I discovered a bug today where a fusion is classified as in-frame, even though it isn't. This happens in rare situations and only when the anti-sense strand of the 3' gene is transcribed. I am working on a fix that I will push with the next release. So if the reading_frame column is essential for your research question, then you should check that the column strand2 is not -/+ or +/- to be sure that the fusion is really in-frame, if the reading_frame column says so. The translated sequence is not affected by this bug, however (only capitalization, i.e., some amino acids in the 3' part are written in upper case, even though they should be lower case).

Cheers, Sebastian

BrunoFant commented 5 years ago

Hi Sebastian, and thank you very much for your quick answer!

I understand a lot better now. I was (wrongly) working under the assumption that the reconstituted transcript given out by Arriba was aiming for a total reconstruction of an actual transcript/CDS, start to finish, which skewed a lot of my observations - mainly my point about looking for a start codon. The way it works is much clearer to me now, and it indeed makes a lot of sense.

Just so that I'm not working under any misconceptions again, I will rephrase what you said, and please feel free to correct me if I'm still wrong: after reconstructing the transcript structure around the fusion by using reads supporting said fusion, Arriba searches for a match (I assume the earliest possible, given all restrictions?) between the reading frame of the right "normal" transcript and the fusion transcript, thus establishing the reading frame of the latter. It then translates the latter until a stop, and that is the output.

I guess I just have one more question, which is more out of curiosity than anything : why does Arriba only start looking for stop codons in the 3' fusion side? At least that is what it says in the documentation, I haven't encountered that case yet.

Thabks again for a very thorough answer, and a great tool, Bruno

Le jeu. 11 juil. 2019 à 20:31, suhrig notifications@github.com a écrit :

Hi Bruno, I will answer your questions one by one:

arriba only translates from the last ellipsis of the first transcript part onwards

Correct. GGAAGATCCAAGAACTCAAGG translates to EDPRTQg. The part before the ellipsis is ignored.

Do you translate a full reconstruction of the CDS from the assembly and the gtf, and then only take the appropriate part

No, the sequence of the reference assembly is not used as a template for translation. Arriba uses the sequence of the actual reads supporting a fusion as a template (the observed sequence). This way, SNPs, SNVs, indels, and non-template bases are reflected in the translated sequence. Especially non-template bases are quite often relevant to the reading frame. These are a few bases (typically 1-20) that are inserted between the fusion breakpoints. Occasionally, they rescue the reading frame, when the fusion would be out-of-frame if exons were fused directly.

Similarly, the GTF file is only used as guidance to determine the transcript sequence. Arriba does NOT simply take the sequences of exons annotated in the GTF file and translate those. When intronic regions are transcribed, such an approach would lack the intronic sequence. Transcription of a few intronic bases is common phenomenon with EML4-ALK fusions, for instance, and can also rescue the reading frame.

In a nutshell, here is what happens:

  1. Arriba assembles the observed transcript sequence from the supporting reads of a fusion.
  2. It checks which GTF-annotated transcript best explains the splicing pattern of the fusion-supporting reads. This transcript defines the reading frame.
  3. Arriba looks for bases where the fusion-supporting reads overlap with a coding exon of the annotated transcript in order to determine the reading frame of the first base of the observed transcript sequence.
  4. From here on, the GTF annotation does not play a role anymore. The sequence is simply translated in triplets just like a ribosome would do, until a stop codon is encountered.

If so, how do you handle alternate start codons ?

If a gene has multiple transcripts, Arriba picks the one that best fits the splice pattern suggested by the reads supporting a fusion. If there are still multiple candidates left, it picks the longest transcript. Note that in most situations, it does not even matter, which transcript you choose, because most of them have the same reading frame. I have seen few situations, where it made a difference and by using the transcript that best matches the splice pattern Arriba does a decent job at picking the correct one.

Does this translation method only happen when you are certain of the transcript start / transcript ID ?

Translation happens, when

  • it is (fairly) clear which gene is the 5' partner and which the 3' partner, which (among other things) requires that the transcribed strands could be determined
  • the 5' partner is a coding gene and parts of the coding region are part of the fusion (i.e., the breakpoint is not in the 5' UTR)
  • the reading frame can be determined, i.e., at least one base of the supporting reads overlaps with a coding exon

Arriba does not require that there be an unambiguous transcript ID. As stated above, the longest transcript is chosen if there are multiple candidates that all fit the splice pattern equally well.

Which I'm guessing is the other behavior of the translation tool described in the doc: Translation starts at the start of the assembled fusion transcript Do you check whether the natural (annotated) 5' start codon is here and translate from it if you can, from the start of the transcript if not ?

Yes, translation starts either with the first base of the observed fusion sequence or - if the observed fusion sequence extends even before the start codon of the 5' gene - with the start codon of the 5' gene.

In addition, translation from the start of the transcript seems like it lacks some biological relevance - why not translate from the first encountered start codon ?

Because the supporting reads of a fusion often do not cover the start of the 5' partner. The fraction of a gene covered by supporting reads is limited by the insert size of the library. So usually, you only see ~200 bases +/- from the breakpoint. It is reasonable to assume that transcription (and translation) starts earlier - namely at the natural start codon. I do not see why you think this lacks biological relevance. The alternative hypothesis would be that translation starts at the methionine closest to the breakpoint, which does not seem as plausible as the natural start codon IMO.

but the translation output of arriba is the part I am most interested in

FYI: I discovered a bug today where a fusion is classified as in-frame, even though it isn't. This happens in rare situations and only when the anti-sense strand of the 3' gene is transcribed. I am working on a fix that I will push with the next release. So if the reading_frame column is essential for your research question, then you should check that the column strand2 is not -/+ or +/- to be sure that the fusion is really in-frame, if the reading_frame column says so. The translated sequence is not affected by this bug, however (only capitalization, i.e., some amino acids in the 3' part are written in upper case, even though they should be lower case).

Cheers, Sebastian

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/23?email_source=notifications&email_token=AFN7BYKAABRHR3VSDBYVOTLP65367A5CNFSM4IBJ7OXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZXSYCQ#issuecomment-510602250, or mute the thread https://github.com/notifications/unsubscribe-auth/AFN7BYITRS4DDPJTZY6FSA3P65367ANCNFSM4IBJ7OXA .

suhrig commented 5 years ago

Hi Bruno,

I will rephrase what you said, and please feel free to correct me if I'm still wrong

All correct.

Arriba only start looking for stop codons in the 3' fusion side?

Stop codons are represented as an asterisk (*). I'd be surprised if you really hadn't seen any yet. What the documentation is trying to tell you is that Arriba does not stop translation when it encounters a stop codon in the 5' part. It only stops when the stop codon is in the 3' part. This means that the amino acids in the 3' end are really only translated if there is no stop codon in the 5' part. You will need to filter fusions with a * before the | to get true fusion proteins.

I agree that this may not be intuitive. I made this design decision, because there would be no pipe (|) in the peptide sequence if Arriba stopped translating upon hitting a stop codon in the 5' part. This in turn may also be irritating to the user. But maybe I should change this behavior. To get some feedback on a real-life use case, may I ask you about your opinion? Would you prefer Arriba to stop translation in the 5' end after hitting a stop codon or should it continue until at least the fusion breakpoint, such that the pipe is guaranteed to be contained in the peptide sequence? Would you think that the lack of a pipe could be confusing to users?

Regards, Sebastian

BrunoFant commented 5 years ago

Hi again Sebastian!

Sorry if I wasn't clear in my follow - up question, but you still addressed it very clearly in your answer. I have indeed encountered stop codons in the translated sequences in my Arriba output files, but I have yet to encounter a stop codon in the 5' part of the fusion. This is what I was trying to convey. I apologize for any misunderstanding that I might have caused.

In terms of real-life use, I am mainly interested on the sequence of the proteins that could be produced by gene fusions, as they are, all things considered, "new" or non-self proteins. As such, translated sequences that have a stop codon before the fusion do not interest me, as they belong to the "normal" proteome.

In more general terms, I suspect that people interested in the actual protein sequence of gene fusion proteins will mainly want protein products that actually span the fusion. I currently fail to see uses where one might need the actual theoretical protein sequence if a stop occurs before the fusion within the transcript. As an end-user, either translating up to the first stop codon (in which case, intuitively, a quick awk/grep on the protein sequence field would rule out the ones that don't present the "|" fusion symbol) or adding a column indicating where the stop is located compared to the fusion (same strategy, grep/awk on that field's characteristics) would greatly improve my use of Arriba's output. After all, the pipe would still be there in the transcript output (and therefore still usable for people who want to search on it) and you would reflect the functional protein output with more fidelity (?). I acknowledge that this rules out commands that search the protein field specifically for a "|". But my guess is that programs using that search then mostly rule out results which have a stop in the pre-"|" section.

Tl;dr: the pipe is probably not needed in the protein sequence, or at least its absence can be easily filtered out.

Thanks again for your prompt answer, Best, Bruno

Le jeu. 11 juil. 2019 à 23:44, suhrig notifications@github.com a écrit :

Hi Bruno,

I will rephrase what you said, and please feel free to correct me if I'm still wrong

All correct.

Arriba only start looking for stop codons in the 3' fusion side?

Stop codons are represented as an asterisk (). I'd be surprised if you really hadn't seen any yet. What the documentation is trying to tell you is that Arriba does not stop translation when it encounters a stop codon in the 5' part. It only stops when the stop codon is in the 3' part. This means that the amino acids in the 3' end are really only translated if there is no stop codon in the 5' part. You will need to filter fusions with a before the | to get true fusion proteins.

I agree that this may not be intuitive. I made this design decision, because there would be no pipe (|) in the peptide sequence if Arriba stopped translating upon hitting a stop codon in the 5' part. This in turn may also be irritating to the user. But maybe I should change this behavior. To get some feedback on a real-life use case, may I ask you about your opinion? Would you prefer Arriba to stop translation in the 5' end after hitting a stop codon or should it continue until at least the fusion breakpoint, such that the pipe is guaranteed to be contained in the peptide sequence? Would you think that the lack of a pipe could be confusing to users?

Regards, Sebastian

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/23?email_source=notifications&email_token=AFN7BYLOTHJP6DMPZJZANVLP66SUXA5CNFSM4IBJ7OXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZYCIXY#issuecomment-510665823, or mute the thread https://github.com/notifications/unsubscribe-auth/AFN7BYKI62U6PDQPDCQCBGTP66SUXANCNFSM4IBJ7OXA .

BrunoFant commented 5 years ago

To add to that, an indication of which "normal" transcript is supported by the reads and used as a reading frame reference would also be great !

suhrig commented 5 years ago

Hi Bruno,

As such, translated sequences that have a stop codon before the fusion do not interest me

Thanks for the feedback. I agree with you and I think I won't report peptide sequences with a stop codon before the pipe anymore in the next release of Arriba.

a quick awk/grep on the protein sequence field would rule out the ones that don't present the "|" fusion symbol

Note that it is currently also possible to remove peptide sequences with a stop codon before the pipe by using grep -v -P '\*.*\|'. In addition, you might want to remove peptide sequences with a stop codon right after the pipe, i.e., grep -v -F '|*'.

To add to that, an indication of which "normal" transcript is supported by the reads and used as a reading frame reference would also be great !

I might add such a column in the future. Probably, this won't happen before the next major release, though, because I like to refrain from making substantial changes to the output format.

Thanks again for the feedback. It is appreciated. I will close this issue for now. Feel free to re-open if you have further questions.

Regards, Sebastian