soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0

Use PLASS in metatranscriptomic data #35

Open jjsanchezgil opened 3 years ago

jjsanchezgil commented 3 years ago

Hi

I have 2x150 bp metatranscriptomic reads (prokaryotic) and I'd like to use PLASS to assemble proteins. Should it be used the same way as for a metagenome? In this case most reads should translate directly into protein sequence, and whether a read contains a start or stop codon seems less crucial for deciding if it belongs to a gene, since the reads come from transcribed genes anyway. Perhaps the beginning of a gene should instead be identified by the Shine-Dalgarno sequence plus start codon (or by an upstream stop codon if the gene is inside a polycistron). What would be the best way to apply PLASS to metatranscriptomes? Should I start by translating whole reads in all frames?

Thank you

milot-mirdita commented 3 years ago

Generally, it should just work if you provide the paired-end reads to PLASS.

Members of our team are working on extending PLASS so it works better on metatranscriptomes; however, that work is not public yet. I'll ask them to comment on this issue.

jjsanchezgil commented 3 years ago

Hi! Thanks a lot for your answer, and I'm happy to hear that there's work on this! I can imagine it's delicate to write a full response here, but I would be very happy to get some hints (if possible). Thanks!

LouisPwr commented 3 years ago

Hi! As mentioned above, PLASS should work perfectly fine with your paired-end reads as input. One thing to consider is that you will probably get highly redundant sequences. At least this is what we are observing with transcriptomic data. To overcome this issue you can cluster your results using MMseqs2 (also available in the soedinglab repository). In particular, choose the easy-linclust workflow and set the parameter --min-seq-id [float] to e.g. 0.97. This alone should already reduce the redundancy considerably.
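As a rough sketch of the assemble-then-cluster route described above, the following Python snippet only builds and prints the two command lines; it does not run them. The file names (`reads_1.fastq.gz`, `reads_2.fastq.gz`, the `sample` prefix, `tmp`) and the helper name `plass_then_linclust` are placeholders chosen for illustration, not part of either tool.

```python
import shlex


def plass_then_linclust(reads_1, reads_2, prefix, tmp_dir="tmp", min_seq_id=0.97):
    """Return two command lines: PLASS protein assembly, then MMseqs2 clustering."""
    assembly = f"{prefix}_plass.faa"   # hypothetical name for the PLASS protein FASTA
    cluster_prefix = f"{prefix}_clu"   # hypothetical output prefix for easy-linclust
    return [
        # plass assemble <reads_1> <reads_2> <output FASTA> <tmp dir>
        ["plass", "assemble", reads_1, reads_2, assembly, tmp_dir],
        # mmseqs easy-linclust <input FASTA> <output prefix> <tmp dir> --min-seq-id <float>
        ["mmseqs", "easy-linclust", assembly, cluster_prefix, tmp_dir,
         "--min-seq-id", str(min_seq_id)],
    ]


commands = plass_then_linclust("reads_1.fastq.gz", "reads_2.fastq.gz", "sample")
for cmd in commands:
    print(shlex.join(cmd))  # inspect the commands before running them yourself
```

If the printed commands look right, they can be pasted into a shell or passed to `subprocess.run(cmd, check=True)`. As far as I know, easy-linclust writes the cluster representatives to `<prefix>_rep_seq.fasta`, which would be the reduced protein set here.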

Another option is to assemble your transcripts first at the nucleotide level with a transcriptome assembler (e.g. Trinity, which performs extremely well on RNA-seq data). After assembling at the nucleotide level, you can convert the transcripts into amino acid sequences. This might be slightly more specific for metatranscriptome datasets. Right now, we are working on a tool similar to Trinity, which can later be used for metatranscriptome assembly and should perform at least as well as Trinity.

But since we are at an early stage of development, the options above are the only ones at the moment. Depending on your objective, using PLASS might be the faster option, whereas the transcriptome assembly could give you deeper insight into alternative splicing isoforms. I'd be happy to hear whether this helped, or if anything is unclear!
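To make "converting transcripts into amino acid sequences" concrete, here is a minimal, naive six-frame translation in plain Python. It is only an illustration under simple assumptions: the standard codon table, no ORF or start-codon detection, and any codon containing an ambiguous base becomes `X`. Dedicated coding-sequence predictors do considerably more than this. All function names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in NCBI order (first base varies slowest, bases ordered T, C, A, G).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (non-ACGT characters are left unchanged)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame label."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

For the toy input `ATGGCCTAA`, frame `+1` reads ATG, GCC, TAA, i.e. Met, Ala, stop (`*`).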

jjsanchezgil commented 3 years ago

Hi @LouisPwr! Wow, thanks a lot for your help! We are now running Trinity. When you say to convert Trinity's output to amino acids, do you mean using the contigs as input for PLASS, or continuing outside PLASS? Do you think running both could help in identifying more genes or getting better resolution? Thanks again, and looking forward to seeing your next work!

LouisPwr commented 3 years ago

Hi @jjsanchezgil! I asked another colleague, who has more experience with transcriptomes, and he told me the following. The two best options are:

1) The "standard" peptide prediction pipeline for transcriptomic data: Trinity for assembly, then TransDecoder for predicting coding sequences and extracting the peptide sequences (TransDecoder can be replaced by GeneMarkS-T or CodAn).

2) PLASS followed by clustering: since PLASS is itself an assembler, I wouldn't use the contigs from Trinity as input for PLASS, because in that case the assembly has already been done by Trinity. Instead, use your paired-end reads as input for PLASS (for assembly and conversion into peptide sequences) and cluster the sequences afterwards with MMseqs2.

These two pipelines haven't been compared yet, but we assume that they perform similarly. So it might be worth trying both options and comparing the results, perhaps on a small dataset first.
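For the Trinity plus TransDecoder route, a hedged sketch of the command sequence could look like the snippet below. It only builds and prints the commands. The read file names, CPU and memory values, and the helper name are placeholders, and the location of Trinity's assembled transcripts differs between Trinity versions, so the `transcripts` path is an assumption you should check against your own output.

```python
import shlex


def trinity_transdecoder(reads_1, reads_2, trinity_out="trinity_out",
                         transcripts=None, cpu=8, max_memory="50G"):
    """Return command lines for Trinity assembly followed by TransDecoder."""
    if transcripts is None:
        # Assumption: recent Trinity releases write <output>.Trinity.fasta;
        # older ones write <output>/Trinity.fasta. Pass the real path if it differs.
        transcripts = f"{trinity_out}.Trinity.fasta"
    return [
        ["Trinity", "--seqType", "fq", "--left", reads_1, "--right", reads_2,
         "--CPU", str(cpu), "--max_memory", max_memory, "--output", trinity_out],
        # Step 1: extract long open reading frames from the transcripts.
        ["TransDecoder.LongOrfs", "-t", transcripts],
        # Step 2: predict likely coding regions and write peptide sequences.
        ["TransDecoder.Predict", "-t", transcripts],
    ]


steps = trinity_transdecoder("reads_1.fastq.gz", "reads_2.fastq.gz")
for cmd in steps:
    print(shlex.join(cmd))
```

To my knowledge, TransDecoder.Predict writes the peptides to a file ending in `.transdecoder.pep`; that file could then be compared against the clustered PLASS output, for example by number and length of sequences.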

Hope we could help!