ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

Poly-A/poly-T soft clipping #157

Closed roomfortwo closed 3 years ago

roomfortwo commented 3 years ago

Hello there, I'm using NOVOPlasty to recover some arthropod mitochondrial genomes from RNA-seq data (quite trendy these days), and I'm finding that NOVOPlasty tends to extend the poly-A tails from the reads (or poly-T at the 3' end). Example below.

>Contig01+195792442
TTTTTTTTTTTTTTTTTTTTTTTTTTATCTAGATACAACTTCCAATATATCTACTTTGTTACGACTTATCTCTTTTTTCAGAGAGAGCGACGGGCGATATGTACATAATCTAGCCCTAATTCATTAGAATAAATTAATATCTAATTACAT ... CAACATAATTTTCCTCCAGCCGATCATACGTACAATCAATTAAGTGTTTCATCTTTATAAAAAAAAAAAAAAAAAAAAA

I tried to remove the poly-A/poly-T stretches from the reads but failed, mainly because trimmers tend to remove AT-rich biological sequences of interest. Is there any flag I could set to either soft-clip the poly-A extensions or just ignore them, in order to retrieve the genomic sequences?
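To illustrate the kind of end-only clipping meant here, the following is a minimal Python sketch, not a NOVOPlasty feature. The function name `clip_terminal_homopolymers` and the `min_run` threshold of 10 are assumptions chosen for illustration. It removes A or T runs only when they touch a read end and are at least `min_run` bases long, so internal AT-rich sequence is left alone. It handles bare sequences only; for FASTQ input the quality string would need to be clipped by the same amount.

```python
import re


def clip_terminal_homopolymers(seq, min_run=10):
    """Clip poly-A/poly-T runs of at least `min_run` bases from either read end.

    Hypothetical helper for illustration; `min_run` is an assumed threshold.
    Runs that do not touch a read end are never modified.
    """
    # Anchored at the start: only a run beginning at position 0 is removed.
    head = re.compile(r"^(?:A{%d,}|T{%d,})" % (min_run, min_run))
    # Anchored at the end: only a run reaching the last base is removed.
    tail = re.compile(r"(?:A{%d,}|T{%d,})$" % (min_run, min_run))
    seq = head.sub("", seq)
    seq = tail.sub("", seq)
    return seq


if __name__ == "__main__":
    read = "T" * 12 + "ATCTAGATTTTAAAACG" + "A" * 15
    print(clip_terminal_homopolymers(read))
```

Because the patterns are anchored, a read such as `ACGT` + 20 × `A` + `CGTA` passes through unchanged, which is the behaviour general-purpose trimmers tend not to guarantee.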

Thanks in advance, Luis AF.

ndierckx commented 3 years ago

Hi, I don't have experience with RNA-seq data. Is the complete mitogenome covered by it, or just certain areas? Does it only extend with poly-A/T at the ends when no further extension is possible, or do you think further extension is possible? You can run with the extended log set to 1 and send me that file, and I can have a look.

roomfortwo commented 3 years ago

Hi,

  1. I think the coverage of the mitogenome depends mainly on the RNA extraction method (for example, whether you select mRNA based on poly-A tails) and on random factors (like the expression of the mtDNA itself, or mtDNA contamination).
  2. I think further extension is possible, but the reads needed to extend might be less abundant than the polyadenylated reads; that's why I first thought of clipping the poly-T/poly-A.

Here are three logs from assemblies of the same library (SRR3458647) with different seeds. The RNA from this library was selected for poly-A tails.

log_extended_hans-3-k29.txt log_extended_node49-k29.txt log_extended_scu-b-k29.txt

This, on the other hand, is a log from a whole-RNA sequencing library (I have bigger logs, but they are too heavy); with this library I assembled the whole mtDNA. log_extended_mtDNA-frl2015-39.txt

Let me know what you think, cheers,

ndierckx commented 3 years ago

How long are the mitochondrial genomes you are trying to assemble?

And the example in your first comment is not one of those 3 you sent, I guess?

I don't think it is always a poly-A tail problem. The coverage seems to drop to almost 0 before the poly-A starts, so I'm not sure it can extend further. And I think some are just A-repeats; it looks like the data has a strong error rate and reduced coverage around these regions.

If you have a specific sequence or poly-A you want me to check, I will, but I can't check all the files completely.

roomfortwo commented 3 years ago

Hey, I'm trying to assemble genomes of about 14 kb, and I believe the "strong error rate" you mention is due to NUMTs. If you think NOVOPlasty is not extending because of low coverage and strong variation, I'll take it.

Thank you so much for your insight!

ndierckx commented 3 years ago

Hi, NUMTs are not the problem; they usually don't affect the assembly. I mean a strong error rate around single-nucleotide repeats, which depends on the library preparation and the Illumina kit you used.
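One way to see where such single-nucleotide repeats sit in a contig or reference is to list them directly. The sketch below is only an illustration and not part of NOVOPlasty; the function name `homopolymer_runs` and the default `min_run` of 8 are assumptions. The reported coordinates could then be compared against a coverage track to check whether coverage drops around those runs.

```python
import re


def homopolymer_runs(seq, min_run=8):
    """List single-nucleotide runs of at least `min_run` bases (min_run >= 2).

    Hypothetical helper for illustration. Returns tuples of
    (start, end, base, length) with 0-based start and exclusive end.
    """
    # One base followed by at least (min_run - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return [
        (m.start(), m.end(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    contig = "ACG" + "T" * 9 + "GCA" + "A" * 3
    for start, end, base, length in homopolymer_runs(contig):
        print(f"{base} x {length} at {start}-{end}")
```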