Advice on tool for amplicon sequencing

ksahlin / NGSpeciesID

Reference-free clustering and consensus forming of long-read amplicon sequencing

GNU General Public License v3.0

Advice on tool for amplicon sequencing #4

Closed by mlosilla 3 years ago

mlosilla commented 3 years ago

Hi,

I have looked at several tools from your repo, thank you very much for making them available.

I can't figure out which tool is best suited for my data: I amplicon-sequenced a ~2000 bp locus with ONT, at a depth of >1000x. There could be one, two, three, or more copies (duplications) of my target locus, depending on the species. I need to 1) figure out how many copies of my target locus there are in my sample, and 2) get a consensus sequence for each copy.

Which tool would you recommend: IsoCon, isONcorrect, isONclust, isONclust with --consensus, or NGSpeciesID? A combination of some? Or something else?

Thank you, Mau

ksahlin commented 3 years ago

Hi Mau,

Thanks!

It depends on the sequence identity of your duplications. If they are fairly divergent, use NGSpeciesID. If they are not, try IsoCon. Note that IsoCon was developed for CCS reads with a low error rate, so if you go that route I would consider running isONcorrect first to error-correct the reads. You can ignore isONclust with the consensus option, since it is the draft implementation of NGSpeciesID. As for other tools, I'm not aware of any that are immediately applicable to amplicon-sequenced ONT data, especially if you want to do reference-free analysis, which is probably best unless your reference genome has all or most of the copies.

So in summary:
If high mutation rate between copies: NGSpeciesID
If highly similar copies: IsoCon, or (isONcorrect + IsoCon)

Let me know how it goes.

Best, K

mlosilla commented 3 years ago

Hi Kristoffer,

Thank you very much for your reply. Yes, I would rather do reference-free analysis: I only have reference sequences for one species, but I plan to analyze several.

Regarding your two suggested approaches, what copy-similarity threshold would you use to choose between them? For the one species for which I have reference sequences, there are 3 copies. The pairwise % identities are:

copy 1 vs copy 2: 88.5%
copy 1 vs copy 3: 88.8%
copy 2 vs copy 3: 97%

Also, copy 1 has a ~100 bp deletion, but I don't know if this will be the case in other species.
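For readers who want to reproduce this kind of number, here is a minimal Python sketch of computing pairwise percent identity from an existing alignment. It assumes the copies have already been aligned by some other tool, so all rows have equal length and use `-` for gaps. The function name `percent_identity`, the `count_gaps` switch, and the toy sequences are illustrative and not part of any of the tools discussed here. Whether gap columns are counted matters for a case like the ~100 bp deletion: in a ~2000 bp locus, that deletion alone would account for roughly 5% of the alignment columns.

```python
from itertools import combinations


def percent_identity(a: str, b: str, count_gaps: bool = True) -> float:
    """Percent identity between two rows of an alignment.

    Both rows must have equal length and use '-' for gaps.
    With count_gaps=False, indel columns are ignored, so only
    substitutions lower the identity.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap in both rows: not informative
        if (x == "-" or y == "-") and not count_gaps:
            continue  # skip indel columns when only substitutions count
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment (hypothetical sequences, not the real locus).
aligned = {
    "copy1": "ACGTAC----GTACGTTAGC",
    "copy2": "ACGTACGGCAGTACGTTAGC",
    "copy3": "ACGTACGGCAGTACGATAGC",
}

for (name1, seq1), (name2, seq2) in combinations(aligned.items(), 2):
    with_gaps = percent_identity(seq1, seq2, count_gaps=True)
    without_gaps = percent_identity(seq1, seq2, count_gaps=False)
    print(f"{name1} vs {name2}: {with_gaps:.1f}% (gaps counted), "
          f"{without_gaps:.1f}% (gaps ignored)")
```

In the toy data, copy1 and copy2 differ only by the deletion, so their identity drops when gaps are counted and returns to 100% when they are ignored. Comparing the two values gives a quick hint as to whether a low identity is driven mostly by an indel.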

Thanks! Mau

ksahlin commented 3 years ago

For the 88.5% copies I would be fairly confident that NGSpeciesID could work, provided that the mutations are somewhat evenly distributed across the sequences and not just a big indel at the end of one sequence (which would by itself explain the lower identity). However, 97% identity leaves little divergence to separate copies 2 and 3. If the mutations were very evenly distributed point mutations, it could work, but I'm not optimistic about it.
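One rough way to check the "evenly distributed" condition is to count differences per window along an alignment of two copies. The Python sketch below assumes two pre-aligned rows of equal length with `-` for gaps. The function name `differences_per_window`, the window size, and the toy sequences are all hypothetical, and this is only a quick diagnostic, not something any of the tools above does internally.

```python
def differences_per_window(a: str, b: str, window: int = 100) -> list:
    """Count differing columns (substitutions or gaps) in each window
    of two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    counts = []
    for start in range(0, len(a), window):
        pairs = zip(a[start:start + window], b[start:start + window])
        counts.append(sum(1 for x, y in pairs if x != y))
    return counts


# Toy example: both variants are 75% identical to the reference row,
# but the differences are distributed very differently.
ref = "ACGT" * 10                                    # 40 columns
# Every G (positions 2, 6, 10, ...) replaced by A: 10 spread-out substitutions.
spread = "".join("A" if i % 4 == 2 else base for i, base in enumerate(ref))
# Last 10 columns deleted: one indel at the end.
end_indel = ref[:30] + "-" * 10

print(differences_per_window(ref, spread, window=10))     # [2, 3, 2, 3]
print(differences_per_window(ref, end_indel, window=10))  # [0, 0, 0, 10]
```

A flat profile like the first one matches the favorable case described above, while a profile where one window holds nearly all the differences matches the "big indel at the end" case.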

How many reads do you have? If not too many, I would consider running only IsoCon from scratch. Note that IsoCon is fairly slow compared to the other tools, and it performs best if barcodes/primers at the ends of reads are trimmed.
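To answer the read-count question, and to cut the input down if a slow tool struggles with >1000x depth, a small Python sketch like the following could be used. It assumes an uncompressed FASTQ file with exactly four lines per record. The helper names `read_fastq` and `subsample`, the file name, and the sample size are hypothetical, and subsampling is an assumption of this sketch rather than something recommended in the thread.

```python
import random


def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a FASTQ file
    that uses exactly four lines per record."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality


def subsample(records, n, seed=0):
    """Return at most n records, chosen at random with a fixed seed
    so that the selection is reproducible."""
    records = list(records)
    if len(records) <= n:
        return records
    return random.Random(seed).sample(records, n)


# Usage (hypothetical file name and sample size):
# reads = list(read_fastq("amplicon_reads.fastq"))
# print(len(reads), "reads in total")
# subset = subsample(reads, 500)
```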