ksahlin / NGSpeciesID

Reference-free clustering and consensus forming of long-read amplicon sequencing
GNU General Public License v3.0

duplicate consensus sequences generated #25

Closed pecholleyn closed 1 year ago

pecholleyn commented 1 year ago

Hi, I have noticed a problem with reads produced by amplicon sequencing (marker COI-5P). In some cases, with samples usually having thousands of reads, NGSpeciesID outputs multiple (2 or 3) very similar consensus sequences (99+% identity) that end up as the same sequence once I clean them in the following steps of the pipeline. I would expect NGSpeciesID to produce only one cluster. You can find the reads for one of the problematic samples here.

NGSpeciesID (conda env) ran with the following options: NGSpeciesID --ont --consensus --medaka --m 709 --s 30 --medaka_model r1041_e82_260bps_sup_g632 --abundance_ratio 0.01

With these settings I obtain 3 consensus_reference files (1, 362 and 485), i.e. from before Medaka polishing. I aligned them with ClustalW, and they are too similar to belong to distinct clusters.
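For readers who want to check such a result without an alignment tool, here is a minimal sketch of a pairwise identity comparison. It is not part of NGSpeciesID or ClustalW. The function names are hypothetical, and identity is defined here, as an assumption, as 1 minus the edit distance divided by the length of the longer sequence, which can differ slightly from the identity an aligner reports.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def percent_identity(a: str, b: str) -> float:
    """Assumed definition: 100 * (1 - edit distance / length of the longer sequence)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


def pairwise_identity(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Percent identity for every unordered pair of named sequences."""
    return {
        (x, y): percent_identity(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }
```

Calling `pairwise_identity` on a dictionary that maps cluster IDs to consensus sequences returns one identity value per pair. For amplicon-length sequences the quadratic running time is not a concern.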

Am I doing something wrong?

ksahlin commented 1 year ago

Hi @pecholleyn ,

The final output of NGSpeciesID is in the folder(s) “medaka_cl_id_X”, where X is the cluster ID. Are there three of those folders in your dataset? If not, it could be that you have analysed sequences from step 2 (SPOA consensus); see Figure 1 here: https://onlinelibrary.wiley.com/doi/10.1002/ece3.7146 . These draft consensus sequences will undergo additional primer removal, merging and polishing after this.

If you are sure these are the final medaka-polished consensus sequences, I could run your data.

pecholleyn commented 1 year ago

I have the three folders medaka_cl_1, medaka_cl_362, medaka_cl_485 (I mentioned the 3 consensus_reference_X.fasta files in the original message because I thought they were already produced after merging, my mistake). The consensus.fasta files in these folders are also 99+% identical to each other.

ksahlin commented 1 year ago

I have fixed the bug, uploaded the new version (v0.3.0) to PyPI, and made a release (https://github.com/ksahlin/NGSpeciesID/releases/tag/v0.3.0) which describes the cause of this behaviour.

Getting the fix only requires activating the NGSpeciesID environment and pulling the new version with pip, since nothing changed in the dependencies (spoa, racon, Medaka). That is, the steps:

  1. conda activate NGSpeciesID
  2. pip install -U NGSpeciesID

I have verified that the fix works with your dataset (1 consensus formed) and that the two installation steps above also work. I will close the issue; let me know if you run into further issues.
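To confirm the same outcome on other samples after upgrading, one can count the polished consensus files in the output directory. The sketch below is not part of NGSpeciesID. It assumes, based only on the folder and file names mentioned in this thread, that each final cluster lives in a subfolder whose name starts with "medaka_cl_" and contains a consensus.fasta file. The function name is hypothetical.

```python
from pathlib import Path


def find_polished_consensus(outfolder: str) -> list[Path]:
    """Return consensus.fasta paths found in subfolders named medaka_cl_*.

    Assumption: the folder prefix and file name follow what is described
    in this thread; adjust them if your output is laid out differently.
    """
    root = Path(outfolder)
    return sorted(
        folder / "consensus.fasta"
        for folder in root.glob("medaka_cl_*")
        if (folder / "consensus.fasta").is_file()
    )
```

For a single-species sample like the one discussed here, `len(find_polished_consensus(outfolder))` would be expected to be 1 after the fix.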