jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License

CD-HIT runs forever / never finishes #70

Closed. mscharmann closed this issue 3 years ago

mscharmann commented 3 years ago

Hi, a recent run on a large dataset (both minimum thresholds set to 2) got stuck at the CD-HIT step and never finished. This may be related to https://github.com/weizhongli/cdhit/issues/18

I suggest replacing CD-HIT in dDocent with linclust:

https://github.com/soedinglab/MMseqs2/wiki#linclust

I just tried it out: it produces almost the same result and is much faster than CD-HIT. You can install it via conda:

conda install -c conda-forge -c bioconda mmseqs2
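
A minimal sketch of what a linclust run could look like once MMseqs2 is installed. It uses the `mmseqs easy-linclust` workflow; the function name, file names, and the identity and coverage thresholds are illustrative assumptions, not the settings from the test described above and not dDocent's CD-HIT parameters.

```shell
#!/usr/bin/env bash
# Sketch only: cluster a FASTA file with MMseqs2 linclust.
# cluster_with_linclust is a hypothetical helper, not part of dDocent.

cluster_with_linclust() {
    local input_fasta="$1"    # sequences to cluster (placeholder name)
    local out_prefix="$2"     # prefix for the result files
    local tmp_dir="$3"        # scratch directory used by MMseqs2
    local min_id="${4:-0.9}"  # minimum sequence identity (assumed value)

    # Fail early if the input is missing or empty.
    if [ ! -s "$input_fasta" ]; then
        echo "input FASTA '$input_fasta' is missing or empty" >&2
        return 1
    fi
    # Fail early if MMseqs2 is not on the PATH.
    if ! command -v mmseqs >/dev/null 2>&1; then
        echo "mmseqs not found; install MMseqs2 first" >&2
        return 127
    fi

    # The coverage threshold (-c 0.8) is an assumed value; tune as needed.
    mmseqs easy-linclust "$input_fasta" "$out_prefix" "$tmp_dir" \
        --min-seq-id "$min_id" -c 0.8
}
```

Called as `cluster_with_linclust reads.fasta clusterRes tmp` (placeholder names), this should leave the cluster representatives in `clusterRes_rep_seq.fasta` and the cluster membership table in `clusterRes_cluster.tsv`.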

Best regards, Mathias

jpuritz commented 3 years ago

Generally, I don't recommend using full data sets to do reference assembly. See http://www.ddocent.com/quick/ for more info. Assemblies from subsets of the entire data set work best.
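
One way to set up such a subset is to link the reads of a few randomly chosen individuals into a separate directory and run the reference assembly there. The sketch below assumes dDocent-style file names (`Pop_Sample.F.fq.gz` and `Pop_Sample.R.fq.gz`) and GNU `shuf`; the function name, directory names, and subset size are hypothetical, and the quick guide remains the place to look for how many and which individuals to use.

```shell
#!/usr/bin/env bash
# Sketch only: link a random subset of individuals into a new directory.
# link_random_subset is a hypothetical helper, not part of dDocent.

link_random_subset() {
    local src_dir="$1"   # directory holding all demultiplexed reads
    local dest_dir="$2"  # directory that will hold the subset
    local n="$3"         # number of individuals to pick
    local abs_src sample end f

    # Resolve the source to an absolute path so the symlinks stay valid.
    abs_src="$(cd "$src_dir" && pwd)" || return 1
    mkdir -p "$dest_dir" || return 1

    # Derive sample prefixes from the forward-read files, then pick n at random.
    find "$abs_src" -maxdepth 1 -name '*.F.fq.gz' -exec basename {} .F.fq.gz \; |
        shuf -n "$n" |
        while IFS= read -r sample; do
            # Link forward and reverse reads of each chosen individual.
            for end in F R; do
                f="$abs_src/$sample.$end.fq.gz"
                if [ -e "$f" ]; then
                    ln -sf "$f" "$dest_dir/"
                fi
            done
        done
}
```

For example, `link_random_subset . RefOpt 10` would link ten randomly chosen individuals from the current directory into a `RefOpt` directory (both the directory name and the count are placeholders).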

jpuritz commented 3 years ago

Sensitivity matters for this process, and I don't want to sacrifice accuracy for speed.