mscharmann closed this issue 3 years ago
Generally, I don't recommend using full data sets to do reference assembly. See http://www.ddocent.com/quick/ for more info. Assemblies from subsets of the entire data set work best.
Sensitivity matters for this process, and I don't want to sacrifice accuracy for speed.
Hi, a recent run on a large dataset (both min thresholds set to 2) got stuck at the CD-HIT step, possibly the same problem as described in https://github.com/weizhongli/cdhit/issues/18
I suggest replacing CD-HIT in dDocent with linclust:
https://github.com/soedinglab/MMseqs2/wiki#linclust
I just tried it out: it produced almost the same result and was much faster than CD-HIT. You can install it via conda:
conda install -c conda-forge -c bioconda mmseqs2
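For anyone who wants to try this, here is a rough sketch of a linclust run using the `easy-linclust` workflow from MMseqs2. The file names (`uniq.fasta`, `linclust_out`, `tmp_linclust`) and the identity and coverage values are placeholders chosen for illustration. They are not dDocent defaults and not the exact settings from my test run. The script only prints the command if `mmseqs` or the input file is missing.

```shell
# Hypothetical file names; adjust to your own pipeline.
input="uniq.fasta"        # assumed: FASTA file with the sequences to cluster
prefix="linclust_out"     # output prefix chosen for this sketch
tmpdir="tmp_linclust"     # scratch directory used by MMseqs2

# --min-seq-id and -c are placeholder values, not tuned recommendations.
cmd="mmseqs easy-linclust $input $prefix $tmpdir --min-seq-id 0.9 -c 0.8"

if command -v mmseqs >/dev/null 2>&1 && [ -f "$input" ]; then
    # Run the clustering only when the tool and the input are available.
    $cmd
else
    # Otherwise just show what would be executed.
    echo "Would run: $cmd"
fi
```

If it runs, the representative sequences should end up in a FASTA file named after the prefix (here `linclust_out_rep_seq.fasta`), which would stand in for the CD-HIT output. Whether the thresholds above suit a given dataset is something to check against a CD-HIT run on a small subset first.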
Best regards, Mathias