cluster_smallmem ignores sequences that contain unknown nucleotides

thomasvangurp commented 7 years ago

The output of cluster_smallmem in vsearch is inconsistent with that of usearch:

usearch_8.0.1409_i86osx32 -cluster_smallmem /Volumes/5tb-deena/C101HW16120598/denovo/consensus.fa -id 0.95 -centroids /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.usearch.fa -sizeout -strand both
usearch v8.0.1409_i86osx64, 17.2Gb RAM, 8 cores
(C) Copyright 2013 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: thomasvangurp@gmail.com

00:23  45Mb  100.0% 36955 clusters, max size 68, avg 1.1

      Seqs  41390 (41.4k)
  Clusters  36955 (37.0k)
  Max size  68
  Avg size  1.1
  Min size  1
Singletons  34069 (34.1k), 82.3% of seqs, 92.2% of clusters
   Max mem  45Mb
      Time  24.0s
Throughput  1724.6 seqs/sec.

grep -c NN /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.fa
80

This gives 80 sequences that contain the string 'NN'. With vsearch this is not the case:

vsearch -cluster_smallmem /Volumes/5tb-deena/C101HW16120598/denovo/consensus.fa -id 0.95 -centroids /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.vsearch.fa -sizeout -strand both
vsearch v2.4.3_osx_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /Volumes/5tb-deena/C101HW16120598/denovo/consensus.fa 100%  
5714888 nt in 41390 seqs, min 32, max 274, avg 138
Masking 100%  
Counting unique k-mers 100%  
Clustering 100%  
Sorting clusters 100%
Writing clusters 100%  
Clusters: 36916 Size min 1, max 89, avg 1.1
Singletons: 34015, 82.2% of seqs, 92.1% of clusters

 grep -c NN /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.vsearch.fa 
0

So these 80 sequences are ignored. I need these sequences for epiGBS.

Attachments: consensus_cluster_usearch_output.fa.txt, consensus.fa.txt

frederic-mahe commented 7 years ago

The difference comes from masking: vsearch performs masking by default (note the Masking 100% in the log). With the option -qmask none you should get the result you expect.
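
A minimal sketch of that suggestion, assuming the same options as in the report with -qmask none appended. The file paths are shortened here for illustration, and the count_nn helper is hypothetical. It counts sequence lines rather than FASTA records, and it matches case-insensitively on the assumption (not stated in this thread) that masked residues may be written in lowercase.

```shell
# Count sequence lines (not headers) that contain "NN" in a FASTA file.
# -i also matches lowercase "nn"; this is an assumption about how masked
# residues might appear in the output, not something the thread confirms.
count_nn() {
    grep -v '^>' "$1" | grep -ci 'NN'
}

input=consensus.fa                       # shortened path, for illustration only
centroids=consensus_cluster.vsearch.fa   # shortened path, for illustration only

# Same options as in the report, plus -qmask none to disable masking.
# The guard skips the run when vsearch or the input file is not available.
if command -v vsearch >/dev/null 2>&1 && [ -f "$input" ]; then
    vsearch -cluster_smallmem "$input" -id 0.95 \
        -centroids "$centroids" -sizeout -strand both -qmask none
    count_nn "$centroids"
fi
```

If the count now matches the usearch result, masking was the only source of the difference.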

thomasvangurp commented 7 years ago

Ok, that solved it, thx!

torognes / vsearch

cluster_smallmem ignores sequences that contain unknown nucleotides #254