Closed thomasvangurp closed 7 years ago
The output of vsearch -cluster_smallmem is inconsistent with that of usearch:
usearch_8.0.1409_i86osx32 -cluster_smallmem /Volumes/5tb-deena/C101HW16120598/denovo/consensus.fa -id 0.95 -centroids /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.usearch.fa -sizeout -strand both
usearch v8.0.1409_i86osx64, 17.2Gb RAM, 8 cores
(C) Copyright 2013 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: thomasvangurp@gmail.com
00:23 45Mb 100.0% 36955 clusters, max size 68, avg 1.1
Seqs 41390 (41.4k)
Clusters 36955 (37.0k)
Max size 68
Avg size 1.1
Min size 1
Singletons 34069 (34.1k), 82.3% of seqs, 92.2% of clusters
Max mem 45Mb
Time 24.0s
Throughput 1724.6 seqs/sec.
grep -c NN /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.fa
80
We get 80 sequences which contain the sequence 'NN'. For vsearch this is not the case:
vsearch -cluster_smallmem /Volumes/5tb-deena/C101HW16120598/denovo/consensus.fa -id 0.95 -centroids /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.vsearch.fa -sizeout -strand both
vsearch v2.4.3_osx_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch
Reading file /Volumes/5tb-deena/C101HW16120598/denovo/consensus.fa 100%
5714888 nt in 41390 seqs, min 32, max 274, avg 138
Masking 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 36916 Size min 1, max 89, avg 1.1
Singletons: 34015, 82.2% of seqs, 92.1% of clusters
grep -c NN /Volumes/5tb-deena/C101HW16120598/denovo/consensus_cluster.vsearch.fa
0
So these 80 sequences are ignored. I need these sequences for epiGBS.
Attachments: consensus_cluster_usearch_output.fa.txt, consensus.fa.txt
The difference comes from masking: vsearch performs masking by default (note the Masking 100% in the log). With the option -qmask none you should get the result you expect.
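For reference, a minimal sketch of the re-run with masking switched off, followed by the same NN count. The function name recluster_unmasked is only a hypothetical wrapper for illustration, not part of vsearch. The options are the ones from the original command plus -qmask none.

```shell
# Re-run the clustering from the report with query masking switched off,
# then count the centroid lines that contain "NN".
# recluster_unmasked is a hypothetical helper name, not part of vsearch.
recluster_unmasked() {
    input=$1      # FASTA file to cluster
    centroids=$2  # where to write the centroid sequences

    # Same options as the original command, plus -qmask none.
    vsearch -cluster_smallmem "$input" -id 0.95 \
        -centroids "$centroids" -sizeout -strand both -qmask none || return 1

    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps that case from looking like a failure.
    grep -c NN "$centroids" || true
}

# Usage (substitute your own paths):
# recluster_unmasked consensus.fa consensus_cluster.vsearch.fa
```

Note that grep -c counts matching lines, not FASTA records, so the number equals the number of sequences only when each sequence has at most one line containing NN.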
Ok, that solved it, thx!