phglab / ALFATClust

Biological sequence clustering tool with dynamic threshold
GNU General Public License v3.0

Dealing with Ns #10

Closed by Sanrrone 11 months ago

Sanrrone commented 11 months ago

Dear phglab staff, I noticed that the software stops when Ns make up more than 5% of a sequence. In my case the Ns are part of the assembly and help keep each genome complete instead of split into fragments. The error message:

Validating input sequence file 'tmp.fna'...
'KX452695.1': Less than 95% of sequence characters represent ordinary DNA bases
'KX452696.1': Less than 95% of sequence characters represent ordinary DNA bases
'KX452698.1': Less than 95% of sequence characters represent ordinary DNA bases
'GQ153916.1': Less than 95% of sequence characters represent ordinary DNA bases
'JN815249.1': Less than 95% of sequence characters represent ordinary DNA bases
'JQ245971.1': Less than 95% of sequence characters represent ordinary DNA bases
---------------------------------------------
Estimated similarity range = [0.95, 0.7]
Estimated similarity step size = 0.025
Default DNA k-mer size = 17
Default protein k-mer size = 9
Default DNA sketch size = 2000
Default protein sketch size = 2000
Min. estimated similarity considered = 0.5
Disable reverse complement for DNA = False
No. of threads = 8
---------------------------------------------

Is there a way to tell the software to continue anyway?

Thank you in advance!

jimmykhchiu commented 11 months ago

Hi @Sanrrone,

You may lower this 95% threshold by changing the value at line 4 in Constants.py https://github.com/phglab/ALFATClust/blob/58f3d896d08294cb36e17e40a53e5a3fb0cc7b05/main/modules/Constants.py#L4
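
To choose a new value, it may help to check what fraction of each sequence is made up of ordinary bases. The following is a minimal sketch, independent of ALFATClust's own validation code. The function names are hypothetical, and counting only A, C, G and T (case-insensitive) as "ordinary DNA bases" is an assumption based on the wording of the error message.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: "ordinary DNA bases" means A, C, G and T only.
ORDINARY_BASES = frozenset("ACGT")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ordinary_base_fractions(lines: Iterable[str]) -> Dict[str, float]:
    """Map each sequence ID to its fraction of A/C/G/T characters."""
    fractions = {}
    for name, seq in read_fasta(lines):
        if not seq:
            continue  # skip empty records to avoid division by zero
        ordinary = sum(1 for ch in seq.upper() if ch in ORDINARY_BASES)
        fractions[name] = ordinary / len(seq)
    return fractions
```

Passing an open file handle for the input FASTA file (for example, tmp.fna) to ordinary_base_fractions returns one fraction per sequence. The smallest of these is the highest threshold at which every sequence would still pass, under the assumption stated above.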

Sanrrone commented 11 months ago

Thank you! Solved.