phglab / ALFATClust

Biological sequence clustering tool with dynamic threshold
GNU General Public License v3.0

ALFATClust can't locate mash even after specifying path #6

Closed by sanjanabhatnagar 1 year ago

sanjanabhatnagar commented 1 year ago

Hi,

I have been trying to run ALFATClust, but I keep getting the following error.


Estimated similarity range = [0.95, 0.75]
Estimated similarity step size = 0.025
Default DNA k-mer size = 7
Default protein k-mer size = 9
Default DNA sketch size = 2000
Default protein sketch size = 2000
Min. estimated similarity considered = 0.55
No. of threads = 8

Validating input sequence file 'oRNAconsensusPWM.fa'...
Estimating pairwise sequence distances...

Process aborted due to error occurred: [Errno 2] No such file or directory: 'mash'

jimmykhchiu commented 1 year ago

Hi @sanjanabhatnagar,

May I know the command you used to run ALFATClust?

Also, if you set up ALFATClust on your own rather than using the suggested Docker/Conda approach, make sure mash is installed and can be located via your PATH variable. You can run "which mash" to check whether it can be located.
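The same check can be done from Python, which is where the "[Errno 2] No such file or directory: 'mash'" message comes from when an executable cannot be found. This is only a minimal sketch: the helper name find_on_path and the optional extra_dir argument are made up for illustration and are not part of ALFATClust.

```python
import os
import shutil


def find_on_path(tool, extra_dir=None):
    """Return the full path of `tool` if it can be located via PATH, else None.

    If `extra_dir` is given, it is prepended to PATH for the current
    process only (hypothetical convenience, not an ALFATClust option).
    """
    if extra_dir:
        os.environ["PATH"] = extra_dir + os.pathsep + os.environ.get("PATH", "")
    return shutil.which(tool)


if __name__ == "__main__":
    location = find_on_path("mash")
    if location is None:
        print("mash was not found via PATH")
    else:
        print("mash found at", location)
```

A change made this way lasts only for the running process. To make it permanent, the directory containing mash has to be added to PATH in the shell configuration instead.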

Jimmy

sanjanabhatnagar commented 1 year ago

Hi Jimmy,

Thanks a lot for your prompt response! I really appreciate it.

You are right. When I tried conda, it worked. I guess I didn't know how to add the mash path to the system's PATH variable. However, it is running smoothly now. I just had a question about runtime.

I am trying to cluster 455 heptamers (motifs), so I don't think a sketch size of 2000 is the right choice for me. Can you tell me whether I can cluster my short motifs with ALFATClust at all, and what sketch size I should use in this case?

Thanks, Sanjana

jimmykhchiu commented 1 year ago

Hi Sanjana,

It seems you are clustering DNA/RNA sequences of only 7 bp (heptamers). If I get this right, then your choice of k should be even lower than 7 (the default shown in the parameter list above), as mash needs to extract k-mers from each of your input sequences for the min-hash computation. However, this means the k-mer space is very limited, and the mash distances would be substantially underestimated compared with the actual pairwise sequence distances (due to random k-mer matches). Please let me know the approximate sequence length if my guess is wrong.
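To put rough numbers on this, the sketch below counts how many k-mers a sequence of a given length yields, how large the k-mer space is, and how likely one specific k-mer is to turn up in a random sequence of that length. It is a back-of-the-envelope illustration, not ALFATClust or mash code: kmer_stats is a made-up helper, and the probability treats k-mer positions as independent, which is only an approximation.

```python
def kmer_stats(seq_len, k, alphabet_size=4):
    """Return (k-mers per sequence, size of k-mer space, P(random hit)).

    P(random hit) approximates the chance that one specific k-mer occurs
    in a random sequence of length `seq_len`, assuming independent
    positions (an assumption made for illustration).
    """
    if not 1 <= k <= seq_len:
        raise ValueError("k must be between 1 and the sequence length")
    n_kmers = seq_len - k + 1          # overlapping windows of width k
    space = alphabet_size ** k         # number of distinct possible k-mers
    p_random = 1.0 - (1.0 - 1.0 / space) ** n_kmers
    return n_kmers, space, p_random


if __name__ == "__main__":
    for k in range(3, 8):
        n_kmers, space, p_random = kmer_stats(7, k)
        print(f"k={k}: {n_kmers} k-mers per heptamer, "
              f"space={space}, P(random hit)={p_random:.4f}")
```

A heptamer yields at most 7 - k + 1 k-mers, so a sketch size of 2000 is far larger than the number of k-mers available, and the small k-mer space is what drives the distances.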

You can find the details of sketches here: https://mash.readthedocs.io/en/latest/sketches.html.

Jimmy

sanjanabhatnagar commented 1 year ago

Hi Jimmy,

I apologize for the slow response. I have been looking into mash k-mer and sketch sizes. Thanks for sharing the link as well. I figured that the k-mer size for the computations should be less than 7 nt and have been working with shorter k-mers. And you rightly pointed out that there is a chance of random k-mer matches causing sequences to be grouped together.

Also, I guess I had a slightly different question. As I am working with RBP motifs, is there a way I could turn off the reverse complement function? In my case I saw reverse complements getting grouped together with other motifs when originally they are not related in that way. Let me know if it's possible.

Thanks, Sanjana

jimmykhchiu commented 1 year ago

Hi Sanjana,

You may try adding the -n parameter to the mash command here. If it works, then I will make a new input argument so it does not consider reverse strands.
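For context, mash by default ignores strand by using canonical k-mers (the alphabetical minimum of each k-mer and its reverse complement), and -n tells it to preserve strand. The sketch below only illustrates that idea on plain k-mer sets. It does not reproduce mash's hashing or sketching, it assumes a DNA alphabet (ACGT), and the motif and function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (ACGT only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq, k, preserve_strand=False):
    """Set of k-mers; canonical (strand-agnostic) unless preserve_strand."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not preserve_strand:
            # Canonical form: the smaller of the k-mer and its reverse complement
            kmer = min(kmer, reverse_complement(kmer))
        kmers.add(kmer)
    return kmers


def jaccard(a, b):
    """Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    motif = "GCATGAA"                 # hypothetical heptamer
    rc = reverse_complement(motif)    # "TTCATGC"
    print("canonical:", jaccard(kmer_set(motif, 4), kmer_set(rc, 4)))
    print("strand preserved:",
          jaccard(kmer_set(motif, 4, True), kmer_set(rc, 4, True)))
```

With canonical k-mers, a motif and its reverse complement produce identical k-mer sets and so look the same. With strand preserved, they share only k-mers that happen to be their own reverse complement.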

Jimmy

sanjanabhatnagar commented 1 year ago

Hi Jimmy,

Thanks a lot for your help. I was able to prevent reverse complements from being clustered together. I am sure reverse complements are of great use in DNA datasets; however, for RNA it is different, or at least it is in my case with RBP motifs. :)

Thanks, Sanjana