Seed for minhash algorithm

refresh-bio / kmer-db

Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).

GNU General Public License v3.0


Seed for minhash algorithm #11

Open CBorreda opened 4 years ago

CBorreda commented 4 years ago

I've read the kmer-db paper and wasn't able to find anywhere whether kmer-db uses a seed for minhashing during the build step. I've run kmer-db build twice (from KMC-counted k-mers) and it seems to use the same seed every time, since the results are identical. Is there a way to alter this seed? I'd like to generate ~100-200 distance matrices and use them as support values for the distance estimations, but I would need to minhash with a different seed each time.

Best

Carles

agudys commented 4 years ago

Dear Carles,

At the moment there is no possibility to use different seeds - we will add this feature in the next release. In the meantime, you can try generating a distance matrix without minhashing (no -f parameter specified) to obtain more stable results. How large is your dataset?

Regards, Adam

CBorreda commented 4 years ago

Yes, I know I could use the full set of k-mers, but that is too much to run on my machine.

I am analyzing 75 samples resequenced with Illumina. I used -ci5 in KMC to get rid of erroneous k-mers (those with a count lower than 5) since, as far as I understand, they might inflate the RAM usage of kmer-db build. I checked, and 5 is the highest abundance cutoff I can use for my samples, mainly because of some low-coverage samples I need to keep in the dataset.

I have run the whole pipeline (build, all2all and distance) for 3% of the k-mers, and it took about 40% of my RAM. I could try to increase the fraction to 5 or 10%, but I don't think I will be able to use the whole dataset. Still, the tree looks good so far; I just want to give it some bootstrap support. Since I have other projects to work on, I could switch to one of them for a while and come back later to check whether the feature has been implemented. I see this project is under constant development.

Best Carles

agudys commented 4 years ago

Actually, there is something you could use. There is an undocumented option, -f-start, that was designed to process all k-mers in portions. It represents the relative minimum threshold of the minhash filter (while -f is the filter width). Therefore, you can, for instance, run kmer-db 10 times, with each run analyzing a different 10% of the k-mers:

-f 0.1 
-f 0.1 -f-start 0.1
-f 0.1 -f-start 0.2
...
-f 0.1 -f-start 0.9

It's not exactly bootstrapping (the sampling is without replacement), but maybe you will find it useful.
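To make the windowing idea concrete, here is a minimal Python sketch of a threshold-based minhash filter. It assumes that each k-mer is hashed to a value in [0, 1) and kept when that value falls in [start, start + width). The hash (BLAKE2b from hashlib) and the names unit_hash and in_window are illustrative choices, not kmer-db's internals.

```python
import hashlib
from itertools import product


def unit_hash(kmer: str) -> float:
    """Map a k-mer to a pseudo-random value in [0, 1).

    Illustrative hash only (assumption): kmer-db uses its own hash function.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    # Keep the top 53 bits so the result is an exact double strictly below 1.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def in_window(kmer: str, start: float, width: float) -> bool:
    """Keep a k-mer if its hash lies in [start, start + width)."""
    return start <= unit_hash(kmer) < start + width


# Toy universe: all 4096 possible 6-mers.
kmers = ["".join(p) for p in product("ACGT", repeat=6)]

# Ten windows of width 0.1, mirroring the ten runs listed above.
windows = [(i / 10, 0.1) for i in range(10)]
subsets = [[k for k in kmers if in_window(k, s, w)] for s, w in windows]

for (start, width), subset in zip(windows, subsets):
    print(f"-f {width} -f-start {start}: {len(subset)} k-mers")
```

Under these assumptions the ten windows split the k-mers into disjoint portions of roughly equal size, which is why the runs behave like a partition of the data rather than bootstrap replicates.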

agudys commented 4 years ago

I accidentally sent only half of the comment, but it's been edited now :)

CBorreda commented 4 years ago

Very nice! You're right, this is not exactly what I was looking for (due to the lack of replacement in sampling), but it will for sure allow me to do some testing of the robustness of the tree. Still, I'll check for updates on the main request about the seeding.

I was wondering how this option would handle overlapping windows, say

-f 0.1 -f-start 0
-f 0.1 -f-start 0.01
-f 0.1 -f-start 0.02

I guess it would resample (not randomly though) part of the kmers?

Best Carles

agudys commented 4 years ago

Exactly, you'll have overlapping k-mer spectra used in distance calculation. To have real bootstrapping, different seeds are needed. We'll work on that.
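The difference between shifted windows and different seeds can be sketched in Python. This is a toy model under stated assumptions: k-mers are hashed to [0, 1) with a keyed BLAKE2b hash, where the key stands in for a hypothetical seed, and a k-mer is kept when its hash falls in [start, start + width). None of the names below come from kmer-db.

```python
import hashlib
from itertools import product


def unit_hash(kmer: str, seed: int = 0) -> float:
    """Hash a k-mer to [0, 1); the seed is used as the BLAKE2b key (assumption)."""
    key = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, key=key).digest()
    # Top 53 bits give an exact double strictly below 1.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def select(kmers, start: float, width: float, seed: int = 0) -> set:
    """Return the k-mers whose hash lies in [start, start + width)."""
    return {k for k in kmers if start <= unit_hash(k, seed) < start + width}


# Toy universe: all 16384 possible 7-mers.
kmers = ["".join(p) for p in product("ACGT", repeat=7)]

# Same seed, window shifted by 0.01: most of the selection is reused.
w0 = select(kmers, 0.00, 0.1)
w1 = select(kmers, 0.01, 0.1)
shifted_overlap = len(w0 & w1) / len(w0)

# Same window, different seeds: the selections are nearly independent.
s1 = select(kmers, 0.0, 0.1, seed=1)
s2 = select(kmers, 0.0, 0.1, seed=2)
seeded_overlap = len(s1 & s2) / len(s1)

print(f"shared fraction, shifted windows: {shifted_overlap:.2f}")
print(f"shared fraction, different seeds: {seeded_overlap:.2f}")
```

In this model a window shifted by 0.01 is expected to share about 90% of its k-mers with the original, whereas two seeds are expected to share only about 10% (the window width), so seeded runs are much closer to independent subsamples.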

CBorreda commented 4 years ago

Hi there,

Have you managed to implement a way to specify a seed for the minhash algorithm, as we discussed? I have even tried to dig into your source code, but without C knowledge I can't really understand what's going on there.

Best,

Carles

agudys commented 4 years ago

Hello!

We had some ideas about it, but didn't want to provide a solution without testing whether it's properly random. We'll dig into that again soon and let you know.

Adam