refresh-bio / KMC

Fast and frugal disk based k-mer counter

BUG: kmc_tools cannot analyze pre+suf files, but kmc_dump could #82

Open gyslovecj opened 6 years ago

gyslovecj commented 6 years ago

Test data: GCF_000392535.3_ASM39253v3_genomic.fna.gz, downloaded from NCBI. For this file, we can use kmc (./kmc -k18 -ci1 -cs1 -fm -f filename outputname ./tmp_data) to generate the .pre and .suf files, and then get a human-readable file with kmc_dump (./kmc_dump outputname dumpfile). However, we cannot use kmc_tools to analyze these files (./kmc_tools outputname outputname union tools_file).

marekkokot commented 6 years ago

Thanks a lot for reporting! As it turns out, there was a bug in KMC, so even though kmc_dump was able to produce a dump file, the dump was not correct. It is now fixed in commit 2d47e8744c0d332ecdf4bc089b4a501a89ef73bf.

nsharma333 commented 5 years ago

I think that this may be the problem I am having currently. kmc_dump works with files created by kmc but I get the error "Error while reading suffix file" when trying to use kmc_tools.

I'm using KMC 3.1.0 and am having a bit of trouble compiling the latest release. Will you be releasing a binary for KMC 3.1.1RC1?

Thanks for making KMC available to users!

marekkokot commented 5 years ago

Hi, actually, I wasn't planning precompiled versions for the RC. Could you please tell me more about the trouble you had compiling the latest release?

Anyway, I am attaching a precompiled version for Linux to this comment (GH only allows zip archives, so the tar.gz is further compressed to zip). If your platform is not Linux, let me know which platform you are using.

KMC3.1.1.RC1.linux.tar.gz.zip

Thanks for using KMC :)

nsharma333 commented 5 years ago

Thanks for your help!

kmc_tools -v simple 21mers_with_target 21mers_without_target intersect 21mers

I've tried several variations and still get the error message. But never any issues with kmc_dump. Any thoughts? Here is a link to the database files in case you have time to review:

https://www.dropbox.com/s/k0phr5yipz65epe/KMC_database_files.zip?dl=0

compiler_output.txt

Thanks again!

marekkokot commented 5 years ago

Hi,

On some systems there are minor issues with static compilation. You may try removing the -static flag from the makefile, as well as -Wl,--whole-archive:

CC  = g++
CFLAGS  = -Wall -O3 -m64 -lpthread -std=c++11
CLINK   = -lm -O3 -lpthread -std=c++11

KMC_TOOLS_CFLAGS    = -Wall -O3 -m64 -lpthread -std=c++14
KMC_TOOLS_CLINK = -lm -O3 -lpthread -std=c++14
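
The edit described above can also be scripted. The following is only a rough sketch in Python, not part of KMC: the helper name `strip_static_flags` is hypothetical, the sample input line is an assumed shape of the original makefile entry, and treating `-Wl,--no-whole-archive` as part of the same flag pair is likewise an assumption.

```python
# Hypothetical helper, not part of KMC: drop static-linking flags from a
# makefile assignment such as "CLINK = ...".
# Assumption: -Wl,--no-whole-archive closes -Wl,--whole-archive and should be
# removed together with it.
STATIC_FLAGS = {"-static", "-Wl,--whole-archive", "-Wl,--no-whole-archive"}


def strip_static_flags(line: str) -> str:
    """Return the assignment line with the static-linking flags removed."""
    name, sep, value = line.partition("=")
    if not sep:
        # Not an assignment; leave the line untouched.
        return line
    kept = [tok for tok in value.split() if tok not in STATIC_FLAGS]
    return f"{name.rstrip()} = {' '.join(kept)}"


# Assumed shape of the original line; the real makefile may differ.
example = ("CLINK = -lm -static -O3 -Wl,--whole-archive -lpthread "
           "-Wl,--no-whole-archive -std=c++11")
print(strip_static_flags(example))
```

With the assumed input this yields the same CLINK line as the one listed above.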

The second issue seems to be more important. Thanks for reporting and sharing files. It would be really appreciated if you could also send your input sequences used to create kmc databases and command lines used, so that I could reproduce your pipeline.

nsharma333 commented 5 years ago

Just tried recompiling without the -static and -Wl,--whole-archive flags and it worked perfectly! Compiled on the HPC without any errors. Thank you!

Here is a link to the FASTA files used to create the databases:

https://www.dropbox.com/s/q2tta8eeqly3g0a/KMC_Fasta.zip?dl=0

I am using these commands to create the KMC databases:

kmc -m300 -fm -k21 with_target.fasta 21mers_with_target /home/sharma/KMC/kmc_tmp_dir/
kmc -m300 -fm -k21 without_target.fasta 21mers_without_target /home/sharma/KMC/kmc_tmp_dir/

Note that the server has 384GB of memory (hence the -m300).

Thanks!

marekkokot commented 5 years ago

I have checked your databases and they are ill-formed, so I suspect something went wrong in KMC. The problem is that on the machine I used to reproduce your commands (equipped with 512GB of memory) the resulting databases seem to be OK.

What is the number of cores on your server? (Could you run KMC with the -v switch and send me the output?) Could you please try rerunning your commands with a smaller amount of memory? If the problem still occurs, could you check it on another machine? The databases produced by my run are available here: https://www.dropbox.com/sh/61cototf89erw4g/AABSVavLEw9xq1QA6MaeN_P2a?dl=0 (21mers is the output of the kmc_tools intersect command). When you create a KMC database, you can compare it with my output using the undocumented compare command of kmc_tools, like:

./kmc_tools compare <db1> <db2>
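
As a complement to the compare command, two databases can also be checked through their kmc_dump text output. The sketch below is an illustration in Python, not a KMC feature: it assumes each dump line holds one k-mer and its count separated by whitespace, and the function names `parse_dump` and `diff_dumps` are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple


def parse_dump(lines: Iterable[str]) -> Dict[str, int]:
    """Parse dump lines of the assumed form '<k-mer> <count>'."""
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        kmer, count = line.split()
        counts[kmer] = int(count)
    return counts


def diff_dumps(
    a: Dict[str, int], b: Dict[str, int]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (k-mers only in a, k-mers only in b, k-mers whose counts differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


# Tiny made-up dumps standing in for two runs of the same input.
run_one = parse_dump(["AAAC\t3", "AACG\t1"])
run_two = parse_dump(["AAAC\t2", "ACGT\t5"])
print(diff_dumps(run_one, run_two))
```

For real dumps, pass an open file object to `parse_dump` instead of a list. This loads both dumps into memory, so it only suits small test inputs like the ones in this thread.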

BTW. In general KMC is quite memory frugal; for example, for a dataset containing 736.4 Gbases (614.1 Gbytes gzipped) KMC used 33GB of memory to complete the computation. If the amount of memory specified with -m is not enough, KMC will use more memory to complete the computation (unless you use the -sm switch). Increasing the value of -m may improve performance. I do not know the size of the other inputs you plan to use, but for the simple case you sent I would suggest not specifying -m at all, because the default 12GB will certainly be enough. On the other hand, if you have slow disk access you may consider the -r switch (KMC will then keep intermediate files in memory instead of on disk). It may be more helpful than specifying -m300. Then again, due to the caching performed by Linux when accessing the disk, and some other technical details, KMC sometimes works better without -r, which is a little surprising.

Of course, none of this means that KMC is allowed to produce wrong results for some parameter configurations, so I would really like to fix this. Unfortunately, as I cannot reproduce your results, it is a little hard to find the cause of the bug.

nsharma333 commented 5 years ago

Thank you for your reply. I sincerely appreciate it. Your message helped me see that the issue was with kmc and not with kmc_tools. As you recommended, I played around with several of the command line parameters. What I figured out is that the number of threads is the primary issue. If I use just 1 thread (-t1), everything works properly every time, but if I use more than one thread (e.g. -t8 or -t36, or no -t parameter at all), I run into issues.

Not sure what is causing the thread issue, as I run many other programs that use multiple threads (e.g. BLAST). But it may have to do with the way the system is configured. As I mentioned, I'm running this on a university HPC. All nodes have 384 GB of memory and 36 Xeon cores. Attached are the slurm output files for -t1 and -t8. You will see that the -t8 run gives an error message.

But the bottom line is that I am now able to incorporate KMC into my pipeline! It is working perfectly for my needs (albeit using only 1 thread). If you have any thoughts on trying to get multiple threads working, I'm happy to try it out.

I'm planning to try the python binding soon too.

Thanks!

KMC_slurm.zip

marekkokot commented 5 years ago

Thanks for the info. I would really like to fix this issue, but I don't have access to an HPC, so I cannot reproduce this bug and find its cause. If you have some time, I could prepare a slightly modified KMC version that disables multithreading for some of its parts. In fact, for the first try, this can be configured with parameters. There are three parameters that set the number of threads for specific parts of KMC:

-sf - number of threads related to reading input files (as you have one input file, it is set to one)
-sp - number of threads used to extract so-called super k-mers from reads
-sr - number of threads used in the second stage of KMC

KMC will respect those parameters only if you specify all of them and do not specify the -t parameter. My guess is that the issue is related to the second stage, so I would suggest specifying -sf1 -sp7 -sr1. If the issue does not occur in this case, it means that something is wrong in the second stage, as I suspect. To dig deeper I would need to prepare a modified KMC, because there are a couple of parts in the second stage that are parallelized, and I could disable them one by one. If the problem still occurs with the parameters above, you may try configuring them as follows: -sf1 -sp1 -sr8.
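
The rule above (all three of -sf, -sp, -sr, and no -t) is easy to get wrong when scripting test runs. Below is a small sketch in Python that builds the two suggested command lines and rejects invalid combinations. The function `build_kmc_command` and the file names are hypothetical; only the flags themselves (-k, -fm, -t, -sf, -sp, -sr) come from this thread.

```python
from typing import List, Optional


def build_kmc_command(
    input_path: str,
    output_prefix: str,
    tmp_dir: str,
    k: int = 21,
    threads: Optional[int] = None,
    sf: Optional[int] = None,
    sp: Optional[int] = None,
    sr: Optional[int] = None,
) -> List[str]:
    """Build a kmc argument list for multi-FASTA input (-fm)."""
    stage = (sf, sp, sr)
    any_stage = any(v is not None for v in stage)
    cmd = ["kmc", f"-k{k}", "-fm"]
    if threads is not None:
        if any_stage:
            # Per the comment above, -sf/-sp/-sr are only respected without -t.
            raise ValueError("do not combine -t with -sf/-sp/-sr")
        cmd.append(f"-t{threads}")
    elif any_stage:
        if not all(v is not None for v in stage):
            raise ValueError("-sf, -sp and -sr must all be given together")
        cmd += [f"-sf{sf}", f"-sp{sp}", f"-sr{sr}"]
    cmd += [input_path, output_prefix, tmp_dir]
    return cmd


# First probe: second stage single-threaded. Second probe: first stage single-threaded.
probes = [dict(sf=1, sp=7, sr=1), dict(sf=1, sp=1, sr=8)]
for probe in probes:
    # Placeholder file names, for illustration only.
    print(" ".join(build_kmc_command("input.fasta", "out_db", "tmp_dir", **probe)))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting issues with paths.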

I absolutely understand that you probably will not have enough time for testing such cases, especially since it will probably require a couple of iterations of preparing modified KMC versions and trying them again and again, but if you could find the time it would be really appreciated.

I am glad that you will try the Python binding. My knowledge of Python is not very deep, so I am open to any suggestions. Also keep in mind that if performance is crucial for you, using the C++ API is a much better choice.

Thanks again!

nsharma333 commented 5 years ago

So sorry for the delay! I'll run some tests using the -sf, -sp, and -sr settings this week and report back the data. I appreciate you looking into this issue. Thanks and sorry again for the late reply.


marekkokot commented 5 years ago

Hello there, sorry for being intrusive, but maybe you have checked this and have some results?

nsharma333 commented 5 years ago

My sincere apologies for the delay! Yes you are right. The problem does seem to be with the second stage.

Using these parameters "-m256 -sf1 -sp36 -sr1 -r -fm -k21" does not result in any errors.

However, using these parameters "-m256 -sf1 -sp1 -sr36 -r -fm -k21" reproduces the error.

See attached log files.

I'd be happy to collect further data for you.

Thanks!


marekkokot commented 5 years ago

Don't apologize for the delay, I am glad that you reported this issue and are willing to help :) Thanks for the info. I cannot see the attached files; I suspect you responded by email, and I am not sure what GitHub does in such a case. Nevertheless, I prepared a slightly modified version of KMC. There is one part of the second stage that I suspect may cause this issue when it runs multithreaded, so in the attached file I force this part to always be single-threaded. It would be really helpful if you could run your test with this modified version. You may try specifying -t36 without -sf, -sp or -sr. If the result is OK, it means that the part of the code I suspect is in fact the reason for the issue. If the result is wrong, it means that another part is the reason. In both cases more digging will probably be needed, but it is a starting point, so I would really appreciate it if you could check it.

The package contains modified KMC source code (in fact only one line has changed) and precompiled binary for your convenience. The package: KMC.zip

Thanks again!

nsharma333 commented 5 years ago

Thanks again for looking into this issue. I did run the modified KMC this morning but the problem still seems to exist. See this link:

https://www.dropbox.com/s/j42l87w90x1wvpd/KMC_data_1_27_2019.zip?dl=0

Using -t1 seems to work fine but using -t36 appears to result in a database that is malformed as it doesn't work properly with kmc_tools.

I'm happy to run additional tests if it will help troubleshoot. I've been using KMC for a few weeks now and it is working great in my pipeline (just have to use -t1 but even one core is pretty fast). Thanks!

marekkokot commented 5 years ago

Hi, thanks for checking this out.

If you could kindly find the time to check the next attempt: KMC.zip.

I hope it will tell us something more.

It is great that KMC fits into your pipeline :) I hope I will be able to solve this issue to speed up the computation even more :)

nsharma333 commented 5 years ago

Hi!

Apologies again for the delay. I did try the new test version of the software. Running it with -t36 doesn't generate any errors, but the database doesn't seem to be correct. See this link:

https://www.dropbox.com/s/nprp3ppfxqcyayv/KMC_data_2_3_2019.zip?dl=0

You will notice the final (.mers) file is different when KMC is run with -t1 vs -t36. Any ideas on what may be happening?

Thanks, Neil

marekkokot commented 5 years ago

Hi!

Thanks. Well, this is totally not what I expected...

Maybe it is somehow related to slurm (I have never used it before). Maybe I will try to install slurm on my system and check then. Could you please give me some tips on how to run software with slurm (for example, how you run kmc using slurm)? I am a little surprised by the order of the output in "slurm.err". For example, I don't understand why there are a couple of instances of stage 1, and only after them stage 2. I thought it would look more like:

stage 1: ...
stage 2: ...
summary of kmc run
stage 1: ...
stage 2: ...
summary of kmc run

I feel like it is somehow related to slurm and I will need to understand how it works.

On the other hand, could you please try running kmc with -r when you specify -t36, using the version I sent last time? With this option, KMC will not create intermediate files on disk, and I am curious whether the results will still be wrong. If not, it is a clue (though I still do not know what it would mean, and I suspect some slurm-related behavior). If the result is still wrong, it may be informative to check whether the new wrong result is the same as the previous one.

I know there is a lot of guessing, but I have no better idea :(
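
One quick way to check whether two runs gave the same wrong result is to hash the database files. This is only a sketch in Python under two assumptions: that a database consists of `<prefix>.kmc_pre` and `<prefix>.kmc_suf`, and that identical content yields identical bytes. Equal hashes show the runs match; different hashes do not by themselves prove the k-mer content differs. The function names are made up for this example.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large databases fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_database(prefix_a: str, prefix_b: str) -> bool:
    """True if both database files are byte-identical for the two prefixes."""
    return all(
        file_digest(prefix_a + ext) == file_digest(prefix_b + ext)
        for ext in (".kmc_pre", ".kmc_suf")
    )
```

For example, `same_database("run_t36", "run_t36_r")` would compare a run without -r against a run with -r; both prefixes here are placeholders.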

Thanks again for looking into this, Marek