refresh-bio / KMC

Fast and frugal disk based k-mer counter

Run stages separately #218

Open tbenavi1 opened 12 months ago

tbenavi1 commented 12 months ago

Hello,

I was wondering if it is possible to run stage 1 and stage 2 separately? The reason why I am asking is that I am trying to optimize how many processes to use for each stage.

On that note, does stage 1 only use one process? If so, it would be great if I can run the stages separately. Thanks for any advice/suggestions.

Perhaps I need a refresher on the differences between the following options:

  -sf<value> - number of FASTQ reading threads
  -sp<value> - number of splitting threads
  -sr<value> - number of threads for 2nd stage

Thanks.

marekkokot commented 12 months ago

Hi,

Stage 1 is also multithreaded. Threads are used best when there are many gzipped input files. If -sf, -sp, and -sr are all specified, KMC uses those values; otherwise it takes -t and sets these parameters automatically. -sf and -sp are the stage 1 threads. In general, we don't officially support running the stages separately. With some tricks in the code (enabling some defines, etc.) it is possible to stop after stage 1, and conceptually it would be possible to run stage 2 separately, but for now there is no code to do this (it would probably require dumping some additional metadata that is currently kept only in memory).
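To make the thread-option rule concrete, here is a minimal sketch in Python that builds a KMC command line. The helper names, the k-mer length, and the example paths are hypothetical, and the -fbam input flag is assumed from the KMC usage text, so check `kmc` without arguments for your version. It only mirrors the rule described above: pass -sf/-sp/-sr when all three are given, otherwise fall back to -t.

```python
def kmc_thread_args(total=None, sf=None, sp=None, sr=None):
    """Return KMC thread options as a list of command-line arguments.

    All of sf, sp, sr given -> explicit per-stage thread counts.
    Otherwise               -> -t<total>, letting KMC derive the rest.
    """
    if None not in (sf, sp, sr):
        return [f"-sf{sf}", f"-sp{sp}", f"-sr{sr}"]
    if total is not None:
        return [f"-t{total}"]
    return []  # no thread options: KMC picks its own defaults


def kmc_command(input_path, output_prefix, work_dir, k=21, **threads):
    """Assemble an argument list for a BAM input (hypothetical helper).

    Assumptions: the binary is called "kmc", -fbam selects BAM input,
    and k=21 is only a placeholder value.
    """
    return [
        "kmc",
        f"-k{k}",
        "-fbam",
        *kmc_thread_args(**threads),
        input_path,
        output_prefix,
        work_dir,
    ]
```

For example, `kmc_command("sample.bam", "out", "tmp", sf=1, sp=4, sr=8)` yields a list that can be passed to `subprocess.run`, while `kmc_command("sample.bam", "out", "tmp", total=8)` leaves the split between stages to KMC.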

Let me know what your input is (and what command you are using to run KMC), and maybe I will be able to give some advice or explain something. If you want to run several instances simultaneously, you could give each one a lower number of threads. But keep in mind that KMC writes a lot of data to disk. If you run a couple of instances on the same physical device, it may actually hurt performance, because the disk is more likely to become the bottleneck as all processes compete for disk access. Anyway, if you describe the dataset (and maybe your hardware), I will try to help more.

tbenavi1 commented 11 months ago

Thanks so much for the information.

My use case is that I am running KMC on BAM files that are hosted on a remote S3 bucket. (More precisely, I have mounted the files to a directory according to https://docs.icgc.org/download/guide/#mount-command). The documentation for this website states:

The file system implementation's performance is optimized for serial reads.
Frequent random access patterns will lead to very poor performance.
Under the covers, each random seek requires a new HTTP connection to S3 with the appropriate Range header set which is an expensive operation.
For this reason, it is only recommended for streaming analysis (e.g. samtools view like functionality).

So, the first important thing to note is that I am analyzing BAM files, not FASTQ files (I am not sure whether that matters for the KMC implementation). Next, I wasn't sure whether KMC would use multiple threads in stage 1 (each of which would require a new HTTP connection). After stage 1, the temporary files should be written to my local directory, so stage 2 should be able to proceed as normal with multiple threads.

I think I can just run KMC like normal and everything should be fine. I just wasn't sure whether things would slow down in Stage 1 due to reading from an S3 bucket. Thanks for any advice.

marekkokot commented 11 months ago

Oh, I see. BAM files are handled a little differently, but it should not matter here. KMC may, in fact, read from a couple of files at the same time, but each read is sequential and in quite large chunks (because random access is also harmful when reading from local HDDs). The -sf option may be used to limit the number of input files opened at once. Is your single run for multiple input BAMs, or is it always for a single BAM? In the latter case a single run will just read sequentially (as far as I remember; I implemented BAM support some time ago). You may try the following to benchmark:

  1. Run KMC as you do now, vs.
  2. Download (copy) the file to the local drive (by the way, it may be worth measuring the copying time) and run KMC on it. It may be worth first evicting the data from the RAM cache, for example using vmtouch -e <path_to_copied_bam>.

Compare the times. I think option 1 should be faster than option 2 as a whole (copy + KMC), but the KMC run from the local drive alone should be faster than option 1. If you do this, please let me know about the results. Thanks!
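The comparison above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the `kmc` invocation (k=21, -fbam) is a placeholder for whatever command you normally run, and the eviction step assumes vmtouch is installed. It records the direct run, the copy, and the local run separately so that both "copy + KMC" and "KMC from local drive" can be compared with the direct run.

```python
import shutil
import subprocess
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_kmc(bam_path, out_prefix, work_dir):
    # Assumed invocation; replace it with the exact command you normally use.
    subprocess.run(
        ["kmc", "-k21", "-fbam", bam_path, out_prefix, work_dir],
        check=True,
    )


def evict_from_cache(path):
    # vmtouch -e evicts the file's pages from the RAM cache.
    subprocess.run(["vmtouch", "-e", path], check=True)


def benchmark(remote_bam, local_bam, out_prefix, work_dir,
              run=run_kmc, copy=shutil.copyfile, evict=None):
    """Time option 1 (direct run) and option 2 (copy, then local run)."""
    _, direct = timed(run, remote_bam, out_prefix + "_remote", work_dir)
    _, copy_time = timed(copy, remote_bam, local_bam)
    if evict is not None:
        evict(local_bam)  # make sure the local run really reads from disk
    _, local = timed(run, local_bam, out_prefix + "_local", work_dir)
    return {
        "direct": direct,
        "copy": copy_time,
        "local": local,
        "copy_plus_local": copy_time + local,
    }
```

Calling `benchmark(mounted_path, local_path, "out", "tmp", evict=evict_from_cache)` returns the timings in seconds. The `run`, `copy`, and `evict` parameters are injectable so the steps can be swapped for whatever tooling is actually in use.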