marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Q: Is there a way to control the threads used by pigz? #290

Closed sklages closed 3 years ago

sklages commented 6 years ago

I'd like to have control over the number of threads used by pigz without modifying the source code. It seems that all cores are used for (de)compression even if I specify --cores?

That's not a very "social" approach when several users are sharing a single server .. :wink:

marcelm commented 6 years ago

Yes, the --cores option isn’t passed on to pigz. I thought about this, but I’m not certain whether I really want to do that. I cannot control how many cores are used anyway. Even now, --cores=x doesn’t give you that much control over how many CPU resources are used since that only sets the number of worker threads. There is the main thread and an additional thread that writes data; there’s a subprocess that decompresses gzipped input (or two for paired-end data); and then there can be two output pigz processes. Not all of the threads are busy 100% of the time, so ensuring that all provided cores are used is difficult.

I would probably just use nice to start the process.
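For example, to start it at the lowest scheduling priority (note that nice only lowers priority; it does not reduce the number of threads):

nice -n 19 cutadapt ...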

You can also restrict a process to run on a certain set of CPU cores with taskset. Example:

taskset -c 0-3 cutadapt ...

This would run cutadapt on the first four cores (cores 0 to 3). Note that if a second user does the same and enforces use of cores 0-3, then they would share these cores, so this is not quite the same as requiring that a process is limited to 400% CPU.

If you want to be thorough, you could also configure your server so that one user cannot take CPU resources from another. I have no idea how to do that, but I know it’s possible because our cluster is configured this way. It probably involves using cgroups.
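As an illustration only (not something I have tested): on servers managed by systemd, a per-invocation CPU cap via cgroups could look like

systemd-run --scope -p CPUQuota=400% cutadapt ...

which asks the kernel to limit that run to the equivalent of four cores (this typically requires root or delegated permissions).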

marcelm commented 6 years ago

No feedback, closing. Feel free to re-open if you have further thoughts.

Tjitskedv commented 6 years ago

I just started to use cutadapt with the cores option and was a bit negatively surprised that the process took far more than the 8 cores I had set. I think it would be better to pass the maximum number of cores on to pigz, and in the case of paired-end data to divide the set number of cores by two and pass that to pigz. Most often it isn't a problem to use one additional thread, but using 20 instead of 8 might be a problem and is, in my opinion, not very social towards other users of the same server. I use cutadapt in combination with Snakemake, so taskset isn't such a nice thing to build into the pipeline. I hope you will take this into consideration.

marcelm commented 6 years ago

Thanks for the feedback. I agree that cutadapt shouldn’t use a much greater number of cores than specified with -j. Are you sure the pigz subprocess was really the problem? How did you measure CPU usage?
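(One way to check, for reference: GNU time, as in /usr/bin/time -v cutadapt ..., reports "Percent of CPU this job got"; watching the process tree in top or htop during a run also works.)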

ranjit58 commented 5 years ago

I also agree that the program should not use more resources than specified on the command line. It matters when we are trying to make maximal use of resources in a large pipeline of several programs. In my case, this is causing the whole pipeline to stop. In the interim, can you point me to where in the source code I would make the changes to limit pigz or disable multi-core usage? Thanks.

marcelm commented 5 years ago

@ranjit58 Why did your pipeline crash? Which Cutadapt command did you run? What were the symptoms (error message etc.)?

My impression until now was that pigz shouldn’t take that many resources. Most of the total CPU time should be spent on Cutadapt itself processing the reads and even if n pigz threads are spawned on a n-core machine, most of them should sit idle because they are waiting for something to do. I could imagine that it is a bigger problem when demultiplexing because then so many output files are opened, and pigz is spawned for each of them.

Note that pigz is only used for writing, not reading.

sklages commented 5 years ago

I think in general this is a design decision. I use cutadapt in a cluster environment, where processes are pinned to CPUs via taskset. Using "all available cores" in this scenario is kind of "ugly" but does not cause technical issues.

This is completely different on our multi-core "multi-purpose" servers. There we have something like 80 or 160 cores per machine and working groups with a few bioinformaticians and biologists. Assume that half of them want to run cutadapt as part of an RNA/RRBS/WGBS/etc. pipeline ... that will conflict and in the worst case result in data loss if the system is heavily overloaded.

We haven't had that scenario yet, but then we have a large computing cluster and quite a few individual servers to circumvent it :-) Others may have only one or a few multi-core systems with more than three users ...

marcelm commented 5 years ago

Thanks for the explanation. I’m of course thrilled to learn that potentially so many people use Cutadapt at the same time on a single machine ;-).

But in earnest, I acknowledge that there is a problem; I just wanted to assess what priority fixing it should have for me.

A quick fix should be to limit the number of pigz threads to something like 2 or 4. I’ll try this when I have time.

sklages commented 5 years ago

Just take one or two flowcells of HiSeq 4000 paired-end WGBS data ... that is 8 to 16 samples, all for the same work group. The cluster is full, you have to wait.

The impatient will run all samples on a single server in parallel...

marcelm commented 5 years ago

I have experienced a crash myself when I used Cutadapt to demultiplex into 96 compressed output files on a 20-core machine. Since the dataset was paired-end, I ended up with 96×2×20=3840 pigz threads in total, which in most cases crashed the cluster job.

I have fixed that problem by limiting the number of threads used by pigz to at most four. This was a change in the xopen library, which Cutadapt uses. I have released version 0.5.0 of xopen with this change, and simply re-installing cutadapt or running pip install --upgrade cutadapt in an existing installation will enable the fix.
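For anyone who wants to apply such a cap in their own scripts, a minimal sketch against xopen's Python API (current xopen versions expose a threads parameter; whether it was already present in 0.5.0 is not stated here, and the file name is just an example):

    # Sketch: cap the number of pigz threads used for a compressed output file
    # by passing an explicit thread count to xopen.
    from xopen import xopen

    with xopen("trimmed.fastq.gz", mode="wb", threads=4) as f:
        f.write(b"@read1\nACGT\n+\nFFFF\n")  # compression uses at most 4 pigz threads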

This isn’t quite the same as what this issue is about, but should alleviate the biggest problems.

I’ll be happy to revisit this when I have taken care of other issues.

Maarten-vd-Sande commented 5 years ago

Would a (temporary) solution be to make use of cpulimit?

marcelm commented 5 years ago

I would see that as a solution that each user can decide to use if they want to. Integrating something like that into Cutadapt would also be a bigger project by itself.

Maarten-vd-Sande commented 5 years ago

Sure, but I think implementing a wrapper around cutadapt should be relatively easy. For users to do it themselves (e.g. for 4 cores): cpulimit --include-children -l 400 -- cutadapt [...]

marcelm commented 5 years ago

I cannot work on this at the moment, sorry. Feel free to send in a PR that adds this to the docs.

marcelm commented 5 years ago

Copied from #411:

I want to know exactly what the problem is, and so far no one has described to me what exactly the problem is with spawning many processes. What cluster system, what error message? As I said previously, I would assume that if too many pigz threads are spawned, they would just sit there and wait for work while the main Cutadapt process does its thing. I’m obviously wrong somehow because there are so many people for whom this doesn’t work, but in which way?

One thing I would like to know in particular: Is the pure number of processes or threads a problem? Processes are quite cheap as long as they don’t do anything. For me, running for i in {1..1000}; do sleep 10 & done is fast and doesn’t incur noticeable CPU usage.

Or, reading this issue again, is the problem mainly that Cutadapt doesn’t honor --cores?

The problem then is to find out how resources should be allocated to the different processes. Cutadapt itself has a number of workers (determined by --cores), a separate "reader" process that reads input and then it spawns a variable number of pigz processes to do the compression. The problem to me is how to determine the number of workers and the number of threads given to each pigz process based on --cores.
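To make that concrete, here is a purely hypothetical sketch (not Cutadapt’s actual code; the function and its policy are made up) of one way a --cores budget could be divided between workers and per-output pigz threads:

    # Hypothetical illustration: split a total core budget between worker
    # processes and pigz compression threads per compressed output file.
    def split_cores(total_cores: int, n_outputs: int) -> tuple[int, int]:
        # Give each compressed output at least one pigz thread, but never
        # spend more than half of the budget on compression.
        pigz_threads = max(1, min(4, (total_cores // 2) // max(1, n_outputs)))
        workers = max(1, total_cores - pigz_threads * n_outputs)
        return workers, pigz_threads

    # Example: --cores=8 with two compressed outputs (paired-end) -> (4, 2),
    # i.e. four workers plus two pigz threads per output, eight in total.
    # The main and reader processes are ignored here, and very small budgets
    # still end up oversubscribed, which is exactly the difficulty described above.
    print(split_cores(8, 2))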

bernt-matthias commented 5 years ago

Moving over from https://github.com/marcelm/cutadapt/pull/411 .. ping @rhpvorderman

I want to know exactly what the problem is, and so far no one has described to me what exactly the problem is with spawning many processes. What cluster system, what error message?

In our example we have a UNIVA grid engine (essentially equivalent to a SUN/ORACLE grid engine). For job submission one has to specify how many cores and how much memory will (at most) be used by the job. If the job uses more than these requested resources, it gets killed (automatically in the case of memory, manually by the cluster admins in the case of cores). For the number of cores it is often not a problem if the CPU usage is slightly higher than the requested number of cores for a short period of time.

I think the actual cluster engine is not important here; what matters is that on most high-performance clusters that I'm aware of such restrictions are in place.

If the usage stays higher for longer, then it becomes problematic.

As a user of an HPC system, parallelization within single jobs is usually not that interesting anyway, since on such a system users can start dozens or hundreds of independent jobs at once, which usually scales much better.

There can be a lot more than five output files. Three additional possible output files are created if --info-file, --rest-file or --wildcard-file are used. Many more output files are created if demultiplexing is done. All of these files are opened via xopen.

From the cutadapt docs:

    Multi-core Cutadapt can only write to output files given by -o and -p. This implies that the following command-line arguments are not compatible with multi-core:

            --info-file
            --rest-file
            --wildcard-file

    Multi-core is also not compatible with --format

    Multi-core is also not available when you use Cutadapt for demultiplexing.

So I think this is not a problem: the single cutadapt process will potentially spawn a lot of I/O processes via xopen, but only the cutadapt process and one of the I/O processes will actually do work at any given point in time...?

Maarten-vd-Sande commented 5 years ago

Just re-iterating: wrapping cpulimit around it would solve these problems (that's also how we solve the cluster problem in our group).

bernt-matthias commented 5 years ago

Thanks @Maarten-vd-Sande . Will need to discuss this with our admins. From the sources of the program I'm not sure how NCPU is determined.

marcelm commented 5 years ago

@bernt-matthias Ok, thanks for the thorough explanation.

If software simply takes all available cores.

Note that Cutadapt no longer takes all available cores. For writing, pigz is limited to four threads, and for reading, as we found out in marcelm/xopen#20, pigz spawns three extra threads.

Regarding --info-file etc., thanks for reminding me that these are not yet supported when running in multi-core mode ;-).

Only the cutadapt process and one of the I/O processes will actually do work at any given point in time...?

I’m guessing that more processes than these can actually be active at the same time. That is, even with --cores=1, CPU load can go above 100%. This can possibly be ignored for now, but I think that whatever type of load management is implemented, it needs to be taken care of eventually.

@Maarten-vd-Sande Thanks, perhaps incorporating this into Cutadapt is actually the way forward.

marcelm commented 5 years ago

So does everyone agree that the actual issue is that Cutadapt should interpret --cores=N to mean “use at most N cores”? Ideally, there would be no need for a user to set the number of threads used by pigz explicitly. It’s an implementation detail and Cutadapt should take care of this itself.

The option is currently used to set the number of worker processes. There are extra processes that aren’t counted. These include the main Python process that collects and combines the results from the workers, an extra process that distributes work to the workers, and of course the pigz processes if .gz files are involved.

So Cutadapt should be changed such that, for example, when --cores=1 is used or no option is provided at all, CPU usage should not go above 100%. When --cores=8 is used, CPU usage should not go above 800%. And this should hold no matter whether pigz is used or whether gzip compression is involved at all.

bernt-matthias commented 5 years ago

Exactly. This would be ideal.

Maarten-vd-Sande commented 5 years ago

I agree, this also makes working with cutadapt in pipelines much easier.

sklages commented 4 years ago

Agreed. Except --cores=8 should not exceed 800%.

marcelm commented 4 years ago

Agreed. Except --cores=8 should not exceed 800%.

Of course! I fixed the typo now.

bernt-matthias commented 4 years ago

@Maarten-vd-Sande our admins' comment on approaches like cpulimit, cgroups, ...:

The problem is then that the scheduler of the kernel has to manage a lot of context switches (since many processes have to share a few CPUs). This still affects other jobs running on the same host.

In short: they don't like it and would prefer a proper fix in the tool.

wookietreiber commented 4 years ago

Yes, indeed we would like a fix in cutadapt. I'm one of them pesky admins and I want to give you some insight into our perspective -- in the interest of greater understanding between devs, users and admins, better communication, and tearing down barriers :-)

Our Task / Goals

What we (admins) need to ensure is the overall stability and efficiency of the system. This includes that, when a user requests X CPU cores for a job:

  1. The CPU usage of the job does not exceed X CPU cores.
  2. The job does not have more than X processes/threads that want to use the CPU.

    This "want to use" includes processes waiting for I/O, i.e. on Linux both process states R (running) and D (uninterruptible sleep, typically I/O wait). Those count towards the load of the host. If the load exceeds the physical number of CPU cores, the OS/kernel process scheduler gets busier and a lot of context switches occur, because each process/thread still needs to get its fair share. This context switching can be quite expensive, wastes cycles and decreases the overall efficiency of the host -- in one word: overhead.

    When any of the mechanisms of core binding, cpulimit, cgroups, etc. are used, only the first condition/requirement can be ensured. However, we have seen jobs that requested only a single CPU core with core binding enabled while the app ran with 40 threads. All of those 40 threads were confined to a single CPU core due to core binding, and each thread was getting less than 2.5% CPU usage (less than, because of context switching!). Thus, these mechanisms do not ensure the second condition. For that condition to hold, the app has to be configured to use exactly as many processes/threads as the number of CPU cores the user requested.

    (As you all can figure out, this is what this issue is about.)

  3. The CPU usage of the job should be as close to X CPU cores as possible, i.e. the job should scale to the amount of CPU cores it requested.

    This is about making sure that resources are used as efficiently as possible and no resources are wasted. If a job requests resources (and thus consumes/blocks them) we should make damn sure that they don't idle while the job runs.

    Deviations from this are mainly due to these reasons:

    1. User requests X cores but $app is mis-configured to use less, worst case only 1 CPU core / process / thread.

      We kill those jobs and let the user know they made a mistake and how to fix it.

    2. I/O can't keep up and processes stay in process status D for the majority of the time.

      We take a close look with strace at the running $app to determine its I/O behavior. We suggest e.g. tmpfs or other storage solutions that better suit the I/O behavior. Ideally, though, we hack $app to improve its I/O behavior and provide a patch/pull request with the solution.

    3. Not all parts of $app are parallelized.

      Depends on target threshold. Our monitoring/alerting triggers below 80% CPU efficiency. If CPU efficiency of a job is lower, we recommend requesting fewer CPU cores, often even only 1 CPU core. Concurrency over parallelism (see note below).

    4. Load balancing of $app is garbage.

      This can mean that the load balancing algorithm is not ideal for the workload. Solutions include chunking when tasks have short runtimes, or work stealing when task runtimes have high variance. What also helps: concurrency over parallelism.

Note: Concurrency over parallelism means that, for the entire system (and that's our concern), it's more efficient to have 100 single-core jobs with 100% efficiency than e.g. 25 four-core jobs with 50% CPU efficiency. Domain decomposition etc. can help in sizing the workload to get from 25 four-core jobs to 100 single-core jobs. You get the concurrency for free simply by using the job scheduler, which runs/handles multiple jobs at the same time, i.e. submitting/running thousands of single-core jobs with high efficiency is cheaper than submitting/running a few big jobs with low efficiency. In other words: just because you can use parallelism does not mean you should.

wookietreiber commented 4 years ago

About Thread Pools

As I understand it from your comments above, the way cutadapt works at the moment is, abstractly, that you have three thread pools:

  1. The main cutadapt application, configurable via -j/--cores.
  2. The spawned pigz process, uses number of physically available CPU cores, currently not configurable.
  3. Abstractly, the I/O thread, which can be seen as a thread pool with a fixed size of 1. It takes blocks from pigz and distributes them to the main application workers, and/or does the analogous work for writing.

I have seen a few other applications/libraries do this similarly, i.e. split I/O off from the main application, most notably in async/non-blocking approaches where you want a dedicated I/O thread pool that does all the blocking operations while your main thread pool only handles the non-blocking ones. The questions I always ask myself about splitting the thread pools like this are:

  1. What is the wallclock time ratio between main worker thread pool and I/O thread pool? Is this fixed for all workloads or does it change with inputs?
  2. How do you need to size both thread pools so that both run at (or at least close to) 100% CPU efficiency?
  3. Which thread pool is the bottleneck?
  4. If I/O is the bottleneck: Do my I/O source(s)/sink(s) scale up with higher number of readers/writers?
  5. If main workers are the bottleneck, i.e. time spent getting a block of I/O is small compared to the time needed to process that block, why not reduce complexity of different thread pools and just have the workers do I/O?

My Recommendation for HPC Clusters / Job Scheduling

You already did the preparation in https://github.com/marcelm/xopen/pull/23. Now, to apply this to cutadapt, we first need to answer these questions from above:

What is the wallclock time ratio between main worker thread pool and I/O thread pool?

Is this fixed for all workloads or does it change with inputs?

Which thread pool is the bottleneck?

I don't know cutadapt well enough, to answer this first set of questions. I'm providing recommendations for both outcomes.

worker thread pool is (always) the bottleneck, i.e. the wallclock time of the workers is (much) greater than the wallclock time of I/O, i.e. the I/O threads idle most of the time, independent of input / workload

Make sure that if --cores $amount is specified, there are only $amount processes/threads, no special I/O thread(s), no spawned external (de)compression tools. Use this procedure:

  1. 1st worker reads
  2. all workers perform parallel algorithm
  3. 1st worker writes
  4. repeat until EOF of input

This scales up/out depending on the exact ratio, pretty much exactly according to Amdahl's Law. If you know the ratio and have a CPU efficiency threshold, you can recommend a maximum number of CPU cores for users to request. If the algorithm changes and the ratio with it, change the recommendation accordingly. Problem solved.
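For concreteness, a small illustration of turning a measured serial fraction and an efficiency threshold into a core recommendation (a generic sketch, not specific to cutadapt):

    # Use Amdahl's Law to turn a serial (e.g. I/O) wallclock fraction and a
    # CPU-efficiency threshold into a maximum recommended core count.
    def max_cores(serial_fraction: float, min_efficiency: float, cap: int = 128) -> int:
        best = 1
        for n in range(2, cap + 1):
            speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)
            if speedup / n >= min_efficiency:  # efficiency = speedup / cores
                best = n
        return best

    # Example: with 5% serial time and an 80% efficiency threshold,
    # recommend requesting at most 6 cores.
    print(max_cores(0.05, 0.80))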

either the I/O thread pool is the bottleneck or it's somewhere in the middle

Answer these:

How do you need to size both thread pools so that both pools run at 100% CPU efficiency?

Do my I/O source(s)/sink(s) scale up with higher number of readers/writers?

You definitely will need to benchmark to figure that out, because it depends on both CPU and storage of the system you want to target. Whatever the results are, you need command line arguments for both the worker thread pool size and the reader thread pool size.

The answer could be: on system X you need 3 workers per I/O thread, scaling up to 12 workers and 4 I/O threads, because starting with 5 I/O threads the storage can't read/write fast enough.

Benchmark this, put results in the user documentation / wiki of system X. Problem solved.

Note: If the wallclock time ratio between workers and I/O threads is in fact fixed, the admins of system X can do the benchmarking and provide the results for all users. If it varies by input, then depending on the degree of variation users might need to benchmark their specific workload / inputs to increase CPU efficiency. Admins could still provide estimates as ranges, e.g. "3-5 workers per I/O thread".

marcelm commented 4 years ago

Perhaps I should clarify that – contrary to what I wrote in one of the earliest comments in this issue – I acknowledge that there is a problem, and that I am working towards solving it.

However, Cutadapt isn’t my main job, so I need to proceed at my own pace. I can fortunately spend some of my working hours on personal (bioinformatics) projects and have used a lot of that time for Cutadapt. I’m motivated to make Cutadapt work for others, even if I personally don’t benefit from it – but it needs to remain fun. As long as I can do the things I want, it’s fine, but when the discussion moves into territory where I get the impression that demands are being made, the fun stops. To be sure, the tone in this thread has been civil and reasonable; it is just the sheer amount of text that is not helping.

Let me figure this out one step at a time. Currently, I’m trying to make --cores=1 use exactly one core by doing all compression and decompression in-process, without calling an external process at all. With a single core, Cutadapt doesn’t spawn any worker or I/O processes anyway, so this should be relatively easy. Perhaps I will be able to get this done this week or next.
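To give an idea of what "in-process" means here, a minimal sketch using only Python’s built-in gzip module (not the actual Cutadapt code):

    # Sketch: read and write gzip entirely in-process, so no pigz/gzip
    # subprocess is spawned and CPU usage stays within the single core.
    import gzip

    def copy_fastq(in_path: str, out_path: str) -> None:
        with gzip.open(in_path, "rb") as fin, gzip.open(out_path, "wb") as fout:
            for line in fin:
                fout.write(line)  # a real run would trim adapters here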

@wookietreiber Thanks for your insights. I don’t have the time to reply at the moment, so this may need to wait.

One correction: Cutadapt no longer lets pigz use all available cores for compression. This has been limited to 4 since a while back. And decompression has recently been limited to one external process.

wookietreiber commented 4 years ago

@marcelm Thank you for the response and sorry about the wall of text(s). Let me know if and how I can help.

marcelm commented 4 years ago

I’ve just pushed a commit that makes Cutadapt no longer use subprocesses for gzip compression when --cores=1 is used (or when --cores is not specified). Input files are still read through a pigz process (using one thread) because its gzip decompression is more efficient. Total CPU usage is exactly 100%, though (it appears that the two processes never run at the same time).

I think I may remove this reader subprocess as well because gzip decompression is just 2.5% of the total time (when reading a .fastq.gz, removing one adapter, and writing to a .fastq.gz).

In case anyone is wondering why this was ever done with subprocesses: gzip compression and decompression using Python’s built-in gzip module used to be very slow, so using an external gzip process was a workaround to get good speed (pigz came later). Nowadays the two are equivalent, so we can go back to the built-in module.

I’ll start looking into the multi-core case, as time permits.

Maarten-vd-Sande commented 4 years ago

Good, this solves the problem with shared nodes in clusters. :tada:

marcelm commented 3 years ago

I’m closing this issue now because it appears that the externally spawned subprocesses haven’t been a problem since I made the above changes. The following additional changes were made, and I’m listing them here in case anyone is interested or in case they lead to problems (please open a new issue if that happens):