Processes sometimes become locked during variant calling

gwct commented 3 years ago

On some datasets (Peromyscus), processes that are spawned to call variants with GATK sometimes get locked or do not spawn a new GATK process, resulting in only 1 process being used to call variants. This is obviously a major problem since one of the big advantages of pseudo-it is that it allows users to call variants in parallel and drastically speed up run times.

Some notes about this issue:

All but one spawned process and all parent processes are asleep (S state in htop). Only one process is running (R) and appears to be running GATK correctly (see htop screenshot below).
This is not node specific on the UM cluster.
Other datasets, specifically the Phodopus dataset and an example exome dataset, don't seem to do this.
GATK itself spawns lots of threads, but this is normal according to this discussion. This is more threads than requested in the job or by pseudo-it, but they apparently don't count towards the resources allocated, if I'm reading that thread correctly.

Things to try:

[x] Create the process pool earlier in the code so that not all of the parent's memory is copied to each fork. Looking at the screenshot below, it is possible that this is a memory issue.
[x] Change how processes are called. Currently using starmap, but I've had memory issues with this function since it stores results until all processes are complete. There really isn't anything stored for each process here though. Should try imap_unordered anyways.
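[ ] Pre-chunk scaffolds from the genome to ensure that all the large scaffolds aren't being passed to the same process. I didn't think this was the issue because according to this the default chunksize for imap is 1, but map and starmap pick a larger default chunksize based on the number of tasks and the pool size, so starmap may be handing several neighboring scaffolds to the same process.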
[ ] Simply lowering the number of threads... maybe GATK just uses too much memory with this dataset. In that case, better to keep things going in parallel, even if each large scaffold can't have its own process.
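
A minimal sketch of the dispatch ideas in the list above, not the actual pseudo-it code. It assumes scaffolds are available as (name, length) pairs; `call_scaffold` is a hypothetical stand-in for the per-scaffold GATK call, and `default_map_chunksize` mirrors how current CPython picks a chunk size for map and starmap when none is given (an implementation detail, not a documented guarantee). Sorting largest-first and using imap_unordered with chunksize=1 means each worker takes one scaffold at a time, so the big scaffolds cannot all land in one pre-assigned batch.

```python
import multiprocessing as mp


def default_map_chunksize(n_tasks, n_procs):
    """Chunk size CPython's Pool.map/starmap use when chunksize is None."""
    chunksize, extra = divmod(n_tasks, n_procs * 4)
    return chunksize + 1 if extra else chunksize


def largest_first(scaffolds):
    """Sort (name, length) pairs so the longest scaffolds are dispatched first."""
    return sorted(scaffolds, key=lambda s: s[1], reverse=True)


def call_scaffold(scaffold):
    """Hypothetical stand-in for the per-scaffold GATK call."""
    name, length = scaffold
    # A real worker would launch GATK for this scaffold here.
    return name


def call_all(scaffolds, n_procs):
    """Run call_scaffold on every scaffold, one task per worker at a time."""
    finished = []
    # Creating the pool before any large objects are loaded keeps each
    # forked worker from inheriting memory it does not need.
    with mp.Pool(processes=n_procs) as pool:
        # chunksize=1 lets an idle worker pick up the next scaffold instead
        # of waiting behind a batch assigned to a busy worker.
        tasks = largest_first(scaffolds)
        for name in pool.imap_unordered(call_scaffold, tasks, chunksize=1):
            finished.append(name)
    return finished
```

With 1000 scaffolds and 8 processes, `default_map_chunksize(1000, 8)` gives 32, so if the genome lists its largest scaffolds first, starmap's default could send the 32 biggest ones to a single worker. Calling `call_all(scaffolds, 8)` avoids that batching, though it does nothing about per-process GATK memory use.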

Screenshot of htop on the Peromyscus data: pseudo-it-procs

liphardt commented 2 years ago

I am having the same issue with a Phyllotis dataset. Now that the run has reached the larger scaffolds, it is processing them sequentially. phyllotis_pseudoit

gwct commented 2 years ago

Thanks for pointing this out. It's odd how this only seems to affect some datasets. I think I'm familiar with the cluster you're running on and I can see that you're on compute-0-24. Have you tried a node with more memory?

However, as per #6, this will likely not be fixed in this version of pseudo-it since I'm aiming to re-implement the whole pipeline with Snakemake. Hopefully that will be mostly done (or at least in a usable form) by the end of August. Happy to chat elsewhere about this too since you likely don't want to wait that long.

goodest-goodlab / pseudo-it

Processes sometimes become locked during variant calling #5