labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Hanging processes #300

Open ericolo opened 2 weeks ago

ericolo commented 2 weeks ago

Hello,

I am running ppanggolin on about 40K species on a Slurm-managed cluster, and after some time the multi-threaded processes hang indefinitely. My initial thought was that the total memory used by all the processes exceeded the available memory, and indeed, by lowering the number of CPUs I was able to run more of them. But it still ends up hanging after some time, no matter the number of genomes in the input.

This is how I'm using ppanggolin:

```bash
readarray -t LIST < pangenome_out_ALL/list_chunks_2.txt
FILE=${LIST[$SLURM_ARRAY_TASK_ID]}

for chunk in $(cat pangenome_out_ALL/chunk_lists_2/${FILE}); do
    if ! ppanggolin panmodule \
            --anno list_gtdb_species_3/species_metadat_IMG_GTDB_fixed-$chunk.tsv \
            -c 16 -o pangenome_out_ALL/output/$chunk \
            --clusters clu_gtdb_species_ALL/prot_clu-$chunk.tsv \
            --infer_singletons --rarefaction; then
        echo "########### failed for $chunk"
        rm -r pangenome_out_ALL/output/$chunk
    fi
done
```

I used the for loop hoping that failed runs would simply fail and let the next chunk start, rather than hang...

This is the output just before I cancel the job because it's hanging:

```
 66%|██████▌   | 356/540 [00:05<00:02, 67.92samples partitioned/s]
2024-11-05 12:04:20 partition.py:l226 WARNING Partitioning did not work (the number of genomes used is probably too low), see logs here to obtain more details /tmp/tmpfv6pgcr0/17
 89%|████████▉ | 481/540 [00:07<00:01, 58.32samples partitioned/s]
2024-11-05 12:04:22 partition.py:l226 WARNING Partitioning did not work (the number of genomes used is probably too low), see logs here to obtain more details /tmp/tmpfv6pgcr0/54
 91%|█████████ | 492/540 [00:07<00:00, 69.42samples partitioned/s]
2024-11-05 12:04:22 partition.py:l226 WARNING Partitioning did not work (the number of genomes used is probably too low), see logs here to obtain more details /tmp/tmpfv6pgcr0/56
100%|█████████▉| 539/540 [00:18<00:00, 43.59samples partitioned/s]
slurmstepd: error: *** JOB 12448523 ON n0052.dori0 CANCELLED AT 2024-11-06T09:42:15 ***
```

This is what I see with `top -u myusername` when I log in to the node: [screenshot of `top` output]

Is this normal? Have you ever experienced this? I have no idea what causes the problem or how to fix it, and I'm not really sure whether it's relevant to create this issue, so I apologize if it's not. The only solution I see is setting a timeout (see the sketch below), but since my species come in very different sizes, it is hard to pick a duration that is neither too long nor too short.
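A minimal sketch of that timeout idea, assuming GNU coreutils `timeout` is available on the compute nodes; the 2-hour limit and the `--kill-after` grace period are arbitrary placeholders:

```bash
# Sketch only: wrap each run in `timeout` so a hung process is killed
# instead of blocking the loop forever. 2h is a guessed upper bound;
# --kill-after sends SIGKILL if the process ignores SIGTERM for 60s.
if ! timeout --kill-after=60 2h ppanggolin panmodule \
        --anno list_gtdb_species_3/species_metadat_IMG_GTDB_fixed-$chunk.tsv \
        -c 16 -o pangenome_out_ALL/output/$chunk \
        --clusters clu_gtdb_species_ALL/prot_clu-$chunk.tsv \
        --infer_singletons --rarefaction; then
    echo "########### failed or timed out for $chunk"
    rm -r pangenome_out_ALL/output/$chunk
fi
```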

Thanks in advance, Eric

axbazin commented 2 weeks ago

Hi,

This is definitely relevant, thank you for reporting! This looks a lot like the problem originally reported in #195... but you are hitting it more often since you have 40K pangenomes.

Do you know if it replicates? That is, if it hangs for a given species on one run, does it always hang, or does it sometimes work on another run with the same set of genomes? There is a lot of randomness involved at the rarefaction step, so the problem may not be easy to catch... but knowing that would tell us whether it is tied to a particular genome structure in the subsampling done during rarefaction. Re-running the same species a few times, as in the sketch below, would answer that.
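Something like this quick loop, reusing the command and paths from your script (the species name is a placeholder, and the 30-minute `timeout` is only there so a hung run does not block the test):

```bash
# Illustrative only: re-run the same species several times to see whether
# the hang is deterministic or depends on the random subsampling done
# during rarefaction.
chunk=SPECIES_THAT_HUNG   # placeholder: one species that hangs for you
for i in 1 2 3 4 5; do
    if timeout 30m ppanggolin panmodule \
            --anno list_gtdb_species_3/species_metadat_IMG_GTDB_fixed-$chunk.tsv \
            -c 16 -o replicate_test/run_$i \
            --clusters clu_gtdb_species_ALL/prot_clu-$chunk.tsv \
            --infer_singletons --rarefaction; then
        echo "run $i: finished"
    else
        echo "run $i: hung (timed out) or failed"
    fi
done
```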

I can't really think of a workaround for now... I'm not sure who yet, but someone will investigate and hopefully get back to you with a fix.

Since you work with a lot of different pangenomes (40k is quite a number!), it looks like you have been running into a number of edge cases lately! Thanks a lot for reporting them; it will help us make ppanggolin more robust in the long run.

Adelme

ericolo commented 2 weeks ago

So I tested two cases and it did replicate: both ended up hanging, and it is not a memory-usage problem... so I think there is indeed no workaround.

Here's a video of what happens for one of the cases: https://drive.google.com/file/d/1e2rdn2YMABzTt9v09LR_WzTxFtjoXhuu/view?usp=sharing

Here I only let it hang for a few seconds before canceling, but it hangs indefinitely if I don't cancel.

The log file of that particular run: debug_555.log

And here are the files for that run if you want to try: https://drive.google.com/drive/folders/1PAtZlw-nXfnKdwYPzc4NeVvER2l5C592?usp=sharing

In that folder you can find the clustering file: prot_clu-s__Eubacterium_R_faecale.tsv

This is the list of genomes: species_metadat_IMG_GTDB_fixed-s__Eubacterium_R_faecale.tsv

And the gff folder contains all 19 GFF files.

I also tried running this example without my own protein clustering file, and it worked! So maybe the problem is in my clustering file? (This worked for both of the cases I checked.)
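In case it's useful, a couple of quick sanity checks one could run on such a clustering file, assuming the two-column `representative<TAB>member` TSV layout that MMseqs2-style cluster files use (the layout is an assumption here, not something confirmed in this thread):

```bash
CLU=prot_clu-s__Eubacterium_R_faecale.tsv   # the file shared above

# Gene IDs assigned to more than one cluster (ideally prints nothing):
cut -f2 "$CLU" | sort | uniq -d | head

# Quick overview: number of clusters vs number of gene members.
echo "clusters: $(cut -f1 "$CLU" | sort -u | wc -l)"
echo "members:  $(wc -l < "$CLU")"
```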

Now I hope all of the runs that end up hanging have the same root cause; if not, it's going to be hard to debug them all...

Yes, 40k is a lot, but most of them don't have enough genomes (only about 10-15k have more than 15 genomes). Since I didn't want to pick the minimal number of genomes arbitrarily (I was thinking that maybe 5 genomes could be OK), I'm kind of benchmarking :) No problem, thanks for helping me and thanks for developing ppanggolin! :)

Let me know if you need anything else

Eric

axbazin commented 2 weeks ago

This is perfect, thanks for checking repeatability, and thanks a lot for the log and the data!

ericolo commented 1 week ago

I tried running all the hanging ones without my custom clustering file: for some it works, for others it still hangs... so it's not a strict rule. I can point out these cases later as well if you want. (It came down to two cases like this; I just removed one genome from the list of GFFs to work around it, and then it worked with my custom clustering.)