nanoporetech / megalodon

Megalodon is a research command-line tool to extract high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Megalodon analysis freezes because of computing resources? #213

Closed pterzian closed 2 years ago

pterzian commented 3 years ago

Hi,

I have large datasets I would like to call with megalodon. These are PromethION runs with between 300G and 2T of fast5 files. My system has only one GPU, but it manages to fully call the smaller runs (around 300-400G).

However, when running megalodon on larger runs, it gets stuck without any warning or error message. The jobs mostly appear idle in the monitoring, and megalodon is basically stuck on read processing:

Read Processing: 100%|█████████▉| 3833144/3833210 [23:20:46<00:01, 45.61reads/s, samples/s=4.86e+6]

It can keep those jobs active for weeks without finishing the megalodon analysis. My only solution is to kill all my jobs so I can restart the whole analysis. I am pretty sure it is a computing resource issue; I have also faced it on another system.

I am using the default command:

megalodon /home/pterzian/data/sample/fast5/ \
    --outputs basecalls mods \
    --guppy-params "-d /home/pterzian/rerio/basecall_models/" \
    --guppy-config res_dna_r941_prom_modbases_5mC_CpG_v001.cfg \
    --reference reference.fa \
    --mod-motif m CG 0 --devices 0 --processes 6 \
    --verbose-read-progress 3 \
    --guppy-server-path /home/pterzian/ont-guppy/bin/guppy_basecall_server \
    --output-directory megalodon_output/ \
    --overwrite

I am not sure which computing option would be good to use in my case; maybe someone can give me a hint? I thought of decreasing the --reads-per-guppy-batch option, but it keeps telling me: megalodon: error: unrecognized arguments: --reads-per-guppy-batch 50

thanks a lot,

Paul

marcus1487 commented 3 years ago

It could be that megalodon is stuck writing output queues, but weeks certainly seems like a larger issue. I've tried to clean up output processes regarding issues such as this, but there may be lingering issues. The --reads-per-guppy-batch option was replaced by --guppy-concurrent-reads in a recent release, but I doubt this will help resolve this issue. I suspect that this might be an issue with a very large mods database file (sqlite) or bam file (pysam) that is having trouble flushing results to disk after a run is complete. It is quite difficult to diagnose these issues though.

I would suggest breaking runs down into smaller jobs and merging the relevant outputs to avoid these issues.
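As a rough illustration of the splitting step suggested above, the sketch below groups the fast5 files of a run into fixed-size batch directories made of symlinks. `split_fast5` is a hypothetical helper, not part of megalodon, and the `batch_NNNN` naming is an assumption. It also assumes file names are unique across subdirectories and contain no newlines.

```shell
# Sketch only: split_fast5 is a hypothetical helper, not a megalodon command.
# Usage: split_fast5 <fast5_dir> <dest_dir> <files_per_batch>
# Creates <dest_dir>/batch_0000, batch_0001, ... filled with symlinks to the
# original fast5 files, and prints the number of files that were linked.
split_fast5() {
    src=$1
    dest=$2
    per_batch=$3
    count=0
    mkdir -p "$dest"
    # Write a sorted file list first so the batches are reproducible.
    list=$(mktemp)
    find "$src" -type f -name '*.fast5' | sort > "$list"
    while IFS= read -r f; do
        # Integer division maps file number -> batch number.
        batch_dir=$(printf '%s/batch_%04d' "$dest" $((count / per_batch)))
        mkdir -p "$batch_dir"
        # Link to the absolute path so the symlink resolves from anywhere.
        ln -s "$(cd "$(dirname "$f")" && pwd)/$(basename "$f")" "$batch_dir/"
        count=$((count + 1))
    done < "$list"
    rm -f "$list"
    echo "$count"
}
```

Each batch directory could then be passed to megalodon in place of the original fast5 path, with a separate --output-directory per batch, before merging the per-batch outputs. Symlinks are used here only to avoid copying terabytes of data; moving or copying the files would work just as well.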

pterzian commented 3 years ago

Hi,

So I tried splitting the runs into smaller ones, but it didn't work any better. I am no longer sure that my issue comes from the resources available on my system: I actually succeeded in running bigger runs than the ones that recently failed. However, I found that some reads "timed out" in the guppy logs, so you might be right about megalodon getting stuck writing output queues. I ran megalodon twice on the same pool of fast5 files and it got stuck with the same number of reads remaining. Interestingly, this number matches the number of threads I gave on megalodon's command line: 10.

Read Processing: 100%|█████████▉| 1999990/2000000 [7:29:20<00:00, 74.18reads/s, samples/s=7.73e+6]

Looking at the guppy logs, first try:

2021-11-18 13:03:17.270330 [guppy/info] Client 6 anonymous_client_6 id: 94e92fa6-4628-4081-80fd-e355dbf8e5c1 has timed out.
2021-11-18 13:03:17.270444 [guppy/info] Client 6 anonymous_client_6 id: 94e92fa6-4628-4081-80fd-e355dbf8e5c1 has disconnected.
2021-11-18 18:34:29.251396 [guppy/info] Client 2 anonymous_client_2 id: 2576ce39-36f3-41af-bc33-ff0068bbaea4 has timed out.
2021-11-18 18:34:29.251540 [guppy/info] Client 2 anonymous_client_2 id: 2576ce39-36f3-41af-bc33-ff0068bbaea4 has disconnected.
2021-11-18 18:34:29.251578 [guppy/info] Client 3 anonymous_client_3 id: 3eae8647-3232-4df6-9df8-c75ac11ca5d4 has timed out.
2021-11-18 18:34:29.251605 [guppy/info] Client 3 anonymous_client_3 id: 3eae8647-3232-4df6-9df8-c75ac11ca5d4 has disconnected.
2021-11-18 18:34:29.251623 [guppy/info] Client 4 anonymous_client_4 id: 6d5a0016-db06-4c9b-9432-cbcdcf5ee4c9 has timed out.
2021-11-18 18:34:29.251644 [guppy/info] Client 4 anonymous_client_4 id: 6d5a0016-db06-4c9b-9432-cbcdcf5ee4c9 has disconnected.
2021-11-18 18:34:29.251664 [guppy/info] Client 8 anonymous_client_8 id: ba586b50-c639-437d-a680-4daccc79fa77 has timed out.
2021-11-18 18:34:29.251685 [guppy/info] Client 8 anonymous_client_8 id: ba586b50-c639-437d-a680-4daccc79fa77 has disconnected.
2021-11-18 18:34:29.251703 [guppy/info] Client 10 anonymous_client_10 id: bfb1b0e7-157a-4211-9d97-180546adbcaf has timed out.
2021-11-18 18:34:29.251725 [guppy/info] Client 10 anonymous_client_10 id: bfb1b0e7-157a-4211-9d97-180546adbcaf has disconnected.
2021-11-18 18:34:29.251742 [guppy/info] Client 11 anonymous_client_11 id: 8ae72a7c-6095-47f8-b8db-626e1d4abd68 has timed out.
2021-11-18 18:34:29.251772 [guppy/info] Client 11 anonymous_client_11 id: 8ae72a7c-6095-47f8-b8db-626e1d4abd68 has disconnected.
2021-11-18 18:34:30.251893 [guppy/info] Client 5 anonymous_client_5 id: 91cdc1c5-5cb2-48d3-968c-6de149a33b6e has timed out.
2021-11-18 18:34:30.251969 [guppy/info] Client 5 anonymous_client_5 id: 91cdc1c5-5cb2-48d3-968c-6de149a33b6e has disconnected.
2021-11-18 18:34:30.251996 [guppy/info] Client 7 anonymous_client_7 id: 340fde6a-3c52-4b93-b64c-6a18834031a5 has timed out.
2021-11-18 18:34:30.252019 [guppy/info] Client 7 anonymous_client_7 id: 340fde6a-3c52-4b93-b64c-6a18834031a5 has disconnected.
2021-11-18 18:34:30.252077 [guppy/info] Client 9 anonymous_client_9 id: b9a70478-446f-46dc-9663-fefc3ed6449b has timed out.
2021-11-18 18:34:30.252100 [guppy/info] Client 9 anonymous_client_9 id: b9a70478-446f-46dc-9663-fefc3ed6449b has disconnected.

Second try:

2021-11-19 01:23:52.287070 [guppy/info] Client 3 anonymous_client_3 id: fbc289e4-6947-49c2-94f9-6acdb78b26bf has timed out.
2021-11-19 01:23:52.287747 [guppy/info] Client 3 anonymous_client_3 id: fbc289e4-6947-49c2-94f9-6acdb78b26bf has disconnected.
2021-11-19 06:38:52.276896 [guppy/info] Client 2 anonymous_client_2 id: 731a1a50-e193-4add-871a-bda7adbc7a76 has timed out.
2021-11-19 06:38:52.277717 [guppy/info] Client 2 anonymous_client_2 id: 731a1a50-e193-4add-871a-bda7adbc7a76 has disconnected.
2021-11-19 06:38:52.277753 [guppy/info] Client 6 anonymous_client_6 id: 0f6be4e9-2ed3-4cb0-bdda-6a5a3301e68c has timed out.
2021-11-19 06:38:52.277780 [guppy/info] Client 6 anonymous_client_6 id: 0f6be4e9-2ed3-4cb0-bdda-6a5a3301e68c has disconnected.
2021-11-19 06:38:52.277800 [guppy/info] Client 7 anonymous_client_7 id: 96e64550-5573-4cdb-9d13-2640692a9189 has timed out.
2021-11-19 06:38:52.277823 [guppy/info] Client 7 anonymous_client_7 id: 96e64550-5573-4cdb-9d13-2640692a9189 has disconnected.
2021-11-19 06:38:52.277847 [guppy/info] Client 8 anonymous_client_8 id: dc567293-7b58-4bb4-a59c-12da5f3c58fc has timed out.
2021-11-19 06:38:52.277869 [guppy/info] Client 8 anonymous_client_8 id: dc567293-7b58-4bb4-a59c-12da5f3c58fc has disconnected.
2021-11-19 06:38:52.277888 [guppy/info] Client 9 anonymous_client_9 id: 3510ba09-ff04-41d5-9da1-6f13daccd30d has timed out.
2021-11-19 06:38:52.277909 [guppy/info] Client 9 anonymous_client_9 id: 3510ba09-ff04-41d5-9da1-6f13daccd30d has disconnected.
2021-11-19 06:38:52.277928 [guppy/info] Client 10 anonymous_client_10 id: eefe8a15-6f73-4bf3-b292-8af827232b91 has timed out.
2021-11-19 06:38:52.277952 [guppy/info] Client 10 anonymous_client_10 id: eefe8a15-6f73-4bf3-b292-8af827232b91 has disconnected.
2021-11-19 06:38:52.277971 [guppy/info] Client 11 anonymous_client_11 id: 90460a89-37ed-46b8-8b70-4fd0caf22743 has timed out.
2021-11-19 06:38:52.277992 [guppy/info] Client 11 anonymous_client_11 id: 90460a89-37ed-46b8-8b70-4fd0caf22743 has disconnected.
2021-11-19 06:38:53.278264 [guppy/info] Client 4 anonymous_client_4 id: e6203ac8-05dc-4450-9eb6-b6e67486ae79 has timed out.
2021-11-19 06:38:53.278343 [guppy/info] Client 4 anonymous_client_4 id: e6203ac8-05dc-4450-9eb6-b6e67486ae79 has disconnected.
2021-11-19 06:38:53.278396 [guppy/info] Client 5 anonymous_client_5 id: 0d55588a-1236-42d4-b77e-15c8bde76cc2 has timed out.
2021-11-19 06:38:53.278421 [guppy/info] Client 5 anonymous_client_5 id: 0d55588a-1236-42d4-b77e-15c8bde76cc2 has disconnected.

I don't think it comes from the reads themselves, because the ones that come out of the first test are not the same as those in the second test.

I can also see this warning/info (?) in both guppy log files:
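The comparison between the timeouts and the reads left over can be checked with a couple of small shell helpers. This is only a sketch: `count_timed_out_clients` and `remaining_reads` are hypothetical names, and both assume the exact line formats shown in the guppy log excerpts and the progress bar above.

```shell
# Sketch only: both helpers are hypothetical and tied to the formats above.

# Count distinct client ids that appear in "has timed out." lines of a log.
count_timed_out_clients() {
    sed -n 's/.* id: \([0-9a-f-]*\) has timed out\.$/\1/p' "$1" \
        | sort -u | wc -l | tr -d ' '
}

# Given a progress line containing "<done>/<total>", print total - done.
remaining_reads() {
    printf '%s\n' "$1" \
        | grep -oE '[0-9]+/[0-9]+' \
        | head -n 1 \
        | awk -F/ '{ print $2 - $1 }'
}
```

Running `count_timed_out_clients` on a guppy log and `remaining_reads` on the last progress line gives two numbers that can be compared directly. If they match across runs, that is consistent with each timed-out client leaving one read unfinished, although it does not prove the cause.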

2021-11-18 10:52:21.777578 [guppy/info] crashpad_handler not supported on this platform.

I am not sure any of these are the source of my issues, because I already saw similar info in the logs of runs I succeeded in calling with megalodon, but maybe it can lead to something...

One thing that has changed between the short time when I was able to use megalodon and now is that I updated ont-pyguppy-client-lib. In my previous (successful) analyses I had this warning:

******************** WARNING: Guppy and pyguppy point versions do not match. This could lead to a failure. Install matching pyguppy version via `pip install ont-pyguppy-client-lib==5.0.11`. ********************

Weirdly, the analyses that showed this warning also had the "timed out" messages above but still managed to complete the methylation calling. I am not sure there is a connection, but the issues I am facing right now seem to have appeared right after I upgraded ont-pyguppy-client-lib.

Thanks for the help, Paul

pterzian commented 2 years ago

Closing this issue because it is most probably a system issue.