pinellolab / dictys

Context specific and dynamic gene regulatory network reconstruction and analysis
GNU Affero General Public License v3.0
ERROR when running 2.1 GPU part, execution #56

Closed siduanmiao closed 2 months ago

siduanmiao commented 2 months ago

Describe the error: When I run make -f makefiles/static.mk -j 2 -k gpu || true, I get this error:

$ make -f makefiles/static.mk -j 2 -k gpu || true
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/binlinking.tsv.gz tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_loss.tsv.gz tmp_static/Subset1/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset10/expression.tsv.gz tmp_static/Subset10/binlinking.tsv.gz tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_loss.tsv.gz tmp_static/Subset10/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset1/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset11/expression.tsv.gz tmp_static/Subset11/binlinking.tsv.gz tmp_static/Subset11/net_weight.tsv.gz tmp_static/Subset11/net_meanvar.tsv.gz tmp_static/Subset11/net_covfactor.tsv.gz tmp_static/Subset11/net_loss.tsv.gz tmp_static/Subset11/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset11/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset12/expression.tsv.gz tmp_static/Subset12/binlinking.tsv.gz tmp_static/Subset12/net_weight.tsv.gz tmp_static/Subset12/net_meanvar.tsv.gz tmp_static/Subset12/net_covfactor.tsv.gz tmp_static/Subset12/net_loss.tsv.gz tmp_static/Subset12/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset12/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset13/expression.tsv.gz tmp_static/Subset13/binlinking.tsv.gz tmp_static/Subset13/net_weight.tsv.gz tmp_static/Subset13/net_meanvar.tsv.gz tmp_static/Subset13/net_covfactor.tsv.gz tmp_static/Subset13/net_loss.tsv.gz tmp_static/Subset13/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset13/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset14/expression.tsv.gz tmp_static/Subset14/binlinking.tsv.gz tmp_static/Subset14/net_weight.tsv.gz tmp_static/Subset14/net_meanvar.tsv.gz tmp_static/Subset14/net_covfactor.tsv.gz tmp_static/Subset14/net_loss.tsv.gz tmp_static/Subset14/net_stats.tsv.gz

siduanmiao commented 2 months ago

I think the cause might be that the GPU memory was exceeded, which made the process crash.

siduanmiao commented 2 months ago

The GPU memory is now >20 GB, but I still get this error:

python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset1/net_weight.tsv.gz] Error 134
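
Editorial note: the "Error 134" that make reports can be read with the usual shell convention that a status above 128 means the command was killed by signal (status - 128). The sketch below applies that convention; the helper name `describe_make_error` is made up for illustration, and a program could in principle exit with status 134 on its own, so treat the result as a strong hint rather than proof.

```python
import signal


def describe_make_error(status: int) -> str:
    """Interpret the N in make's "Error N" using the 128 + signal convention."""
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {status}"


print(describe_make_error(134))
```

On Linux this reports SIGABRT for 134, which matches the "Aborted (core dumped)" line and the failed assertion in the log rather than an ordinary Python exception.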

lingfeiwang commented 2 months ago

This appears to be a bug in one of the dependencies, which could be a Python package or something else, and it is not something we can change.

A Google search turns up the following: https://stackoverflow.com/questions/21825291/threading-issues and https://forum.parallels.com/threads/linux-client-crashes-on-reconnect.353975/ . If the problem does not occur every time, you can rerun make -f makefiles/static.mk -j 2 -k gpu || true until the error no longer occurs before moving on to the next step. You can also try the -j 1 flag instead of -j 2 to run the jobs serially.
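
Editorial note: the "rerun until it passes" suggestion can be automated. This is a minimal sketch using only the Python standard library; `run_until_success` and its `max_attempts` parameter are hypothetical names, not part of dictys. It assumes a failed make target leaves no output file behind, so each rerun only redoes the Subsets that failed.

```python
import subprocess
import sys


def run_until_success(cmd, max_attempts=5):
    """Run cmd repeatedly; return the attempt number that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with return code {result.returncode}",
              file=sys.stderr)
    return None


# Example call with the command from this thread. "|| true" is left out on
# purpose so that a failing make is visible as a nonzero return code:
# run_until_success(["make", "-f", "makefiles/static.mk", "-j", "1", "-k", "gpu"])
```

The cap on attempts keeps a crash that happens every time from looping forever.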

Please let me know if that helps.

Best, Lingfei

siduanmiao commented 2 months ago

Thank you for your advice. -j 1 did not help much, though: some Subsets run successfully and others fail. The environment I used is the public environment set up by the administrator, so I will create a new environment of my own for this step to rule out dependency errors. If I solve the problem, I will share the solution with you. Thank you for your help and excellent work!

siduanmiao commented 2 months ago

Hi, I tried the new environment and it still did not resolve the problem. So I tried the following command: !cd ..; dictys_helper network_inference.sh -j 64 -J 1 static. It did not raise any error, but it has been running Subset12 for 12 hours. I checked the network_inference.sh code and found that the GPU number is 1. This is very strange, because Subset1, Subset10, and Subset11 finished very quickly.

These are the last four lines of my standard output:

OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/binlinking.tsv.gz tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_loss.tsv.gz tmp_static/Subset1/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset10/expression.tsv.gz tmp_static/Subset10/binlinking.tsv.gz tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_loss.tsv.gz tmp_static/Subset10/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset11/expression.tsv.gz tmp_static/Subset11/binlinking.tsv.gz tmp_static/Subset11/net_weight.tsv.gz tmp_static/Subset11/net_meanvar.tsv.gz tmp_static/Subset11/net_covfactor.tsv.gz tmp_static/Subset11/net_loss.tsv.gz tmp_static/Subset11/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset12/expression.tsv.gz tmp_static/Subset12/binlinking.tsv.gz tmp_static/Subset12/net_weight.tsv.gz tmp_static/Subset12/net_meanvar.tsv.gz tmp_static/Subset12/net_covfactor.tsv.gz tmp_static/Subset12/net_loss.tsv.gz tmp_static/Subset12/net_stats.tsv.gz

My guess is that Subset1, 10, and 12 stopped abnormally without printing anything to standard error. Or is Subset12 just slow? How long should each Subset normally take? Thank you for your help.
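
Editorial note: one way to tell a silent failure from a slow job is to check which Subsets already have their result files. The sketch below assumes the last five paths of the reconstruct command in the log (net_weight, net_meanvar, net_covfactor, net_loss, net_stats) are the outputs; `missing_outputs` is a hypothetical helper, and a file that exists may still be partially written, so this is only a necessary check.

```python
from pathlib import Path

# Assumed output files, taken from the command lines quoted in this thread.
OUTPUTS = (
    "net_weight.tsv.gz",
    "net_meanvar.tsv.gz",
    "net_covfactor.tsv.gz",
    "net_loss.tsv.gz",
    "net_stats.tsv.gz",
)


def missing_outputs(root):
    """Map each Subset* directory under root to its absent or empty output files."""
    report = {}
    for subset in sorted(Path(root).glob("Subset*")):
        if not subset.is_dir():
            continue
        missing = [
            name for name in OUTPUTS
            if not (subset / name).is_file() or (subset / name).stat().st_size == 0
        ]
        if missing:
            report[subset.name] = missing
    return report


# Example: print(missing_outputs("tmp_static"))
```

A Subset that appears in the report while no python3 process is running for it most likely stopped without writing its results.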

siduanmiao commented 2 months ago

I have solved this problem. There must be some error in my dependencies, because I cannot use GPUs 1, 2, 3, and so on. When I change the device to cuda:0, it works properly, and the results are better than before. Thank you again, and I hope to cite your work in my upcoming paper!
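
Editorial note: to find out which devices are usable before a long run, each one can be probed in a separate Python process. A separate process matters because an abort like the one in this thread kills the interpreter and cannot be caught with try/except. This is a sketch assuming PyTorch is installed in the environment; `run_probe` and `PROBE` are hypothetical names, not part of dictys.

```python
import signal
import subprocess
import sys

# Assumption: PyTorch is importable. The probe allocates a small tensor on the
# device named in sys.argv[1] and reads a value back to force real GPU work.
PROBE = "import sys, torch; x = torch.ones(8, device=sys.argv[1]); print(float(x.sum()))"


def run_probe(device, snippet=PROBE):
    """Run snippet in a fresh interpreter and summarize how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet, device],
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # subprocess reports death by signal N as return code -N.
        try:
            return f"killed by {signal.Signals(-proc.returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-proc.returncode}"
    return f"exited with status {proc.returncode}"


# Example, assuming four GPUs (the thread mentions cuda:0 through cuda:3):
# for i in range(4):
#     print(f"cuda:{i}", run_probe(f"cuda:{i}"))
```

A device that reports anything other than "ok" here is unlikely to work with --device in the reconstruct step.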

lingfeiwang commented 2 months ago

Great to know it works! Thanks for letting us know, siduanmiao!