Closed siduanmiao closed 6 months ago
I think the cause might be that the GPU memory was exceeded, which made the process fail.
The GPU now has more than 20 GB of memory available, but I still get this error:
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset1/net_weight.tsv.gz] Error 134
This appears to be a bug in one of the dependencies (a Python package or something else), which we cannot change.
A Google search gives the following: https://stackoverflow.com/questions/21825291/threading-issues and https://forum.parallels.com/threads/linux-client-crashes-on-reconnect.353975/ . If the problem does not appear every time, you can run
make -f makefiles/static.mk -j 2 -k gpu || true
multiple times until the error no longer occurs, before moving on to the next step. You can also try the -j 1 flag instead of -j 2 to run the jobs serially.
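The "run it multiple times" suggestion can be automated. Below is a minimal sketch, not part of dictys: the helper name run_until_success and the attempt limit are made up for illustration. It relies only on make rebuilding targets that are missing or out of date, so each pass redoes just the subsets that failed. The trailing || true is dropped here so that the exit status of make stays visible to the loop.

```python
import subprocess
import sys


def run_until_success(cmd, max_attempts=5):
    """Re-run cmd until it exits with status 0 or the attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    (Hypothetical helper for illustration; not a dictys function.)
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with status {result.returncode}",
              file=sys.stderr)
    return None


# Example call, to be run from the directory that contains makefiles/:
# run_until_success(["make", "-f", "makefiles/static.mk", "-j", "2", "-k", "gpu"])
```

If the function returns None, the failure is probably not intermittent and retrying will not help.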
Please let me know if that helps.
Best, Lingfei
Thank you for your advice. -j 1 did not help much, though: some subsets ran successfully and others failed. The environment I used is the public environment set up by the administrator, so I have decided to create a new environment for this step to rule out dependency errors. If I solve this problem I will share the solution with you. Thank you for your help and excellent work!
Hi, I tried the new environment and it still did not resolve the problem. So I tried the following command: !cd ..; dictys_helper network_inference.sh -j 64 -J 1 static. It does not raise any error, but it has been running Subset12 for 12 hours. I checked the network_inference.sh code and found that the GPU number is 1. This is very strange, because Subset1, Subset10, and Subset11 finished very quickly.
These are the last four lines of my standard output:
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/binlinking.tsv.gz tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_loss.tsv.gz tmp_static/Subset1/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset10/expression.tsv.gz tmp_static/Subset10/binlinking.tsv.gz tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_loss.tsv.gz tmp_static/Subset10/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset11/expression.tsv.gz tmp_static/Subset11/binlinking.tsv.gz tmp_static/Subset11/net_weight.tsv.gz tmp_static/Subset11/net_meanvar.tsv.gz tmp_static/Subset11/net_covfactor.tsv.gz tmp_static/Subset11/net_loss.tsv.gz tmp_static/Subset11/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:1 --nth 4 tmp_static/Subset12/expression.tsv.gz tmp_static/Subset12/binlinking.tsv.gz tmp_static/Subset12/net_weight.tsv.gz tmp_static/Subset12/net_meanvar.tsv.gz tmp_static/Subset12/net_covfactor.tsv.gz tmp_static/Subset12/net_loss.tsv.gz tmp_static/Subset12/net_stats.tsv.gz
I guess Subset1, 10, 12 stopped abnormally without writing anything to standard error? Or is Subset12 just slow? How long should each subset take? Thank you for your help.
I have solved this problem. There must be some error in my dependencies that prevents me from using GPUs 1, 2, 3, and so on. When I changed the device to cuda:0, it worked properly, and the result is better than before. Thank you again, and I hope to cite your work in my next paper!
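For anyone hitting the same symptom, one way to find out which devices are usable before starting a long run is to probe each one in a separate process, so that an abort like the one above only kills the probe. This is a rough sketch under assumptions: PyTorch is installed in the same environment as dictys, and the names PROBE and probe_devices are made up for illustration rather than taken from dictys.

```python
import subprocess
import sys

# Tiny workload run in a child interpreter; assumes PyTorch is importable there.
PROBE = "import sys, torch; torch.ones(8, device=sys.argv[1]).sum().item()"


def probe_devices(devices, probe_code=PROBE, timeout=120):
    """Return {device: bool} saying whether the probe exited cleanly on each device.

    Each probe runs in its own process, so a crash or hang on one device
    does not take down the caller. (Hypothetical helper for illustration.)
    """
    status = {}
    for device in devices:
        try:
            result = subprocess.run(
                [sys.executable, "-c", probe_code, device],
                capture_output=True,
                timeout=timeout,
            )
            status[device] = result.returncode == 0
        except subprocess.TimeoutExpired:
            # A probe that never returns is treated as a failure.
            status[device] = False
    return status


# Example call:
# print(probe_devices(["cuda:0", "cuda:1", "cuda:2", "cuda:3"]))
```

A device that reports False here is worth avoiding in the --device argument until the environment is fixed. A True result only shows that a trivial tensor operation succeeded, not that the full reconstruction will.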
Great to know it works! Thanks for letting us know, siduanmiao!
Checks before submitting the issue
Describe the error
When I run
make -f makefiles/static.mk -j 2 -k gpu || true
I get the error:
$ make -f makefiles/static.mk -j 2 -k gpu || true
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/binlinking.tsv.gz tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_loss.tsv.gz tmp_static/Subset1/net_stats.tsv.gz
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset10/expression.tsv.gz tmp_static/Subset10/binlinking.tsv.gz tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_loss.tsv.gz tmp_static/Subset10/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset1/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset11/expression.tsv.gz tmp_static/Subset11/binlinking.tsv.gz tmp_static/Subset11/net_weight.tsv.gz tmp_static/Subset11/net_meanvar.tsv.gz tmp_static/Subset11/net_covfactor.tsv.gz tmp_static/Subset11/net_loss.tsv.gz tmp_static/Subset11/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset11/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset12/expression.tsv.gz tmp_static/Subset12/binlinking.tsv.gz tmp_static/Subset12/net_weight.tsv.gz tmp_static/Subset12/net_meanvar.tsv.gz tmp_static/Subset12/net_covfactor.tsv.gz tmp_static/Subset12/net_loss.tsv.gz tmp_static/Subset12/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset12/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset13/expression.tsv.gz tmp_static/Subset13/binlinking.tsv.gz tmp_static/Subset13/net_weight.tsv.gz tmp_static/Subset13/net_meanvar.tsv.gz tmp_static/Subset13/net_covfactor.tsv.gz tmp_static/Subset13/net_loss.tsv.gz tmp_static/Subset13/net_stats.tsv.gz
python3: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
make: *** [makefiles/common.mk:172: tmp_static/Subset13/net_weight.tsv.gz] Error 134
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys network reconstruct --device cuda:3 --nth 4 tmp_static/Subset14/expression.tsv.gz tmp_static/Subset14/binlinking.tsv.gz tmp_static/Subset14/net_weight.tsv.gz tmp_static/Subset14/net_meanvar.tsv.gz tmp_static/Subset14/net_covfactor.tsv.gz tmp_static/Subset14/net_loss.tsv.gz tmp_static/Subset14/net_stats.tsv.gz