euhruska closed this issue 6 years ago
Can you attach the entire log please? The traceback alone isn't always helpful.
happened again, fatal. Sometimes restarting helps, but this issue sometimes persists for days.
Thanks Eugen, looking into this now. Can you tell me how many concurrent tasks you have and how long each task runs for?
Each task runs for 24h, with about 5-8 concurrent tasks.
Hmmm, that's odd. I see about 100 tasks that are submitted for concurrent execution from the log. Can you point me to your script and the specific parameters that you use, please?
Wait, I meant 100 tasks, but 5-8 concurrent, independent extasy runs.
the md step is the most concurrent step: https://github.com/ClementiGroup/extasy-koopman/blob/master/extasy_tica3.py#L126
I think I understand why this is happening. Can you set an environment variable ENTK_HB_INTERVAL=90? That's a short term fix. Long term fix is https://github.com/radical-cybertools/radical.entk/issues/270. I'll try to get to the ticket over the weekend. I'll keep you posted.
Basically, the verbose printing to stdout/stderr, task creation, and task submission are interfering with a timeout. The more you print, or the more tasks you have (>=4K tasks, which is not the case in your trials), the longer it takes. Additionally, you might want to consider trying a single individual run to keep the load on your machine low.
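For reference, one way to apply the short-term fix without touching the shell profile is to launch the run script from a small Python wrapper that sets ENTK_HB_INTERVAL only for the child process. This is a minimal sketch, not part of EnTK: the helper names env_with_hb_interval and launch are hypothetical, and the child command below is a stand-in that just echoes the variable. It assumes EnTK reads ENTK_HB_INTERVAL from the process environment at startup, as the suggestion above implies.

```python
import os
import subprocess
import sys


def env_with_hb_interval(seconds=90, base_env=None):
    """Return a copy of the environment with ENTK_HB_INTERVAL set.

    Hypothetical helper; copying keeps the parent's environment untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ENTK_HB_INTERVAL"] = str(seconds)
    return env


def launch(cmd, seconds=90):
    """Run cmd as a child process that sees the adjusted heartbeat interval."""
    return subprocess.run(
        cmd,
        env=env_with_hb_interval(seconds),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the real run script: it only prints the variable so the
    # wrapper can be checked in isolation.
    check = [sys.executable, "-c",
             "import os; print(os.environ['ENTK_HB_INTERVAL'])"]
    print(launch(check).stdout.strip())
```

In practice the stand-in command would be replaced by the actual invocation of the run script. Setting the variable per process rather than globally means several independent runs on the same machine can use different intervals.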
thank you, sounds good
Hey Eugen, did the HB_INTERVAL setup help? The long term fix would take more time than I had initially estimated, unfortunately. How are your experiments going?
Haven't seen this error since. Sometimes I have to restart everything to fix the improper termination issue, but currently my main issues are https://github.com/radical-collaboration/extasy-grlsd/issues/98 (fatal) and https://github.com/radical-collaboration/extasy-grlsd/issues/95 (nice to have). Besides that, I'm running more iterations to reach convergence.
Great, glad to hear that. I have responded to #98 and Andre will probably ping back with suggestions on #95 at the end of the week.
From the same rabbitmq I launch several entk runs; some work, but some fail with connection issues, as below. My question is: how can another entk run on the same computer, with the same rabbitmq, work while this one fails?