Some nextflow processes died

molinfzlvvv commented 1 month ago

Hi, I have a few problems and hope to get your help.

My command is:

./toga.py /home/TOGAInput/query/hg38.H.g.final.chain /home/TOGAInput/human_hg38/toga.transcripts.bed /home/TOGA/hg38.2bit /home/TOGA/query/H.g.2bit --kt --pn /opt/synData2/Hg -i /home/TOGAInput/human_hg38/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,100 --cjn 300 --u12 /home/TOGAInput/human_hg38/toga.U12introns.tsv --ms -q

During the CESAR jobs step, the following error occurred:

Compiling C code...
Model found
CESAR installation found
Traceback (most recent call last):
  File "/home/TOGA/./toga.py", line 1600, in <module>
    main()
  File "/home/TOGA/./toga.py", line 1596, in main
    toga_manager.run()
  File "/home/TOGA/./toga.py", line 530, in run
    self.check_cesar_completeness()
  File "/home/TOGA/./toga.py", line 1088, in check_cesar_completeness
    monitor_jobs(jobs_managers, die_if_sc_1=True)
  File "/home/TOGA/modules/parallel_jobs_manager_helpers.py", line 36, in monitor_jobs
    raise AssertionError(err)
AssertionError: Error! Some para/nextflow processes died!

The relevant section of the log file is as follows:

Checking whether all CESAR results are complete
1 CESAR jobs crashed, trying to run again...
!!RERUN CESAR JOBS: Pushing 1 jobs into None GB queue
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /home/TOGA/execute_joblist.nf --joblist /opt/synData2/Hg/_cesar_rerun_batch_None -c /opt/synData2/Hg/temp/cesar_config_16_queue.nf
Monitoring CESAR jobs rerun
## Stated polling cluster jobs until they done
Polling iteration 0; already waiting 0 seconds.
Polling iteration 1; already waiting 60 seconds.
Polling iteration 2; already waiting 120 seconds.
Polling iteration 3; already waiting 180 seconds.
.......
Polling iteration 48; already waiting 2880 seconds.
Polling iteration 49; already waiting 2940 seconds.
### CESAR jobs done ###

This error occurs frequently. Sometimes running the same command a second time works, but each run takes a long time. Do you have any suggestions for addressing this issue?

Best regards!

kirilenkobm commented 1 month ago

Hi! I am sorry about that; it looks like I implemented quite an aggressive strategy here. After TOGA tries to execute its CESAR jobs, it collects those that crashed (which may happen for a variety of reasons) and pushes them again. If any job dies, the TOGA process dies as well. This will be disabled in the next commit (in a couple of minutes).
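The rerun behaviour described above can be sketched roughly as follows. This is an illustration only, not TOGA's actual code: `run_with_rerun`, `run_job`, and `die_on_rerun_failure` are hypothetical names, and the real logic lives in `check_cesar_completeness` and `monitor_jobs`.

```python
from typing import Callable, List, Tuple


def run_with_rerun(
    jobs: List[str],
    run_job: Callable[[str], bool],
    die_on_rerun_failure: bool = False,
) -> Tuple[List[str], List[str]]:
    """Run every job once, rerun the crashed ones, and report the outcome.

    Returns (succeeded, still_failed). All names here are hypothetical.
    """
    succeeded: List[str] = []
    crashed: List[str] = []

    # First pass: run each job and collect those whose runner reports failure.
    for job in jobs:
        (succeeded if run_job(job) else crashed).append(job)

    # Second pass: push only the crashed jobs again.
    still_failed: List[str] = []
    for job in crashed:
        (succeeded if run_job(job) else still_failed).append(job)

    # The aggressive variant aborts the whole run if anything failed twice;
    # the tolerant variant just returns the list so the caller can report it.
    if still_failed and die_on_rerun_failure:
        raise AssertionError("Error! Some para/nextflow processes died!")
    return succeeded, still_failed
```

With `die_on_rerun_failure=False`, a job that fails twice is returned in the second list instead of ending the whole run.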

molinfzlvvv commented 1 month ago

Hi! Thanks for your response. Could you help look at this problem again? It has bothered me for a long time: https://github.com/hillerlab/TOGA/issues/140#issuecomment-2126045291 My task has been running for ten days and is still at the step called "### STEP 7: Execute CESAR jobs: parallel step". Oddly, it seems to be working fine, because the log file keeps growing.

In fact, I requested a node with 40 CPUs and changed the nextflow setting to process.cpus = 40 // SLURM config file for CESAR jobs, but it appears to use only 2 CPUs. I don't know why it is not using all the resources. Is that why it is so slow?
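One possible explanation, offered as an assumption rather than a diagnosis: when Nextflow runs tasks on a single machine with its local executor, `process.cpus` is the number of CPUs reserved for each task, not the number of tasks to run in parallel. If that applies here, asking for 40 CPUs per task on a 40-CPU node would leave room for only one task at a time. The arithmetic can be sketched as below; the function name is hypothetical and the model ignores memory limits.

```python
def max_concurrent_tasks(node_cpus: int, cpus_per_task: int) -> int:
    """Upper bound on tasks that fit on one node when each reserves cpus_per_task CPUs."""
    if node_cpus < 1 or cpus_per_task < 1:
        raise ValueError("CPU counts must be positive")
    # Integer division: a task only starts if its full CPU request is free.
    return node_cpus // cpus_per_task
```

Under this model, `max_concurrent_tasks(40, 40)` is 1 and `max_concurrent_tasks(40, 1)` is 40, so a smaller per-task request would allow more jobs to run side by side.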

If you can suggest any commands to speed up the process, I would really appreciate it.

molinfzlvvv commented 3 weeks ago

Hi! @kirilenkobm

I am very sorry to bother you so many times, but so far I have not completed a single successful run. I tried many things, but I could not submit the jobs to the SLURM system; it kept reporting errors. So now I am running TOGA on a master node with 40 cores, with the jobs divided into two buckets based on memory (--cb 10,100). I expected the CESAR step to use all the CPUs, but it uses only two. It works normally but is far too slow: it seems to run one CESAR job at a time before moving on to the next, and it has been running for over a week. I would very much appreciate any suggestions.

In addition, I noticed that the CESAR jobs in the 10 and 100 buckets were not run in the order of the commands in the joblist, because the output did not match the order in the cesar_joblist_queue_10.txt file. What is the reason for this? If the jobs ran in order, I could tell how far the run had progressed and how much longer it would take.
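Whatever the cause of the ordering, progress can be estimated from the number of finished jobs rather than from their position in the joblist. A minimal sketch, assuming the finished-job count is available from somewhere (this thread does not say how to obtain it) and that the remaining jobs take about as long on average as the finished ones; the function name is hypothetical.

```python
def estimate_remaining_seconds(done: int, total: int, elapsed_seconds: float) -> float:
    """Linear extrapolation of remaining time from the average time per finished job."""
    if total < 1 or not 0 <= done <= total:
        raise ValueError("need total >= 1 and 0 <= done <= total")
    if done == 0:
        # No finished jobs yet, so there is no rate to extrapolate from.
        return float("inf")
    return elapsed_seconds * (total - done) / done
```

Job runtimes may vary widely between buckets, so the result is only a rough guide.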

By the way, my nextflow version is 21.10.6.5660, and I cloned TOGA directly with git clone. Looking forward to your reply.

Best regards!

hillerlab / TOGA

Some nextflow processes died #161