hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License

Failed at STEP 7: Execute CESAR jobs #80

Closed hongbingp closed 11 months ago

hongbingp commented 1 year ago

Hi!

I'm running TOGA on the human and betta fish genomes, but it fails at STEP 7: Execute CESAR jobs. It executes only one nextflow job and then keeps reporting: NOTE: Process 'execute_jobs (288)' terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3). The error messages also say "Cesar output is corrupted".

These are log files: slurm-7283855.err.txt slurm-7283855.txt

My code: ./toga.py /burg/sscc/users/hp2608/data/chain/human_betta2/hg38.betta.allfilled.chain.gz /burg/sscc/users/hp2608/data/hg38/ucsc/hg38.knownGene.bed /burg/sscc/users/hp2608/data/hg38/ucsc/hg38.2bit /burg/sscc/users/hp2608/data/betta_soft/betta_softmasked.2bit --kt --project_dir /burg/sscc/users/hp2608/data/TOGA_results/human-betta --nc nextflow_config_files --nd /burg/sscc/users/hp2608/tmp/nextflow_temp --cb 3,10 --cjn 500 --ms

Could you help with this? Thanks!

Hongbing

kirilenkobm commented 1 year ago

Hi @hongbingp,

That's intriguing. I've never encountered such a bug before. It seems that the CESAR output is deviating from the expected format.

Here are a few steps that could help me identify the problem:

MSLDIQSLDIQCEELSDARWAELLPLLQQCQVVR-LDDCGLTEARCKDISSALR-VNPALAELN-LRSNELGDVGVHCVLQGLQTPSCKIQKLSLQNCCLTGAGCGVLSSTLRTLPTLQELHLSDNLLGDAGLQLLCEGLLDPQCRLEKLQ-LEYCSLSAASCEPLASVLRAKPDFKELT-VSNNDINEAGVRVLCQGLKDSP-CQLEALKLESCGVTSDNCRDLCGIVASKASLRELALGSNKLGDVGMAELCPGLLHPSSRLRTLW--IWECGITAKGCGDLCRVLRAKESLKELSLAGNELGDEGARLLCET-LLEPGCQLESLWVKSCSFTAACCSHFSSVLAQNRFLLELQ-ISNNRLEDAGVREL-CQGLGQPGSVLRVLW---LADCDVSDSSCSSLAATLLANHSLRELDLSNNCLGDAGILQLVESVRQPGCLLEQLVLYDIYWSEEMEDRLQALEKDKPSLRVISX

If you can locate this, please select the corresponding transcript. Then create a reference bed file that only includes this transcript and run TOGA with exactly the same parameters, but use the trimmed reference annotation file. Please then send me the directory (you can exclude the chain file from it).
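A minimal sketch of the "trimmed reference annotation" step, assuming the reference is a tab-separated BED file with the transcript ID in the fourth (name) column. The function name, the transcript IDs and the coordinates below are made up for illustration; they are not part of TOGA.

```python
def filter_bed_by_transcript(bed_lines, transcript_id):
    """Keep only the BED lines whose name field (column 4) equals transcript_id."""
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed or header lines that have fewer than 4 columns.
        if len(fields) >= 4 and fields[3] == transcript_id:
            kept.append(line if line.endswith("\n") else line + "\n")
    return kept


# Hypothetical BED12 records; real input would be read from the reference bed file.
example_bed = [
    "chr1\t100\t500\tTRANSCRIPT_A\t0\t+\t100\t500\t0\t1\t400,\t0,\n",
    "chr2\t200\t900\tTRANSCRIPT_B\t0\t-\t200\t900\t0\t1\t700,\t0,\n",
]

trimmed = filter_bed_by_transcript(example_bed, "TRANSCRIPT_B")
print("".join(trimmed), end="")
```

For a real run, the lines would come from open(path) on the full annotation and the result would be written to a new file that is passed to toga.py in place of the original bed file.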

You can find the CESAR output files in the $toga_output_dir/temp/cesar_results

Looking forward to your response!

hongbingp commented 1 year ago

Hi @kirilenkobm ,

Thank you for your help!

This link contains both input and output files of TOGA, as well as the shell script I used for running TOGA. https://drive.google.com/drive/folders/1_pkHRYcDD11nOm46g-Utvi36X24Z6pgP?usp=drive_link

Let me know if you need more information.

Hongbing

kirilenkobm commented 1 year ago

Hi @hongbingp

That's interesting. I tried to reproduce the issue, but my TOGA run finished successfully:

bkirilenko@delta:/projects/hillerlab/genome/src/TOGA_dev/human-betta-reproduce (master)$ ls
codon.fasta                 inact_mut_data.txt  orthology_classification.tsv  prot.fasta            query_gene_spans.bed        t2bit.link
done.status                 loss_summ_data.tsv  proc_pseudogenes.bed          q2bit.link            query_isoforms.tsv          temp
genes_rejection_reason.tsv  nucleotide.fasta    project_args.json             query_annotation.bed  ref_orphan_transcripts.txt  version.txt
bkirilenko@delta:/projects/hillerlab/genome/src/TOGA_dev/human-betta-reproduce (master)$ cat query_annotation.bed | wc -l
143635
bkirilenko@delta:/projects/hillerlab/genome/src/TOGA_dev/human-betta-reproduce (master)$ cat project_args.json 
{"chain_input": "input_repo_issue/hg38.betta.allfilled.chain.gz", "bed_input": "input_repo_issue/hg38.knownGene.bed", "tDB": "/projects/hillerlab/genome/gbdb-HL/hg38/hg38.2bit", "qDB": "input_repo_issue/betta_softmasked.2bit", "project_dir": "/projects/hillerlab/genome/src/TOGA_dev/human-betta-reproduce", "project_name": null, "min_score": 15000, "isoforms": "", "keep_temp": true, "limit_to_ref_chrom": null, "nextflow_dir": null, "nextflow_config_dir": "/projects/hillerlab/genome/src/TOGA_dev/nextflow_config_files/", "do_not_del_nf_logs": false, "cesar_bigmem_config": null, "para": false, "para_bigmem": false, "chain_jobs_num": 100, "no_chain_filter": false, "orth_score_threshold": 0.5, "cesar_jobs_num": 500, "cesar_binary": null, "using_optimized_cesar": false, "output_opt_cesar_regions": false, "mask_stops": true, "cesar_buckets": "10,100", "cesar_exec_seq": false, "cesar_chain_limit": 100, "cesar_mem_limit": 16, "time_marks": null, "u12": null, "stop_at_chain_class": false, "uhq_flank": 50, "o2o_only": false, "no_fpi": false, "disable_fragments_joining": false, "ld_model": false, "annotate_paralogs": false, "mask_all_first_10p": false}bkirilenko@delta:/projects/hillerlab/genome/src/TOGA_dev/human-betta-reproduce (master)$

I will send you the output later. Could you check whether there is some inconsistency in your system? Please also note that I used the latest TOGA version (1.1.3).
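One way to look for inconsistencies between two runs is to compare their project_args.json files key by key. The sketch below is only an illustration: diff_args is a hypothetical helper, and the two dictionaries are small excerpts, with the "3,10" value assumed to be what --cb 3,10 from the original command ends up as under cesar_buckets.

```python
import json


def diff_args(mine, theirs):
    """Return {key: (mine_value, theirs_value)} for every key whose values differ."""
    keys = sorted(set(mine) | set(theirs))
    return {
        key: (mine.get(key), theirs.get(key))
        for key in keys
        if mine.get(key) != theirs.get(key)
    }


# Excerpts only; a real comparison would json.load() both project_args.json files.
failing_run = json.loads('{"cesar_buckets": "3,10", "cesar_jobs_num": 500, "mask_stops": true}')
working_run = json.loads('{"cesar_buckets": "10,100", "cesar_jobs_num": 500, "mask_stops": true}')

for key, (left, right) in diff_args(failing_run, working_run).items():
    print(f"{key}: {left!r} vs {right!r}")
```

Path-like entries such as project_dir will always differ between machines, so only the remaining differences are worth a closer look.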

hongbingp commented 1 year ago

Thank you so much!

I used TOGA 1.1.3 and ran the provided test: ./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed ${path_to_human_2bit} ${path_to_mouse_2bit} --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --nc ${path_to_nextflow_config_dir} --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms. A similar problem seems to occur in STEP 7, which reported: Process 'execute_jobs (22)' terminated for an unknown reason -- Likely it has been terminated by the external system. But somehow this test finished and produced the output.

Here is the log file of the test. slurm-7436855.txt

I then switched the nextflow executor for CESAR to 'local' and ran TOGA for human and betta fish. Now STEP 7 starts running instead of reporting failure at the very beginning. So I wonder whether there are additional parameters I need to set for CESAR so that I can run it on slurm?

kirilenkobm commented 1 year ago

This is the longest and most unstable part of the TOGA pipeline. The jobs in this stage are quite heavy and sometimes take longer than expected. Additionally, certain clusters may not handle them well. To compensate for this, TOGA attempts to rerun each CESAR job multiple times. Therefore, it is normal if some CESAR jobs crash, but there should be no issues on the engineering side.
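To illustrate the idea of rerunning a crashed job, here is a minimal sketch of a retry loop around a single command. It is not how TOGA or nextflow implement retries; run_with_retries and max_attempts are hypothetical names, and the command is a trivial stand-in for a CESAR job.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=3):
    """Run cmd until it exits with code 0; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with exit code {result.returncode}", file=sys.stderr)
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")


# Stand-in for a real job: a Python one-liner that always succeeds.
attempts_used = run_with_retries([sys.executable, "-c", "print('job done')"])
print(f"succeeded after {attempts_used} attempt(s)")
```

A job that fails on every attempt ends in an error rather than being skipped silently, which is the case worth checking in the logs.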

When TOGA runs locally, it uses all available CPU cores on the local machine (this can be a PC, a laptop, or just the master node of a cluster; it also suits machines with many CPUs). This setup can work fine for small genomes or small subsets of the reference annotation. In general, however, using a cluster is strongly recommended for better performance.

The error message "Process 'execute_jobs (22)' terminated for an unknown reason -- Likely it has been terminated by the external system" does not provide any useful information, to be honest.

To assist further, could you run TOGA with the flag --do_not_del_nf_logs, then locate the nextflow_logs directory, compress it, and send it to me?

I'm also planning to release another update for TOGA today, which may improve its stability.

hongbingp commented 1 year ago

Thank you for the information!

I ran the test twice (named test2 and test3) using exactly the same script. In STEP 7, test2 failed twice and then succeeded on retry, while test3 failed four times and reported errors.

Slurm log files test2_log.txt test3_log.txt

Nextflow log files nextflow_logs.tar.gz

In addition, I ran TOGA for mouse and Peromyscus maniculatus. All CESAR jobs failed, but the run somehow proceeded to the final step and produced results. Is this normal? Can I use the output for analysis? Here is the log file: slurm-7490328.txt

Thanks

hongbingp commented 1 year ago

Hi @kirilenkobm

I’ve spent two weeks troubleshooting, but I just can’t get TOGA to run on our cluster. I wonder if you could help me run TOGA for mouse and Peromyscus maniculatus. If so, I can provide the input data. Thank you for all your help!

MichaelHiller commented 1 year ago

Sure, pls email me with a link to the data (we need the genome fasta). If you have a repeatModeler lib for that assembly, we can use that too. Is this assembly on NCBI? e.g. https://www.ncbi.nlm.nih.gov/assembly/GCA_026229955.1 ?

kirilenkobm commented 1 year ago

Hi @hongbingp

It looks like nextflow does not suit all users. In version 1.1.5 (or 1.1.6), I plan to release another (much better) way to handle parallel jobs. It will be a modular structure (following the "strategy" OOP pattern) in which I provide users with a class for implementing their own way of handling parallel jobs, plus the necessary documentation and examples of how it is implemented for nextflow and para.

This module will be included in the TOGA pipeline (for now it is in the repository but not yet attached): https://github.com/hillerlab/TOGA/blob/master/parallel_jobs_manager.py The strategy for para is already implemented; the one for nextflow (which will be the default) is nearly done. The custom strategy is a template.
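For readers unfamiliar with the strategy pattern, a minimal sketch of the general idea follows. The class and method names (JobStrategy, LocalStrategy, DryRunStrategy, JobRunner, execute) are hypothetical and are not taken from parallel_jobs_manager.py; the real module may look quite different.

```python
import shlex
import subprocess
from abc import ABC, abstractmethod


class JobStrategy(ABC):
    """Interface that every job-execution backend implements (hypothetical)."""

    @abstractmethod
    def execute(self, commands):
        """Run the given shell commands and return one exit code per command."""


class LocalStrategy(JobStrategy):
    """Runs the commands one after another on the local machine."""

    def execute(self, commands):
        return [subprocess.run(shlex.split(cmd)).returncode for cmd in commands]


class DryRunStrategy(JobStrategy):
    """Records the commands without running them; handy for testing."""

    def __init__(self):
        self.seen = []

    def execute(self, commands):
        self.seen.extend(commands)
        return [0] * len(commands)


class JobRunner:
    """The pipeline talks to this class only, never to a concrete backend."""

    def __init__(self, strategy):
        self.strategy = strategy

    def run(self, commands):
        return self.strategy.execute(commands)


dry = DryRunStrategy()
runner = JobRunner(dry)
exit_codes = runner.run(["echo job_1", "echo job_2"])
print(exit_codes, dry.seen)
```

Under this design, supporting a different scheduler means writing one more subclass of the interface and passing it to the runner; the code that builds the job list stays unchanged.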