metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Error in rule #692

Closed bbagy closed 7 months ago

bbagy commented 11 months ago

Hello, I am currently working on setting up the preparation of MAGs, and I have found this pipeline to be quite useful. However, I've encountered a couple of issues that I'm struggling to resolve.

The first issue is related to file naming. For example, for a file named 30008_S1_L001_R1_001.fastq.gz, Atlas interprets "S1" as the sample name, but it should actually be "30008".

The second issue is an error message, which is as follows:

[Wed Aug 30 09:21:47 2023]
rule run_spades:
    input: S1/assembly/reads/QC.errorcorr.merged_R1.fastq.gz, S1/assembly/reads/QC.errorcorr.merged_R2.fastq.gz, S1/assembly/reads/QC.errorcorr.merged_me.fastq.gz
    output: S1/assembly/contigs.fasta, S1/assembly/scaffolds.fasta
    log: S1/logs/assembly/spades.log
    jobid: 340
    benchmark: logs/benchmarks/assembly/spades/S1.txt
    reason: Missing output files: S1/assembly/scaffolds.fasta
    wildcards: sample=S1
    threads: 8
    resources: tmpdir=/tmp, mem=148, time=48, mem_mb=151632, mem_mib=144608, time_min=2880, runtime=2880

Activating conda environment: databases/condaenvs/068dfd63ab07abe329c4244a1b501dfa
[Wed Aug 30 09:21:48 2023]
Error in rule run_spades:
    jobid: 340
    input: S1/assembly/reads/QC.errorcorr.merged_R1.fastq.gz, S1/assembly/reads/QC.errorcorr.merged_R2.fastq.gz, S1/assembly/reads/QC.errorcorr.merged_me.fastq.gz
    output: S1/assembly/contigs.fasta, S1/assembly/scaffolds.fasta
    log: S1/logs/assembly/spades.log (check log file(s) for error details)
    conda-env: /media/uhlemann/core5Ext/03_MG/Novaseq/20230124_Yael_PLT/atlas/databases/condaenvs/068dfd63ab07abe329c4244a1b501dfa
    shell:
        rm -f S1/assembly/pipelinestate/stage*_copy_files 2> S1/logs/assembly/spades.log ; spades.py --threads 8 --memory 148 -o S1/assembly -k auto --meta --pe1-1 S1/assembly/reads/QC.errorcorr.merged_R1.fastq.gz --pe1-2 S1/assembly/reads/QC.errorcorr.merged_R2.fastq.gz --pe1-m S1/assembly/reads/QC.errorcorr.merged_me.fastq.gz --only-assembler >> S1/logs/assembly/spades.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Note the path to the log file for debugging.
Documentation is available at: https://metagenome-atlas.readthedocs.io
Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues
Complete log: .snakemake/log/2023-08-30T091347.371565.snakemake.log
[Atlas] CRITICAL: Command 'snakemake --snakefile /home/uhlemann/miniconda3/envs/atlas/lib/python3.10/site-packages/atlas/workflow/Snakefile --directory /media/uhlemann/core5Ext/03_MG/Novaseq/20230124_Yael_PLT/atlas --rerun-triggers mtime --jobs 20 --rerun-incomplete --configfile '/media/uhlemann/core5Ext/03_MG/Novaseq/20230124_Yael_PLT/atlas/config.yaml' --nolock --use-conda --conda-prefix /media/uhlemann/core5Ext/03_MG/Novaseq/20230124_Yael_PLT/atlas/databases/conda_envs --resources mem=148 mem_mb=151632 java_mem=125 --scheduler greedy all ' returned non-zero exit status 1.

I would greatly appreciate it if you could take a look and provide assistance in resolving these matters. Thank you.

SilasK commented 11 months ago

1) 30008 is not a valid sample name, because a sample name should start with a letter, so Atlas simply assigns S1 and so on. You could name it S30008, but as you are already at the assembly step I suggest you continue. The name mapping is in samples.tsv.
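
The naming rule above can be sketched in a few lines. This is not Atlas's own code, only an illustration under stated assumptions: the helper name `sample_name_from_fastq` is hypothetical, the sample ID is assumed to be the part of the file name before the first underscore, and the `S` prefix follows the S30008 suggestion.

```python
import re


def sample_name_from_fastq(filename: str, prefix: str = "S") -> str:
    """Derive a sample name that starts with a letter from a FASTQ file name.

    Hypothetical helper for illustration; Atlas's real logic may differ.
    """
    # Drop any directory part and everything after the first dot,
    # e.g. "30008_S1_L001_R1_001.fastq.gz" -> "30008_S1_L001_R1_001".
    base = filename.split("/")[-1].split(".")[0]
    # Assumption: the sample ID is the text before the first underscore.
    stem = base.split("_")[0]
    # Keep only ASCII letters and digits.
    name = re.sub(r"[^A-Za-z0-9]", "", stem)
    # A valid name should start with a letter, so prefix it otherwise.
    if not name or not name[0].isalpha():
        name = prefix + name
    return name


print(sample_name_from_fastq("30008_S1_L001_R1_001.fastq.gz"))
```

With this rule the file from the first post maps to S30008 rather than S1, and the result could be written into the first column of samples.tsv by hand.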

2) What is in S1/logs/assembly/spades.log (check log file(s) for error details)? Do you have the 148 GB of memory?

bbagy commented 11 months ago

I appreciate your comments. I will try to change the file names.

Here is the information about my workstation's RAM. It seems like a lot but might not be enough.

        total   used    free    shared  buff/cache  available
Mem:    155Gi   13Gi    1.2Gi   4.0Mi   140Gi       140Gi
Swap:   4.0Gi   600Mi   3.4Gi

Here is the message from spades.log:

$ cat spades.log

== Warning == output dir is not empty! Please, clean output directory before run.
== Error == file is empty: /media/uhlemann/core5Ext/03_MG/Novaseq/20230124_Yael_PLT/atlas/S1/assembly/reads/QC.errorcorr.merged_R1.fastq.gz (left reads, library number: 1, library type: paired-end)

In case you have troubles running SPAdes, you can write to spades.support@cab.spbu.ru or report an issue on our GitHub repository github.com/ablab/spades
Please provide us with params.txt and spades.log files from the output directory.
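
Since SPAdes reports an empty input file, it may help to check which of the pre-processed read files actually contain reads before rerunning. The following is a minimal sketch using only the Python standard library; the function name `count_fastq_reads` and the `limit` parameter are illustrative and not part of Atlas or SPAdes.

```python
import gzip


def count_fastq_reads(path: str, limit: int = 1000) -> int:
    """Count reads in a gzipped FASTQ file, stopping once `limit` reads are seen.

    Assumes the usual four-lines-per-read FASTQ layout.
    A file without any reads (including a zero-byte file) gives 0.
    """
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
            # Stop early: we only want to know whether the file is empty.
            if n_lines >= 4 * limit:
                break
    return n_lines // 4
```

Calling it on S1/assembly/reads/QC.errorcorr.merged_R1.fastq.gz and on the corresponding files of other samples would show whether only this sample is affected.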

SilasK commented 11 months ago

Apparently the merging of reads didn't work for this sample.

That is a bit odd for normal paired-end reads.

It might be due to an error in the merging step; check the logs (samplename/logs...).

Check if other samples have the same problem.

You can disable the merging in the config file.

bbagy commented 11 months ago

Upon reviewing the logs of other files, I noticed that no log files were generated. I suspect the pipeline halted after processing this particular file.

Are you suggesting that I change "merge_pairs_before_assembly: true" to "false" in the config.yaml file?

SilasK commented 11 months ago

Check the {sample}/logs/assembly/pre_process/.. log files of this sample.

Yes, you can switch merge_pairs_before_assembly to false. This should solve the error at hand, but I suspect another problem upstream.
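
Editing config.yaml by hand is the simplest way to do this. For anyone who wants to script the change across several project directories, here is a minimal sketch that flips a top-level `key: true/false` entry in the text of a YAML file. The function name `set_config_flag` is hypothetical, and the sketch assumes the key sits unindented at the start of a line.

```python
import re


def set_config_flag(config_text: str, key: str, value: bool) -> str:
    """Set a top-level `key: true/false` entry, leaving all other lines as-is."""
    # Match "key: <something>" at the start of a line; [ \t]* avoids
    # swallowing a newline when the value is missing.
    pattern = re.compile(
        rf"^({re.escape(key)}[ \t]*:[ \t]*)(\S+)", flags=re.MULTILINE
    )
    new_value = "true" if value else "false"
    updated, n_changed = pattern.subn(lambda m: m.group(1) + new_value, config_text)
    if n_changed == 0:
        raise KeyError(f"{key} not found in config")
    return updated


example = "threads: 8\nmerge_pairs_before_assembly: true\nmem: 148\n"
print(set_config_flag(example, "merge_pairs_before_assembly", False))
```

The example text is made up for illustration; a real config.yaml contains many more entries, which this approach leaves untouched.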

bbagy commented 11 months ago

Hi,

I believe I've identified the main issue: I think it primarily revolves around the "simple name" problem. As you can see, the names "S1, S2, S3" are not unique enough to serve as sample identifiers. When I edited the "samples.tsv" file to create more distinct and unique sample names, I was able to successfully execute most of the pipeline.

However, I encountered another problem related to the completion of the pipeline. It was able to progress up to 99.5% before failing.

Could you please take a look at the errors?

Error in rule classify:
    jobid: 1193
    input: genomes/taxonomy/gtdb/align, genomes/genomes
    output: genomes/taxonomy/gtdb/classify
    log: logs/taxonomy/gtdbtk/classify.txt, genomes/taxonomy/gtdb/gtdbtk.log (check log file(s) for error details)
    conda-env: /media/uhlemann/core5Ext/03_MG/Novaseq/Ladas_Probiotics_NV/20230512_Lad_pro_set3_NV_fastq/databases/condaenvs/2d90fedf2dde6cd1884e1dc67b60cef7
    shell:
        export GTDBTK_DATA_PATH="/media/uhlemann/core5Ext/03_MG/Novaseq/Ladas_Probiotics_NV/20230512_Lad_pro_set3_NV_fastq/databases/GTDB_V08_R214" ; gtdbtk classify --genome_dir genomes/genomes --align_dir genomes/taxonomy/gtdb --mash_db /media/uhlemann/core5Ext/03_MG/Novaseq/Ladas_Probiotics_NV/20230512_Lad_pro_set3_NV_fastq/databases/GTDB_V08_R214/mash_db --out_dir genomes/taxonomy/gtdb --tmpdir /tmp --extension fasta --cpus 8 &> logs/taxonomy/gtdbtk/classify.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job classify since they might be corrupted: genomes/taxonomy/gtdb/classify
Select jobs to execute...

[Tue Sep 5 11:42:35 2023]
localrule all_gtdb_trees:
    input: genomes/tree/gtdbtk.ar53.nwk, genomes/tree/gtdbtk.bac120.nwk
    output: genomes/tree/finished_gtdb_trees
    jobid: 1186
    reason: Missing output files: genomes/tree/finished_gtdb_trees
    resources: tmpdir=/tmp, mem_mb=60000, mem_mib=57221, time_min=300, runtime=300

Touching output file genomes/tree/finished_gtdb_trees.
[Tue Sep 5 11:42:35 2023]
Finished job 1186.
1 of 5 steps (20%) done
Exiting because a job execution failed. Look above for error message
Note the path to the log file for debugging.
Documentation is available at: https://metagenome-atlas.readthedocs.io
Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues
Complete log: .snakemake/log/2023-09-05T114228.895263.snakemake.log
[Atlas] CRITICAL: Command 'snakemake --snakefile /home/uhlemann/miniconda3/envs/atlas/lib/python3.10/site-packages/atlas/workflow/Snakefile --directory /media/uhlemann/core5Ext/03_MG/Novaseq/Ladas_Probiotics_NV/20230512_Lad_pro_set3_NV_fastq --rerun-triggers mtime --jobs 20 --rerun-incomplete --configfile '/media/uhlemann/core5Ext/03_MG/Novaseq/Ladas_Probiotics_NV/20230512_Lad_pro_set3_NV_fastq/config.yaml' --nolock --use-conda --conda-prefix /media/uhlemann/core5Ext/03_MG/Novaseq/Ladas_Probiotics_NV/20230512_Lad_pro_set3_NV_fastq/databases/conda_envs --resources mem=148 mem_mb=151632 java_mem=125 --scheduler greedy all --keep-going ' returned non-zero exit status 1.

I believe some of the output files required for the next stage of the pipeline are not being generated. I'm wondering if there's a way to run the pipeline with options to "skip" or "ignore" these specific steps and continue with the rest of the process. Could you please suggest how to handle this?

SilasK commented 11 months ago

?? Atlas generated a samples.tsv with non-unique names? This should not happen. Could you please rerun atlas init and then send me the samples.tsv, either via this issue or by mail (you can find my address on my webpage)?
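
A quick way to check the sample table for the two problems discussed in this thread (duplicate names, and names not starting with a letter) is sketched below. It is not part of Atlas; the function name `check_sample_names` is hypothetical, and it assumes the table is tab-separated with a header row and the sample names in the first column.

```python
import csv
import io
from collections import Counter


def check_sample_names(tsv_text: str) -> list:
    """Return a list of problems found in the first column of a sample table."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    # Assumption: first row is the header, first column holds the sample names.
    names = [row[0] for row in rows[1:] if row]
    problems = []
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"duplicate sample name: {name}")
        if not name[:1].isalpha():
            problems.append(f"sample name does not start with a letter: {name!r}")
    return problems
```

An empty list means neither problem was found; the text of the table can be read with `open("samples.tsv").read()`.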

SilasK commented 11 months ago

After you edited the sample names, I assume you restarted the whole pipeline.

Note: running atlas with --keep-going will run all steps that can be run.

You can deactivate the gtdb_taxonomy annotation in the config file.

But you should also check the log files: logs/taxonomy/gtdbtk/classify.txt and genomes/taxonomy/gtdb/gtdbtk.log (check log file(s) for error details).
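
When a run produces long error output, the log paths to inspect can be pulled out of the "Error in rule" block automatically. The following is a minimal sketch based only on the message layout quoted in this thread (a `log:` field followed by "(check log file(s) for error details)"); the function name `log_paths_from_error` is hypothetical.

```python
import re


def log_paths_from_error(message: str) -> list:
    """Extract the comma-separated log paths from a Snakemake error message."""
    # Capture everything between "log:" and the "(check log file" hint.
    match = re.search(r"\blog:\s*(.+?)\s*\(check log file", message)
    if match is None:
        return []
    return [p.strip() for p in match.group(1).split(",") if p.strip()]


error_text = (
    "Error in rule classify: jobid: 1193 "
    "input: genomes/taxonomy/gtdb/align, genomes/genomes "
    "output: genomes/taxonomy/gtdb/classify "
    "log: logs/taxonomy/gtdbtk/classify.txt, genomes/taxonomy/gtdb/gtdbtk.log "
    "(check log file(s) for error details)"
)
for path in log_paths_from_error(error_text):
    print(path)
```

For the classify error above this yields the two log files named in this comment.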

github-actions[bot] commented 9 months ago

There was no activity since some time. I hope your issue is solved in the mean time. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.

bbagy commented 9 months ago

Thank you for your notification. The problem has not been resolved, but you may close this issue for now. I will reach out if any further problems arise.

Best, Heekuk

github-actions[bot] commented 7 months ago

There was no activity since some time. I hope your issue is solved in the mean time. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.