metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License
364 stars 97 forks source link

atlas metatranscriptome data - MissingOutputException in rule run_spades #696

Closed jilldv closed 11 months ago

jilldv commented 11 months ago

Hi,

I am running atlas (version 2.16.2) on metatranscriptome data. I started a new project for this with atlas init specifying --data-type metatranscriptome and changed the assembler to rnaSPAdes. When running the assembly I got the following error:

MissingOutputException in rule run_spades in file /home/jill/mambaforge/envs/atlasenv/lib/python3.10/site-packages/atlas/workflow/rules/assemble.smk, line 419: Job 1124 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait: T231/assembly/scaffolds.fasta

On the forum here I found a similar issue explaining this is due to different naming of output files with rnaSPAdes. Following the solution proposed in this issue I modified the file assemble.smk with the code given there, which seemed to work for the author of the issue. However, when I restart atlas I still get the same error message.

I also saw this discussion where you suggest to make symbolic links to the according files. But I am not sure on how to do this, do I make this adjustment in the config file or also in the assemble.smk file?

What modification would I have to make to adapt the naming according with the naming atlas needs?

Thank you for any help you can give.

Best, Jill

jilldv commented 11 months ago

I forgot to put spades_use_scaffolds to false in the config file. Sorry for the mistake, now the pipeline is continuing.

SilasK commented 11 months ago

Thank you for the feedback.

jilldv commented 11 months ago

Sorry to reopen this but somewhat further in the pipeline I got the following error message during the rule filter_genes:

Error in rule filter_genes: jobid: 623 input: T153/annotation/predicted_genes/T153.fna, T153/annotation/predicted_genes/T153.faa output: T153/annotation/predicted_genes/T153.filtered.fna, T153/annotation/predicted_genes/T153.filtered.faa, T153/annotation/predicted_genes/T153.short.faa log: logs/Genecatalog/filter_genes/T153/annotation/predicted_genes/T153.log (check log file(s) for error details) conda-env: /home/jill/atlas_databases/condaenvs/72a46d5227cafddc29306f8a1de5f587

Removing output files of failed job filter_genes since they might be corrupted: T153/annotation/predicted_genes/T153.filtered.fna, T153/annotation/predicted_genes/T153.filtered.faa, T153/annotation/predicted_genes/T153.short.faa Building DAG of jobs... Using shell: /bin/bash Provided cores: 8 Rules claiming more threads will be scaled down. Provided resources: mem_mb=60000, mem_mib=60000, time_min=300 Singularity containers: ignored Select jobs to execute... [Mon Sep 18 10:55:38 2023] Finished job 1727. 5 of 765 steps (1%) done [Mon Sep 18 10:58:52 2023] Finished job 1779. 6 of 765 steps (1%) done Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Note the path to the log file for debugging. Documentation is available at: https://metagenome-atlas.readthedocs.io Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues Complete log: .snakemake/log/2023-09-18T105455.864984.snakemake.log [Atlas] CRITICAL: Command 'snakemake --snakefile /home/jill/mambaforge/envs/atlasenv/lib/python3.10/site-packages/atlas/workflow/Snakefile --directory /home/jill/climartic/metatranscriptomics/atlas/metatrans_genecatalog --rerun-triggers mtime --jobs 8 --rerun-incomplete --configfile '/home/jill/climartic/metatranscriptomics/atlas/metatrans_genecatalog/config.yaml' --nolock --use-conda --conda-prefix /home/jill/atlas_databases/conda_envs --resources mem=200 mem_mb=204800 java_mem=170 --scheduler greedy genecatalog ' returned non-zero exit status 1.

The log file from this rule gives the following: File "/home/jill/climartic/metatranscriptomics/atlas/metatrans_genecatalog/.snakemake/scripts/tmpcsikqxms.filter_genes.py", line 48, in for name, seq, comment in fna_iterator: ValueError: not enough values to unpack (expected 3, got 2)

I checked the faa and fna files but these look ok to me. Do you think this could have anything to do with the adjustments made for running metatranscriptome data (using rna SPAdes and setting spades_use_scaffolds to false)?

I am running the genecatalog workflow on metatranscriptomics data. I attached the log file from the run and the snakemake log file, as well as the faa and fna files T153.log 2023-09-18T105455.864984.snakemake.log T153.zip

SilasK commented 11 months ago

Can you check the two files T153/annotation/predicted_genes/T153.fna, T153/annotation/predicted_genes/T153.faa

I guess the problem is the transcriptome assemly is the genes. and prodigal doesn't really predict correct genes here. I would need to adapt it.

jilldv commented 11 months ago

I attached the files T153/annotation/predicted_genes/T153.fna and T153/annotation/predicted_genes/T153.faa here in a zip file since they were too big to add separately. T153.zip

SilasK commented 11 months ago

The two latext commits to the master branch should fix the error. could you try to install de dev version of atlas and test it.

jilldv commented 11 months ago

The pipeline is running again. Thank you for the help.

jilldv commented 11 months ago

The pipeline continued running until the rule align_reads_to_Genecatalog. Then I got the following error:

rule align_reads_to_Genecatalog: input: ref/Genecatalog.mmi, Intermediate/genecatalog/alignments/T50.fastq.gz output: Genecatalog/alignments/T50.bam log: logs/Genecatalog/alignment/T50_map.log jobid: 2408 reason: Missing output files: Genecatalog/alignments/T50.bam wildcards: sample=T50 threads: 8 resources: tmpdir=/tmp, mem_mb=60000, mem_mib=57221, time_min=300, runtime=300

Activating conda environment: ../../../../atlas_databases/condaenvs/f077e509b9873bc1010104c7314c8556 Environment defines Python version < 3.7. Using Python of the main process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.7 only. Activating conda environment: ../../../../atlas_databases/condaenvs/f077e509b9873bc1010104c7314c8556 Traceback (most recent call last): File "/home/jill/climartic/metatranscriptomics/atlas/metatrans_genecatalog/.snakemake/scripts/tmpiiq67h79.wrapper.py", line 50, in shell( File "/home/jill/mambaforge/envs/atlas-dev/lib/python3.11/site-packages/snakemake/shell.py", line 294, in new raise sp.CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'set -euo pipefail; (minimap2 -t 8 -x sr --split-prefix T50split -a ref/Genecatalog.mmi Intermediate/genecatalog/alignments/T50.fastq.gz | samtools view -h --threads 7 --output-fmt BAM > Genecatalog/alignments/T50.bam) 2> logs/Genecatalog/alignment/T50_map.log' returned non-zero exit status 1. [Tue Sep 26 08:22:26 2023] Error in rule align_reads_to_Genecatalog: jobid: 2408 input: ref/Genecatalog.mmi, Intermediate/genecatalog/alignments/T50.fastq.gz output: Genecatalog/alignments/T50.bam log: logs/Genecatalog/alignment/T50_map.log (check log file(s) for error details) conda-env: /home/jill/atlas_databases/condaenvs/f077e509b9873bc1010104c7314c8556

Removing output files of failed job align_reads_to_Genecatalog since they might be corrupted: Genecatalog/alignments/T50.bam Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Note the path to the log file for debugging. Documentation is available at: https://metagenome-atlas.readthedocs.io Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues Complete log: .snakemake/log/2023-09-26T081702.465253.snakemake.log [Atlas] CRITICAL: Command 'snakemake --snakefile /home/jill/atlas/atlas/workflow/Snakefile --directory /home/jill/climartic/metatranscriptomics/atlas/metatrans_genecatalog --rerun-triggers mtime --jobs 8 --rerun-incomplete --configfile '/home/jill/climartic/metatranscriptomics/atlas/metatrans_genecatalog/config.yaml' --nolock --use-conda --conda-prefix /home/jill/atlas_databases/conda_envs --resources mem=200 mem_mb=204800 java_mem=170 --scheduler greedy genecatalog ' returned non-zero exit status 1.

The log file from this rule gives the following error: [E::sam_hrecs_update_hashes] Duplicate entry "Gene079411" in sam header samtools view: failed to add PG line to the header

My guess is this is also related to the fact that the assembly is already the genes? I found a similar error message on the samtools forum where they explain this is because two reads have the same name. But I have no idea on how to solve this or work around this?

I attached the logs/Genecatalog/alignment/T50_map.log file of the run and the snakemake log file 2023-09-26T081702.465253.snakemake.log T50_map.log

Thank you for any help you could give on this and my apologies for all the questions.

SilasK commented 11 months ago

could you check if this is true

grep Gene079411 Genecatalog/Genecatalog.fna or grep Gene079411 Genecatalog/clustering/representative2genenr.tsv

if this is the case: I guess if you rm -r Genecatalog/Genecatalog.f* and restart atlas should re generate a genecatalog that is ok. If not you might want to delete the whole folder rm -r Genecatalog

alternatively: you might simply delete the minimap index ref/Genecatalog.mmi and restart.

jilldv commented 10 months ago

I deleted the Genecatalog folder and restarted Atlas but it gives the same error message.

SilasK commented 10 months ago

And the genecatalog contains twice the same gene? check with grep.

jilldv commented 10 months ago

Sorry I missed that part. I checked with grep now and it gives only 1 match, which seems strange since samtools complains about a duplicated sequence. grep Gene217115 Genecatalog/gene_catalog.fna >Gene217115 T1_0_2

SilasK commented 10 months ago

Could you check if the mmi index is also updated, e.g. if the timestamp is newer than that of the genecatalog.

ls -l  Genecatalog/gene_catalog.fna
ls -l ref/Genecatalog.mmi

If this is not the case you need to delete the mmi index.

Maybe it is a problem with the reads as in the link you send.

The mapping of the genecatalog combines R1,R2 and se reads into a single fastq for mapping. ( I don't think it makes sense to map paired-end reads to genes..)

Does the log say something about which read is duplicated? Check the head of all three fastq.gz files of a sample. zcat path/to/fastq.gz | head

In theory, there should be a suffix to the read names that distinguishes R1 and R2.

If it is the case. Try to delete the se reads from the sample table. (make a backup and drop the column) then restart.

If it is not the case. Please report it. then there is a bug in atlas. using reformat.sh you can add this suffix. But then atlas thinks you have new reads and want's to restart from QC. So you would need to run. atlas run genecatalog --touch to update all timestamps. Which can take some time.

jilldv commented 10 months ago

The mmi index gives the same timestamp as the genecatalog so I guess it is updated. But I don't see any suffix to the read names related to R1 or R2 in the fastq.gz files. zcat T1/sequence_quality_control/T1_QC_R1.fastq.gz | head @NGSNJ-086:333:GW200617619th:4:2143:6479:18693 TACGTTCCCGGCAACGAGGCTCAACTTTCGTTGAGCGACATTGCTCACCACAAGCGGGTCCTCTCTCGTGGTTGCTACCCGTTTTTATCCGGGCGCTCAGAAAAACCTCTCGGTCTTTCCTGGCAAGCCTTTCGGCTTCAAGTAGGGCGT + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ @NGSNJ-086:333:GW200617619th:2:1476:20980:13338 TACGTTCCCGGCAACGAGGCTCAACTTTCGTTGAGCGACATTGCTCGCCACAAGCGGGTCCTCTCTCGTGGTTGCTACCCGTTTTTATCCGGGCGCTCAGAAAAACCTCTCGGTCTTTCCTGGCAAGCCTTTCGGCCTCAACTAAGGGCG + FFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJ @NGSNJ-086:333:GW200617619th:3:1234:2899:26584 CGTTCCCGGCAACGAGGCTCAACTTTCGTTGAGCGACATTGCTCACCACAAGCGGGTCCTTATCTCGTGGTTGCTACCCGTTTTTATCCGGGCGCTCAGAAAAACCTCTCGGTCTTTCCTGGCAAGCCTTTCGGCTTCAACTAGGGCGTC zcat T1/sequence_quality_control/T1_QC_R2.fastq.gz | head @NGSNJ-086:333:GW200617619th:4:2143:6479:18693 GTTCTTGAGACCAACAAAGGGATCGGCGAACGGAGTCGTGAGGCTCTGGGAGAGAGCAAAAGCAGAGCGGTTTGCAGTTGCTATTGACGCCCTACTTGAAGCCGAAAGGCTTGCCAGGAAAGACCGAGAGGTTTTTCTGAGCGCCCGGAT + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ @NGSNJ-086:333:GW200617619th:2:1476:20980:13338 CTATTGACGCCCTTAGTTGAGGCCGAAAGGCTTGCCAGGAAAGACCGAGAGGTTTTTCTGAGCGCCCGGATAAAAACGGGTAGCAACCACGAGAGAGGACCCGCTTGTGGCGAGCAATGTCGCTCAACGAAAGTTGAGCCTCGTTGCCGG + FFFFFFFJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ @NGSNJ-086:333:GW200617619th:3:1234:2899:26584 CTTGAGACCAACGAAGGGATCGGCGAACGGAGTCGTGAGGCTCTGGGAGAGAGCAAAAGCAGAGCGGTTTGCAGTTGCTATTGACGCCCTAGTTGAAGCCGAAAGGCTTGCCAGGAAAGACCGAGAGGTTTTTCTGAGCGCCCGGATAAA zcat T1/sequence_quality_control/T1_QC_se.fastq.gz | head @NGSNJ-086:333:GW200617619th:2:1645:8006:10285 GAAGCTCGGCCTTGCGTTGCAGGATGGCATTGGCGGCGCGGGAGACCTGATGGTCGGCGAAAGCCTTTTCGGGTGGGCGGGCGCAGTGCTGTCGTGGGTTGGGGCCGGGCTTTTTGTAGTCTGGTAGCAAATGAAAAAAAAAAAAAGTAA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFF:,F @NGSNJ-086:333:GW200617619th:2:2542:28691:3834 AAAACAAGCAAAGAAAATCGATAAACCGATTAAATAAAGACTATGGATAAAAGTAAAGCAACAAATAGAAAAAAATAATCAAACAAAAGTTTCAAAATAATAAAGACAGGAAAAATTAATAATTGTAAGATATGAGTACAAGAAACAAAG + FF:FFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFF,FF:FFFFFFFFFFFFFFFFFFFF:FF:FFFFF::FFFFFFFF,FFFFFFFFFFJJJJJJJJJJJHJJJ<JJJJJJHJHJJJJJJJJJJJJJJJJJJJJJJJ<HJJJJHJJ @NGSNJ-086:333:GW200617619th:3:2626:31955:5040 AAAAATAATAAAAAAAATATTATAGATGACCAAAAAAAATTTTAACTTAAACAAAAAAGAAAAAGAAAAAATAAAAAAATAAATCAAGCAAGAAAATTAAAAAAAATAATAAAAACA

The log only says there is a duplicated sequence, but not which read.

So I should run reformat.sh on the R1 and R2 fastq.gz files of all samples? Are there any additional parameters I need to add to reformat or just the names of the in and out fastq files?

SilasK commented 10 months ago

Yes there are some other parameters to add but I don't know by heart. Type reformat.sh to see them.

You did qc with atlas, did you ?

I will see at what stage these suffix should have been added.

jilldv commented 10 months ago

I added the suffix to the R1 and R2 fastq files and restarted atlas. Now the pipeline finally finished without further problems. Should I still report this (and how do I do this)?

Thank you very much for all the help!

SilasK commented 10 months ago

Thank you for the feedback thisnis enough.

SilasK commented 10 months ago

You did the qc also with atlas? In the default parameters there is already the option to add the slash to the reads. I don't know what went wrong. But happy that it works now...

jilldv commented 10 months ago

Yes I did qc also with atlas, I didn't change any parameter for qc. It is a bit strange indeed but fortunately it could be solved.