metagenlab / MeSS

Snakemake pipeline for simulating shotgun metagenomic samples
https://metagenlab.github.io/MeSS
MIT License
18 stars 2 forks source link

Error in rule collect_taxa_summaries #27

Closed teojcryan closed 3 weeks ago

teojcryan commented 1 month ago

Thanks for developing and maintaining this tool!

MeSS version

mess, version 0.9.1 compiled from source as of 2024-10-24.

Describe the bug

Ran into the following errors when running mess hmp-template.

Error in rule download_assemblies:
    jobid: 3
    input: data/buccal_mucosa/uniq_entries.tsv
    output: data/buccal_mucosa/assembly_finder/assembly_summary.tsv, data/buccal_mucosa/assembly_finder/sequence_report.tsv, data/buccal_mucosa/assembly_finder/taxonomy.tsv
    log: data/buccal_mucosa/logs/assembly_finder.log (check log file(s) for error details)
    conda-env: /path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_
    shell:

        assembly_finder -i data/buccal_mucosa/uniq_entries.tsv \
        --taxonkit /path/.taxonkit \
        --threads 4 \
        --taxon --limit 1 --api-key ab2b5172ab3d7a4931c59a444fbc1da51008 --compressed True --include genome,seq-report --source all --reference True --annotated False --atypical True --mag all \
        --no-use-conda \
        -o data/buccal_mucosa/assembly_finder 2> data/buccal_mucosa/logs/assembly_finder.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
localrule collect_taxa_summaries:
    input: [# list of all json files e.g. data/buccal_mucosa/assembly_finder/json/587.json]
    output: data/buccal_mucosa/assembly_finder/genome_summaries.json
    jobid: 6
    reason: Missing output files: data/buccal_mucosa/assembly_finder/genome_summaries.json; Input files updated by another job: [# list of all json files e.g. data/buccal_mucosa/assembly_finder/json/1760811.json]
    resources: tmpdir=/tmp

host: [...]
RuleException:
ValueError in file /rds/projects/w/wheelern-wwmgs/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/assembly_finder/workflow/rules/functions.smk, line 13:
If using all scalar values, you must pass an index
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/assembly_finder/workflow/rules/download.smk", line 73, in __rule_collect_taxa_summaries
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/assembly_finder/workflow/rules/functions.smk", line 13, in read_json
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/io/json/_json.py", line 815, in read_json
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/io/json/_json.py", line 1025, in read
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/io/json/_json.py", line 1187, in parse
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/io/json/_json.py", line 1402, in _parse
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/core/frame.py", line 778, in __init__
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
  File "/path/.snakemake/conda/301ec0f6d2247aeb4ca88d0f25d5e7e8_/lib/python3.12/site-packages/pandas/core/internals/construction.py", line 667, in _extract_index
[Fri Oct  4 13:06:15 2024]
Error in rule collect_taxa_summaries:
    jobid: 0
    input: [# list of all json files e.g. data/buccal_mucosa/assembly_finder/json/587.json]
    output: data/buccal_mucosa/assembly_finder/genome_summaries.json

Minimal example

mess hmp-template --site buccal_mucosa -o outdir --threads 8

Additional context The process completed 507 of 520 steps (98%) before encountering this error and exiting.

farchaab commented 1 month ago

Hello @teojcryan and thank you for using mess !

By default, assembly_finder searches for reference genomes for each taxa, which sometimes, aren't available (taxa with no reference genomes yield empty json files which cause assembly_finder to crash).

A quick solution would be to change your command to:

mess hmp-template --site buccal_mucosa -o outdir --threads 8 --reference False

Alternatively, I replaced taxids with genome accessions, so that downloads are faster and reproducible in this branch.

Can you try to re-install mess from this branch and let me know if it fixed your issue ?

teojcryan commented 1 month ago

Thanks @farchaab!

I've re-installed mess from the 'hmp-template' branch and the downloads do seem to be a lot faster. However, I'm running into a new but related error which seems to stem from an attempt to convert non-finite values (like NA or inf) to integers during a Pandas astype() operation. Specifically, it's failing on the column 'number_of_contigs', which likely contains NaNs or infinite values, causing the type casting to integers to fail.

localcheckpoint calculate_genome_coverages:
    input: data/buccal_mucosa/replicates.tsv, data/buccal_mucosa/assembly_finder/assembly_summary.tsv
    output: data/buccal_mucosa/coverages.tsv
    jobid: 8
    reason: Missing output files: <TBD>
    resources: tmpdir=/tmp, mem_mb=2000, mem_mib=1908, mem=2000MB, time=00:00:10
DAG of jobs will be updated after completion.

Traceback (most recent call last):
  File "/path/.snakemake/scripts/tmpeh4fbcap.calculate_cov.py", line 185, in <module>
    df = df.astype(
         ^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/generic.py", line 6620, in astype
    res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
    return self.apply(
           ^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 101, in _astype_nansafe
    return _astype_float_to_int_nansafe(arr, dtype, copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 145, in _astype_float_to_int_nansafe
    raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'number_of_contigs'
RuleException:
CalledProcessError in file /path/tools/MeSS/mess/workflow/rules/processing/coverages.smk, line 38:
Command 'set -euo pipefail;  /path/.conda/envs/snakemake/bin/python3.12 /path/.snakemake/scripts/tmpeh4fbcap.calculate_cov.py' returned non-zero exit status 1.
[Tue Oct  8 10:24:17 2024]
Error in rule calculate_genome_coverages:
    jobid: 8
    input: data/buccal_mucosa/replicates.tsv, data/buccal_mucosa/assembly_finder/assembly_summary.tsv
    output: data/buccal_mucosa/coverages.tsv

Also appreciate that you've replaced the taxids with genome accessions - can I check if you've done this for all the hmp-template sites or just the buccal mucosa? Ideally I'd like to be working with the gut site data. Just checked and saw that this was updated for the gut site too - thanks!

farchaab commented 1 month ago

Thanks for testing out the branch @teojcryan !

I am glad that the downloads are faster !

As you said, the bug in the rule calculate_genome_coverages stems from having NaN in the number_of_contigs column. Sometimes, this information is missing from the assembly_summary.tsv table.

Since this column is not useful for read simulation, I removed it with this commit e9ab8698f474b66a9eaa5984ae2031872c824572 (You can still see the contigs number in assembly_finder/assembly_summary.tsv).

Let me know if this solved your issue !

teojcryan commented 1 month ago

Thanks for the quick response @farchaab, appreciate it.

The commit has resolved the issue with the number_of_contigs column but now the same issue is coming up with the total_sequence_length column - I'm guessing the fix would be similar!

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'total_sequence_length'
farchaab commented 1 month ago

Huh, that should not happen !

total_sequence_length is actually used for the coverage calculation, and should not be empty.

Can you have a look at assembly_summary.tsv and see which rows contain NAs ?

Also is this for the buccal_mucosa samples or the gut ?

Can you try to remove the output directory and re-run the pipeline ?

teojcryan commented 1 month ago

I've cleared the output directory and re-ran the pipeline on both buccal_mucosa and gut samples separately, but still ran into the same error on total_sequence_length. I've also inspected both assembly_summary.tsv files and can't find any rows that contain NAs.

farchaab commented 1 month ago

Hello @teojcryan, I see that you use mess with conda, I will try to reproduce the error by deploying environments via conda.

In the meantime, you can try the command below, which ran without issues for me (mess installed from the hmp-template branch, requires apptainer):

mess hmp-template --site buccal_mucosa --threads 8 --sdm apptainer
teojcryan commented 1 month ago

Thanks @farchaab! I completed the run successfully using apptainer.