Closed teojcryan closed 3 weeks ago
Hello @teojcryan and thank you for using mess
!
By default, assembly_finder
searches for reference genomes for each taxa, which sometimes, aren't available (taxa with no reference genomes yield empty json files which cause assembly_finder
to crash).
A quick solution would be to change your command to:
mess hmp-template --site buccal_mucosa -o outdir --threads 8 --reference False
Alternatively, I replaced taxids with genome accessions, so that downloads are faster and reproducible in this branch.
Can you try to re-install mess from this branch and let me know if it fixed your issue ?
Thanks @farchaab!
I've re-installed mess from the 'hmp-template' branch and the downloads do seem to be a lot faster. However, I'm running into a new but related error which seems to stem from an attempt to convert non-finite values (like NA or inf) to integers during a Pandas astype() operation. Specifically, it's failing on the column 'number_of_contigs', which likely contains NaNs or infinite values, causing the type casting to integers to fail.
localcheckpoint calculate_genome_coverages:
input: data/buccal_mucosa/replicates.tsv, data/buccal_mucosa/assembly_finder/assembly_summary.tsv
output: data/buccal_mucosa/coverages.tsv
jobid: 8
reason: Missing output files: <TBD>
resources: tmpdir=/tmp, mem_mb=2000, mem_mib=1908, mem=2000MB, time=00:00:10
DAG of jobs will be updated after completion.
Traceback (most recent call last):
File "/path/.snakemake/scripts/tmpeh4fbcap.calculate_cov.py", line 185, in <module>
df = df.astype(
^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/generic.py", line 6620, in astype
res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
return self.apply(
^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 101, in _astype_nansafe
return _astype_float_to_int_nansafe(arr, dtype, copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 145, in _astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'number_of_contigs'
RuleException:
CalledProcessError in file /path/tools/MeSS/mess/workflow/rules/processing/coverages.smk, line 38:
Command 'set -euo pipefail; /path/.conda/envs/snakemake/bin/python3.12 /path/.snakemake/scripts/tmpeh4fbcap.calculate_cov.py' returned non-zero exit status 1.
[Tue Oct 8 10:24:17 2024]
Error in rule calculate_genome_coverages:
jobid: 8
input: data/buccal_mucosa/replicates.tsv, data/buccal_mucosa/assembly_finder/assembly_summary.tsv
output: data/buccal_mucosa/coverages.tsv
Also appreciate that you've replaced the taxids with genome accessions - can I check if you've done this for all the hmp-template sites or just the buccal mucosa? Ideally I'd like to be working with the gut site data. Just checked and saw that this was updated for the gut site too - thanks!
Thanks for testing out the branch @teojcryan !
I am glad that the downloads are faster !
As you said, the bug in the rule calculate_genome_coverages
stems from having NaN in the number_of_contigs column. Sometimes, this information is missing from the assembly_summary.tsv
table.
Since this column is not useful for read simulation, I removed it with this commit e9ab8698f474b66a9eaa5984ae2031872c824572 (You can still see the contigs number in assembly_finder/assembly_summary.tsv
).
Let me know if this solved your issue !
Thanks for the quick response @farchaab, appreciate it.
The commit has resolved the issue with the number_of_contigs column but now the same issue is coming up with the total_sequence_length column - I'm guessing the fix would be similar!
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'total_sequence_length'
Huh, that should not happen !
total_sequence_length
is actually used for the coverage calculation, and should not be empty.
Can you have a look at assembly_summary.tsv
and see which rows contain NAs ?
Also is this for the buccal_mucosa samples or the gut ?
Can you try to remove the output directory and re-run the pipeline ?
I've cleared the output directory and re-ran the pipeline on both buccal_mucosa and gut samples separately, but still ran into the same error on total_sequence_length
. I've also inspected both assembly_summary.tsv
files and can't find any rows that contain NAs.
Hello @teojcryan, I see that you use mess
with conda, I will try to reproduce the error by deploying environments via conda.
In the meantime, you can try the command below, which ran without issues for me (mess
installed from the hmp-template branch, requires apptainer):
mess hmp-template --site buccal_mucosa --threads 8 --sdm apptainer
Thanks @farchaab! I completed the run successfully using apptainer.
Thanks for developing and maintaining this tool!
MeSS version
mess, version 0.9.1 compiled from source as of 2024-10-24.
Describe the bug
Ran into the following errors when running
mess hmp-template
.Minimal example
Additional context The process completed 507 of 520 steps (98%) before encountering this error and exiting.