Open teojcryan opened 4 days ago
Hello @teojcryan and thank you for using mess
!
By default, assembly_finder
searches for reference genomes for each taxa, which sometimes, aren't available (taxa with no reference genomes yield empty json files which cause assembly_finder
to crash).
A quick solution would be to change your command to:
mess hmp-template --site buccal_mucosa -o outdir --threads 8 --reference False
Alternatively, I replaced taxids with genome accessions, so that downloads are faster and reproducible in this branch.
Can you try to re-install mess from this branch and let me know if it fixed your issue ?
Thanks @farchaab!
I've re-installed mess from the 'hmp-template' branch and the downloads do seem to be a lot faster. However, I'm running into a new but related error which seems to stem from an attempt to convert non-finite values (like NA or inf) to integers during a Pandas astype() operation. Specifically, it's failing on the column 'number_of_contigs', which likely contains NaNs or infinite values, causing the type casting to integers to fail.
localcheckpoint calculate_genome_coverages:
input: data/buccal_mucosa/replicates.tsv, data/buccal_mucosa/assembly_finder/assembly_summary.tsv
output: data/buccal_mucosa/coverages.tsv
jobid: 8
reason: Missing output files: <TBD>
resources: tmpdir=/tmp, mem_mb=2000, mem_mib=1908, mem=2000MB, time=00:00:10
DAG of jobs will be updated after completion.
Traceback (most recent call last):
File "/path/.snakemake/scripts/tmpeh4fbcap.calculate_cov.py", line 185, in <module>
df = df.astype(
^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/generic.py", line 6620, in astype
res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
return self.apply(
^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 101, in _astype_nansafe
return _astype_float_to_int_nansafe(arr, dtype, copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/.conda/envs/snakemake/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 145, in _astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'number_of_contigs'
RuleException:
CalledProcessError in file /path/tools/MeSS/mess/workflow/rules/processing/coverages.smk, line 38:
Command 'set -euo pipefail; /path/.conda/envs/snakemake/bin/python3.12 /path/.snakemake/scripts/tmpeh4fbcap.calculate_cov.py' returned non-zero exit status 1.
[Tue Oct 8 10:24:17 2024]
Error in rule calculate_genome_coverages:
jobid: 8
input: data/buccal_mucosa/replicates.tsv, data/buccal_mucosa/assembly_finder/assembly_summary.tsv
output: data/buccal_mucosa/coverages.tsv
Also appreciate that you've replaced the taxids with genome accessions - can I check if you've done this for all the hmp-template sites or just the buccal mucosa? Ideally I'd like to be working with the gut site data. Just checked and saw that this was updated for the gut site too - thanks!
Thanks for testing out the branch @teojcryan !
I am glad that the downloads are faster !
As you said, the bug in the rule calculate_genome_coverages
stems from having NaN in the number_of_contigs column. Sometimes, this information is missing from the assembly_summary.tsv
table.
Since this column is not useful for read simulation, I removed it with this commit e9ab8698f474b66a9eaa5984ae2031872c824572 (You can still see the contigs number in assembly_finder/assembly_summary.tsv
).
Let me know if this solved your issue !
Thanks for the quick response @farchaab, appreciate it.
The commit has resolved the issue with the number_of_contigs column but now the same issue is coming up with the total_sequence_length column - I'm guessing the fix would be similar!
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'total_sequence_length'
Thanks for developing and maintaining this tool!
MeSS version
mess, version 0.9.1 compiled from source as of 2024-10-24.
Describe the bug
Ran into the following errors when running
mess hmp-template
.Minimal example
Additional context The process completed 507 of 520 steps (98%) before encountering this error and exiting.