rhysnewell / aviary

A hybrid assembly and MAG recovery pipeline (and more!)
GNU General Public License v3.0
82 stars 12 forks source link

TypeError in Aviary assemble with multiple short-reads #71

Closed AroneyS closed 1 year ago

AroneyS commented 2 years ago

Aviary v0.5.0 coassembly error after spades finishes.

TypeError: '>' not supported between instances of 'TBDString' and 'int'

Simplified command:

aviary assemble \
  -1 sample_1.1.fq.gz sample_2.1.fq.gz \
  -2 sample_1.2.fq.gz sample_2.2.fq.gz \
  --output coassembly_4/assemble \
  -n 64 \
  -m 500
Thank you for using SPAdes!
[Wed Oct 12 03:00:51 2022]
Finished job 2.
1 of 3 steps (33%) done
Removing temporary output data/short_read_assembly.
Select jobs to execute...
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/__init__.py", line 730, in snakemake
    success = workflow.execute(
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/workflow.py", line 1074, in execute
    raise e
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/workflow.py", line 1070, in execute
    success = self.scheduler.schedule()
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 492, in schedule
    run = self.job_selector(needrun)
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 639, in job_selector_ilp
    return self.job_selector_greedy(jobs)
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 825, in job_selector_greedy
    a = list(map(self.job_weight, jobs))  # resource usage of jobs
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 899, in job_weight
    return [
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 900, in <listcomp>
    self.calc_resource(name, res.get(name, 0)) for name in self.global_resources
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 879, in calc_resource
    if value > gres:
TypeError: '>' not supported between instances of 'TBDString' and 'int'
10/12/2022 03:01:00 AM CRITICAL: Command 'snakemake --snakefile /mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.0/aviary/aviary/modules/Snakefile --directory results/cockatoo/coassemble/20220929/palsa/coassemble/coassemble/coassembly_4/assemble --jobs 64 --rerun-incomplete --configfile 'results/cockatoo/coassemble/20220929/palsa/coassemble/coassemble/coassembly_4/assemble/config.yaml' --nolock  --conda-frontend mamba --default-resources "tmpdir='/data1/pbs.3020244.pbs'" --resources mem_mb=512000   --use-conda --conda-prefix /mnt/hpccs01/work/microbiome/conda  complete_assembly' returned non-zero exit status 1.
wwood commented 2 years ago

Hi Sam,

I think this is fixed in the dev version, if you install using pip install -e . ?

rhysnewell commented 2 years ago

Hi Sam,

This bug has been eluding me for some time. I think it might be an issue with snakemake rather than aviary itself. Updating to latest snakemake versions may fix it, if not then following Ben's instructions should definitely fix it.

AroneyS commented 2 years ago

I'm still getting the error with the up-to-date version. I remade the conda env from aviary.yml and installed with pip install -e . from commit 471118a5178161a5e877d778d23487177164d5be

rhysnewell commented 2 years ago

What version of snakemake are you using?

rhysnewell commented 2 years ago

Also, can you check if https://github.com/rhysnewell/aviary/commit/30627a1eed496dd5e73f03ce147e0eea7db0ed7e fixes the issue for you?

AroneyS commented 2 years ago

Snakemake version 7.15.2 I'll give the new commit a go, thanks.

AroneyS commented 2 years ago

Unfortunately that did not work. I still get the same error.

wwood commented 1 year ago

Same issue with 30627a1ee "Potential fix to TBD comparison error" which currently the main branch.

(aviary-dev5)cl5n007:20221018:~/m/msingle/mess/112_targeted_recovery/avairy$ snakemake --version
7.16.0

Specific error was

Traceback (most recent call last):
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/__init__.py", line 736, in snakemake
    success = workflow.execute(
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/workflow.py", line 1081, in execute
    raise e
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/workflow.py", line 1077, in execute
    success = self.scheduler.schedule()
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/scheduler.py", line 519, in schedule
    run = self.job_selector(needrun)
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/scheduler.py", line 666, in job_selector_ilp
    return self.job_selector_greedy(jobs)
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/scheduler.py", line 852, in job_selector_greedy
    a = list(map(self.job_weight, jobs))  # resource usage of jobs
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/scheduler.py", line 926, in job_weight
    return [
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/scheduler.py", line 927, in <listcomp>
    self.calc_resource(name, res.get(name, 0)) for name in self.global_resources
  File "/home/woodcrob/e/aviary-dev5/lib/python3.10/site-packages/snakemake/scheduler.py", line 906, in calc_resource
    if value > gres:
TypeError: '>' not supported between instances of 'TBDString' and 'int'

Possibly a snakemake issue directly though? e.g. https://github.com/snakemake/snakemake/issues/1850 - what version are you running on where you don't see the issue?

rhysnewell commented 1 year ago

Yeah, I think this is a problem with how Snakemake handles the resource component of each rule. Sometimes it works and sometimes it doesn't, not sure what causes it to break. For the moment https://github.com/rhysnewell/aviary/commit/fd6ff27224a751d8a24d5cdf1136935b489d54e8 remove the use of resource flags in rules, hopefully that fixes the error

wwood commented 1 year ago

Thanks, but same issue there. Maybe need a small test dataset with 2 species at enough but different coverage so binners should work?

wwood commented 1 year ago

I'm wondering if this is related to something else we observe - the short read assembly is removed before this error happens

$ aviary assemble -1 ~/m/msingle/mess/112_targeted_recovery/reads/{}_1.fastq.gz -2 ~/m/msingle/mess/112_targeted_recovery/reads/{}_2.fastq.gz -o {}.megahit_paired -t 32 -m 250 -n 32 --dry-run
...
[Wed Oct 19 06:34:47 2022]
rule assemble_short_reads:
    input: /home/woodcrob/m/msingle/mess/112_targeted_recovery/reads/ERR3139990_1.fastq.gz
    output: data/short_read_assembly/scaffolds.fasta, data/short_read_assembly
    jobid: 2
    benchmark: benchmarks/short_read_assembly_short.benchmark.txt
    reason: Missing output files: data/short_read_assembly/scaffolds.fasta
    threads: 32
    resources: tmpdir=/tmp

Would remove temporary output data/short_read_assembly

[Wed Oct 19 06:34:47 2022]
rule move_spades_assembly:
    input: data/short_read_assembly/scaffolds.fasta
    output: data/final_contigs.fasta
    jobid: 1
    reason: Missing output files: data/final_contigs.fasta; Input files updated by another job: data/short_read_assembly/scaffolds.fasta
    resources: tmpdir=/tmp

Would remove temporary output data/short_read_assembly shouldn't happen until after move_spades_assembly right?

wwood commented 1 year ago

(running now with the temp() in the assembly snakemake removed, will see)

AroneyS commented 1 year ago

I also get the same error using fd6ff27224a751d8a24d5cdf1136935b489d54e8

wwood commented 1 year ago

Yeh, that was it. Just removing that temp() fixed the issue.

However, removing it, which is what I just did to close this issue, means that there are leftover files from short read assembly. @rhysnewell what do you suggest here?

I've also added a tiny test dataset so I could test this in <1min rather than on a real assembly which obviously takes longer. That's in the minimal-testing branch - PR to come in a sec

rhysnewell commented 1 year ago

Cool! Thanks for that, Ben. I'm going to say this was 50% my fault and 50% Snakemake's fault for deleting a folder earlier than it should. There are rules that make reference to that folder later in the DAG, but okay Snakemake.

As for what to do, https://github.com/rhysnewell/aviary/commit/157d73d8bef54ec603a4b2ab1e4d420c10406135 just deleting the folder manually once the assembly has been copied will give the desired behaviour and reduce disk usage

wwood commented 1 year ago

Yes, is annoying to have to deal with this manually. Does it mean we can revert the commit removing the resource bits of the yaml?

Also, for https://github.com/rhysnewell/aviary/commit/157d73d8bef54ec603a4b2ab1e4d420c10406135 - better && rather than ; to separate the two?

I'm a bit obsessed with exit statuses

wwood commented 1 year ago

Does it mean we can revert the commit removing the resource bits of the yaml?

Oh you already did this. Good good

rhysnewell commented 1 year ago

Also, for 157d73d - better && rather than ; to separate the two?

I'm a bit obsessed with exit statuses

I'm not particularly fussed, though there are some instances in the snakemake rules where ; is purposefully used because we want commands to run even if a previous command failed.