rhysnewell / aviary

A hybrid assembly and MAG recovery pipeline (and more!)
GNU General Public License v3.0

EmptyDataError in Aviary recover with long and short reads #73

Closed AroneyS closed 1 year ago

AroneyS commented 2 years ago

Aviary v0.5.3 error in finalize_stats rule. 27/29 steps done, so I guess this is the last job and the other results are fine to use?

Simplified command (recovery from long-read assembly using 20 short reads and 2 long reads):

aviary recover --assembly 719_E1_20-24.ccs.filter.fasta -1 MainAutochamber.201907_E_1_30to34.1.fq.gz ... -2 MainAutochamber.201907_E_1_30to34.2.fq.gz ... --longreads 719_E1_1-5.ccs.filter.fastq.gz 719_E1_20-24.ccs.filter.fastq.gz --longread-type ccs --output results/aviary/binning/long/20221013/719_E1_20-24.ccs.filter -n 64 -m 500

Error:

rule finalize_stats:
    input: bins/checkm.out, bins/checkm2_output/quality_report.tsv, data/coverm_abundances.tsv, data/gtdbtk/done
    output: bins/bin_info.tsv, bins/checkm_minimal.tsv
    jobid: 1
    reason: Missing output files: bins/bin_info.tsv; Input files updated by another job: data/coverm_abundances.tsv, bins/checkm2_output/quality_report.tsv, data/gtdbtk/done, bins/checkm.out
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/data1/tmp

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, disk_mb=1000
Select jobs to execute...
[Fri Oct 14 07:38:30 2022]
Error in rule finalize_stats:
    jobid: 0
    output: bins/bin_info.tsv, bins/checkm_minimal.tsv

RuleException:
EmptyDataError in line 715 of /mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk:
No columns to parse from file
  File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk", line 715, in __rule_finalize_stats
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1747, in _make_engine
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 92, in __init__
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

rhysnewell commented 2 years ago

Were the CheckM results empty?

AroneyS commented 2 years ago

No. They were as expected

rhysnewell commented 2 years ago

Oh this looks like it is complaining about the coverm_abundances.tsv file, was that empty?

AroneyS commented 2 years ago

Yes, coverm_abundances.tsv is indeed empty, and so is short_abundances.tsv. The other files (coverm.cov, coverm.filt.cov, long_abundances.tsv, long_cov.tsv and short_cov.tsv) are not empty.
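The "No columns to parse from file" message in the traceback is what pandas reports when read_csv is given a file with no content, so a zero-byte coverm_abundances.tsv would explain the failure in finalize_stats. A minimal sketch for spotting such files before the final rule runs follows. The helper name find_empty_files and the demo file contents are hypothetical and not part of Aviary:

```python
from pathlib import Path
import tempfile


def find_empty_files(directory, pattern="*.tsv"):
    """Return the sorted names of zero-byte files matching `pattern` in `directory`."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if p.stat().st_size == 0
    )


# Demo on a throwaway directory that mimics the situation reported above.
# The non-empty file only contains placeholder text, not real CoverM output.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "coverm_abundances.tsv").write_text("")
    (data / "short_abundances.tsv").write_text("")
    (data / "long_abundances.tsv").write_text("placeholder\tcontent\n")
    empty = find_empty_files(data)

print(empty)
```

Pointing the helper at the real data/ directory of an Aviary output folder would list the files that need regenerating.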

rhysnewell commented 2 years ago

Does coverm.cov have the short read information? And can you find any error information for the get_abundances rule in the snakemake log?

AroneyS commented 2 years ago

Yes, coverm.cov does have short read information. I can't see any error information for get_abundances.

[Fri Oct 14 06:31:13 2022]
rule get_abundances:
    input: bins/checkm.out
    output: data/coverm_abundances.tsv
    jobid: 25
    reason: Missing output files: data/coverm_abundances.tsv; Input files updated by another job: bins/checkm.out
    threads: 8
    resources: mem_mb=512000, disk_mb=1000, tmpdir=/data1/tmp

Activating conda environment: ../../../../../../../../../mnt/hpccs01/work/microbiome/conda/66a8b59755f121e40e3a82a9714b3ad5
[Fri Oct 14 06:50:20 2022]
Finished job 25.
25 of 29 steps (86%) done
Select jobs to execute...

rhysnewell commented 2 years ago

Has it happened with any other samples? Nothing is jumping out at me that would cause it to fail here.

AroneyS commented 2 years ago

I've done 18 assemblies (6 each of long-only, long+short, short-only). All 10 that have finished recover so far have this error.

rhysnewell commented 2 years ago

Okay, this isn't reproducible with the test data that Ben generated. Is this only occurring when you have both long and short reads?

Could you also provide the complete list of rules that aviary is attempting to complete?

AroneyS commented 2 years ago

I haven't tried with only long or only short yet but I can give that a go.

job                      count    min threads    max threads
---------------------  -------  -------------  -------------
checkm2                      1              8              8
checkm_das_tool              1              8              8
checkm_metabat2              1              8              8
checkm_rosella               1              8              8
checkm_semibin               1              8              8
concoct                      1              8              8
das_tool                     1              8              8
finalize_stats               1              1              1
get_abundances               1              8              8
get_bam_indices              1              8              8
gtdbtk                       1              8              8
maxbin2                      1              8              8
metabat2                     1              8              8
metabat_sens                 1              8              8
metabat_spec                 1              8              8
metabat_ssens                1              8              8
metabat_sspec                1              8              8
prepare_binning_files        1              8              8
recover_mags                 1              8              8
refine_dastool               1              8              8
refine_metabat2              1              8              8
refine_rosella               1              8              8
refine_semibin               1              8              8
rosella                      1              8              8
semibin                      1              8              8
singlem_appraise             1              8              8
singlem_pipe_reads           1              1              1
vamb                         1              8              8
vamb_jgi_filter              1              8              8
total                       29              1              8

rhysnewell commented 2 years ago

> I haven't tried with only long or only short yet but I can give that a go.

This doesn't make sense with my understanding of this:

> I've done 18 assemblies (6 each of long-only, long+short, short-only). All 10 that have finished recover so far have this error.

Wouldn't some of the ones that have finished have to have been long or short only?

What you could try is deleting all the abundance files and then targeting finalize_stats, to see whether it reruns only the abundance rules. If it tries to run other rules, you can pass " --rerun-triggers mtime" to --snakemake-cmds to see if that stops the rest of the pipeline from rerunning in case the code has been updated.

AroneyS commented 2 years ago

Oh, I mean that the assemblies were done with short, long, and short+long reads, but the recovery was done with the same samples each time (for comparison). So recovery was always done with short+long.

Ok thanks.

AroneyS commented 1 year ago

This happened again with only short reads. I noticed that the real error is:

ERROR coverm::bam_generator] Not continuing since when input file pairs have unequal numbers of reads this usually means incorrect / corrupt files were specified

It looks like the forward/reverse reads given to CoverM are mismatched (from different samples). I double-checked and they are specified correctly in the original command.
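One way to confirm this kind of mismatch independently of CoverM is to count the reads in each forward/reverse pair, in the order the files are actually passed. The sketch below assumes plain four-line FASTQ records (no wrapped sequences). The helper names and the tiny synthetic files are hypothetical:

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count records in a FASTQ file (4 lines per record), gzipped or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def mismatched_pairs(forward, reverse):
    """Return (fwd, rev, n_fwd, n_rev) for each pair whose read counts differ."""
    bad = []
    for fwd, rev in zip(forward, reverse):
        n_fwd, n_rev = count_fastq_reads(fwd), count_fastq_reads(rev)
        if n_fwd != n_rev:
            bad.append((fwd, rev, n_fwd, n_rev))
    return bad


# Demo with synthetic data: sample "a" has 2 read pairs, sample "b" has 3.
record = "@r\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, n in [("a.1.fq.gz", 2), ("a.2.fq.gz", 2), ("b.1.fq.gz", 3), ("b.2.fq.gz", 3)]:
        paths[name] = os.path.join(tmp, name)
        with gzip.open(paths[name], "wt") as handle:
            handle.write(record * n)

    # Correctly paired lists: nothing is reported.
    ok = mismatched_pairs(
        [paths["a.1.fq.gz"], paths["b.1.fq.gz"]],
        [paths["a.2.fq.gz"], paths["b.2.fq.gz"]],
    )
    # Reverse list in a different order: both pairs now disagree.
    crossed = mismatched_pairs(
        [paths["a.1.fq.gz"], paths["b.1.fq.gz"]],
        [paths["b.2.fq.gz"], paths["a.2.fq.gz"]],
    )
    crossed_counts = [(n_fwd, n_rev) for _, _, n_fwd, n_rev in crossed]

print(ok, crossed_counts)
```

Equal counts do not prove that two files belong together, but unequal counts are enough to show that a pairing is wrong.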

AroneyS commented 1 year ago

The order of short_reads_2 in the config doesn't match that of short_reads_1, and neither matches the order in the initial command.

AroneyS commented 1 year ago

Might be due to the set() conversion from commit 4eaefb4b35faec0d77cfa3979f44212227cb7d40
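If a set() conversion is indeed the cause, the mechanism would be that Python sets do not preserve insertion order, so converting the forward and reverse lists separately can leave them in different orders. A minimal sketch of an order-preserving alternative follows. It has not been checked against the commit, and the helper name dedupe_keep_order and the file names are made up for illustration:

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7).
    """
    return list(dict.fromkeys(items))


# Hypothetical read lists with one sample accidentally listed twice.
forward = ["s3.1.fq.gz", "s1.1.fq.gz", "s2.1.fq.gz", "s1.1.fq.gz"]
reverse = ["s3.2.fq.gz", "s1.2.fq.gz", "s2.2.fq.gz", "s1.2.fq.gz"]

fwd = dedupe_keep_order(forward)
rev = dedupe_keep_order(reverse)

# Each forward file is still paired with the reverse file of the same sample.
in_sync = all(f.replace(".1.fq", ".2.fq") == r for f, r in zip(fwd, rev))

# By contrast, list(set(forward)) and list(set(reverse)) come back in an
# arbitrary order that can differ between the two lists and between runs,
# because string hashing is randomised per process by default.
print(fwd, rev, in_sync)
```

Deduplicating forward and reverse files together as (forward, reverse) tuples would be another way to keep the pairs aligned.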