jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0

[Bug] DAS_Tool starts but fails after calculating contig lengths (binning-prokaryotic.py) #45

Closed abissett closed 5 months ago

abissett commented 6 months ago

Describe the bug: binning-prokaryotic.py fails at the 7__dastool step. DAS_Tool appears to start but fails after calculating contig lengths.

Versions: veba_binning-prokaryotic_1.4.1.sif and the equivalent conda install/env

Command used to produce error: When running veba_binning-prokaryotic using the container veba_binning-prokaryotic_1.4.1.sif, DAS_Tool (step 7) does not complete, causing the workflow to fail.

I'm setting all of the inputs as per the docs:

export VEBA_DATABASE=/scratch3/bis068/veba/db

N_JOBS=32
N_ITER=1 # set to 1 to make the error show faster; usually set to 10 as per the docs
ID=548348

OUT_DIR=veba_output/binning/prokaryotic/

FASTA=veba_output/binning/viral/${ID}/output/unbinned.fasta
BAM=veba_output/assembly/${ID}/output/mapped.sorted.bam

and then the command used to run the workflow module was:

singularity run veba_binning-prokaryotic_1.4.1.sif binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -o ${OUT_DIR} -m 1500 -I ${N_ITER} --skip_maxbin2

It's kind of weird: it looks like DAS_Tool starts running and then for some reason stops. I had the same issue when running via the conda environments. I switched to the containers, since conda envs become kind of complicated on our HPC, and thought this might solve the problem (or really just avoid it, I guess), but it didn't.

The previous steps all seem to run without problems using the containers (preprocess, assembly, bin-viral).

Log files: return codes for all prior steps (1 to 6) are "0".

7__dastool.e.txt 7__dastool.o.txt 7__dastool.returncode.txt
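
To find the first failing step without opening each log by hand, the per-step return code files can be scanned in step order. This is a minimal sketch, assuming the log directory holds files named like `7__dastool.returncode.txt` (as in the attachments above) and that each file contains a single integer. The function name `first_failed_step` and the directory argument are hypothetical and not part of VEBA.

```python
from pathlib import Path

SUFFIX = ".returncode.txt"


def first_failed_step(log_dir):
    """Return (step_name, returncode) for the first non-zero step, else None.

    Assumes files named '<N>__<step>.returncode.txt', each holding one integer.
    """
    def step_number(path):
        # '7__dastool.returncode.txt' -> 7
        return int(path.name.split("__", 1)[0])

    # Keep only files whose prefix before '__' is a step number.
    files = [
        p for p in Path(log_dir).glob("*__*" + SUFFIX)
        if p.name.split("__", 1)[0].isdigit()
    ]
    # Sort numerically so that step 10 comes after step 7, not before step 2.
    for path in sorted(files, key=step_number):
        code = int(path.read_text().strip())
        if code != 0:
            return path.name[: -len(SUFFIX)], code
    return None
```

Sorting on the numeric prefix rather than the file name matters here because the module has more than nine steps.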

jolespin commented 6 months ago

I can take a look at this tomorrow.

Can you try with sample S1 here?

https://zenodo.org/records/10094990

You'll need:

https://zenodo.org/records/10094990/files/S1_scaffolds.fasta.gz?download=1 https://zenodo.org/records/10094990/files/S1_mapped.sorted.bam?download=1

Is it possible for you to upload the scaffolds and mapped.sorted.bam so I can try them out tomorrow (maybe by e-mail instead of posting a link here)?

Can you try the following?

abissett commented 6 months ago

I'll upload the data to a Dropbox folder and send you a link, as it's too big for email. Maybe it's not needed, though, since I can reproduce the error with the S1 dataset, as described below.

  1. I have run dastool using the commands generated by the container and my dataset, but using my HPC instance of dastool (module load dastool, v1.1.3), submitted as:

S2B=$('/scratch3/bis068/conda_envs/vebaEnv1.2.0/envs/VEBA-binning-prokaryotic_env/bin/check_scaffolds_to_bins.py' -i /scratch3/bis068/VEBA_cont/veba_output/binning/prokaryotic_test/58348/intermediate/3__binning_metabat2/scaffolds_to_bins.tsv,/scratch3/bis068/VEBA_cont/veba_output/binning/prokaryotic_test/58348/intermediate/4__binning_maxbin2-107/scaffolds_to_bins.tsv,/scratch3/bis068/VEBA_cont/veba_output/binning/prokaryotic_test/58348/intermediate/5__binning_maxbin2-40/scaffolds_to_bins.tsv,/scratch3/bis068/VEBA_cont/veba_output/binning/prokaryotic_test/58348/intermediate/6__binning_concoct/scaffolds_to_bins.tsv -n metabat2,maxbin2-107,maxbin2-40,concoct)
IFS=" " read -r -a S2B_ARRAY <<< "$S2B"

dastool --bins ${S2B_ARRAY[0]} --contigs ../veba_output/binning/viral/58348/output/unbinned.fasta --outputbasename ../veba_output/binning/prokaryotic_test_HPC/58348/intermediate/7__dastool/_ --labels ${S2B_ARRAY[1]} --search_engine diamond --score_threshold 0.1 --write_bins 1 --create_plots 0 --threads 32 --proteins ../veba_output/binning/prokaryotic_test/58348/intermediate/2__pyrodigal/gene_models.faa --debug

This completed as expected.

successfully finished calculating contig lengths.
evaluating bin-sets
starting bin selection from 634 bins
bin selection complete: 282 bins above score threshold selected.
extracting bins to ../veba_output/binning/prokaryotic_test_HPC/58348/intermediate/7__dastool/__DASTool_bins
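
Since the failing run stops right after the contig-length message, it can help to check which stage a given DAS_Tool log actually reached. This is a minimal sketch, assuming the stage messages match the ones printed above (other DAS_Tool versions may word them differently). `STAGES` and `last_stage_reached` are illustrative names, not part of DAS_Tool or VEBA.

```python
# Stage messages in the order they appear in the successful run above.
STAGES = [
    "successfully finished calculating contig lengths",
    "evaluating bin-sets",
    "starting bin selection",
    "bin selection complete",
    "extracting bins to",
]


def last_stage_reached(log_text):
    """Return the latest stage message found in the log text, or None."""
    lowered = log_text.lower()
    reached = None
    for stage in STAGES:
        if stage in lowered:
            reached = stage
    return reached
```

Running this on the stdout log of a failed run and of a successful run shows at a glance where the two diverge.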

I tried running only the DAS_Tool command as above from the VEBA conda env (I was struggling to do it interactively in the container and ran out of time; I can do this if you want me to, but the error seems pretty "stable"), but it failed as before. It appears to have something to do with the DAS_Tool setup in the container/env rather than DAS_Tool per se, since it completed with my system-installed DAS_Tool.

I then noticed that you're using DAS_Tool 1.1.2, while my system used 1.1.3. I re-ran the above using 1.1.2 and was able to replicate the error on my system (using DAS_Tool independent of VEBA). Maybe it's something to do with the version?

Maybe updating the container to use the later version would fix the issue (although that would not explain why it's occurring)?

  2. I ran the S1 test data. I ran it through preprocess, assembly, virbin and then prokbin. This failed in the same way as my "real" dataset. See attached logs (S1_n32). logs.zip

  3. I repeated (2) specifying N_JOBS=1. This failed with the same error as (2). Logs are attached (S1_n1).

Let me know what else to do on my end, thanks for the help!

jolespin commented 6 months ago

This is very helpful, thank you. Trying to walk through this to diagnose:

Here you're using 1.2.0: /scratch3/bis068/conda_envs/vebaEnv1.2.0/

and here you're using 1.4.1 (https://hub.docker.com/r/jolespin/veba_binning-prokaryotic/tags):

singularity run veba_binning-prokaryotic_1.4.1.sif binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -o ${OUT_DIR} -m 1500 -I ${N_ITER} --skip_maxbin2

I'll try with the singularity image. Can you share the command you used to build the singularity image from the docker image?

jolespin commented 6 months ago

Hmm...so I just did a fresh install for this module using the following yml (https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-prokaryotic_env.yml):

name: VEBA-binning-prokaryotic_env__v2023.7.7
channels:
  - conda-forge
  - bioconda
  - jolespin
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - _pytorch_select=0.1=cpu_0
  - _r-mutex=1.0.1=anacondar_1
  - abseil-cpp=20200923.3=h9c3ff4c_0
  - aiohttp=3.8.3=py38h0a891b7_1
  - aiosignal=1.3.1=pyhd8ed1ab_0
  - astor=0.8.1=pyh9f0ad1d_0
  - astunparse=1.6.3=pyhd8ed1ab_0
  - async-timeout=4.0.2=pyhd8ed1ab_0
  - attrs=22.2.0=pyh71513ae_0
  - backports=1.1=pyhd3eb1b0_0
  - backports.zoneinfo=0.2.1=py38h497a2fe_4
  - barrnap=0.9=hdfd78af_4
  - bedtools=2.30.0=h468198e_3
  - binutils_impl_linux-64=2.36.1=h193b22a_2
  - binutils_linux-64=2.36=hf3e587d_7
  - biopython=1.79=py38h497a2fe_1
  - blas=1.0=mkl
  - blast=2.14.0=h7d5a4b4_1
  - blinker=1.5=pyhd8ed1ab_0
  - boost=1.70.0=py38h9de70de_1
  - boost-cpp=1.70.0=h7b93d67_3
  - bowtie2=2.4.5=py38h72fc82f_0
  - brotli=1.0.9=h7f98852_6
  - brotli-bin=1.0.9=h7f98852_6
  - brotlipy=0.7.0=py38h497a2fe_1003
  - bwa=0.7.17=h7132678_9
  - bwidget=1.9.14=ha770c72_1
  - bz2file=0.98=py_0
  - bzip2=1.0.8=h7f98852_4
  - c-ares=1.18.1=h7f98852_0
  - ca-certificates=2023.5.7=hbcca054_0
  - cachetools=4.2.4=pyhd8ed1ab_0
  - cairo=1.16.0=h9f066cc_1006
  - capnproto=0.9.1=h780b84a_5
  - certifi=2023.5.7=pyhd8ed1ab_0
  - cffi=1.15.0=py38h3931269_0
  - charset-normalizer=2.0.12=pyhd8ed1ab_0
  - click=8.1.3=unix_pyhd8ed1ab_2
  - colorama=0.4.4=pyh9f0ad1d_0
  - concoct=1.1.0=py38h7be5676_2
  - coreutils=9.3=h0b41bf4_0
  - coverm=0.4.0=hc216eb9_2
  - cryptography=36.0.0=py38h9ce1e76_0
  - curl=7.76.1=h979ede3_1
  - cycler=0.11.0=pyhd8ed1ab_0
  - cython=0.29.28=py38h709712a_0
  - das_tool=1.1.2=r40hdfd78af_2
  - dashing=0.4.0=h735f0e5_3
  - dendropy=4.5.2=pyh3252c3a_0
  - diamond=2.0.8=h56fc30b_0
  - entrez-direct=16.2=he881be0_0
  - expat=2.4.7=h27087fc_0
  - fastani=1.3=he1c1bb9_0
  - fasttree=2.1.11=hec16e2b_1
  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
  - font-ttf-inconsolata=3.000=h77eed37_0
  - font-ttf-source-code-pro=2.038=h77eed37_0
  - font-ttf-ubuntu=0.83=hab24e00_0
  - fontconfig=2.13.96=h8e229c2_2
  - fonts-conda-ecosystem=1=0
  - fonts-conda-forge=1=0
  - fonttools=4.30.0=py38h0a891b7_0
  - fraggenescan=1.31=hec16e2b_4
  - freetype=2.11.0=h70c0345_0
  - fribidi=1.0.10=h36c2ea0_0
  - frozenlist=1.3.3=py38h0a891b7_0
  - gast=0.3.3=py_0
  - gawk=5.1.0=h7f98852_0
  - gcc_impl_linux-64=9.4.0=h03d3576_13
  - gcc_linux-64=9.4.0=h391b98a_7
  - genopype=2023.5.15=py_0
  - gettext=0.21.0=hf68c758_0
  - gfortran_impl_linux-64=9.4.0=h0003116_13
  - gfortran_linux-64=9.4.0=hf0ab688_7
  - giflib=5.2.1=h36c2ea0_2
  - gmp=6.2.1=h58526e2_0
  - google-auth=1.35.0=pyh6c4a22f_0
  - google-auth-oauthlib=0.4.6=pyhd8ed1ab_0
  - google-pasta=0.2.0=pyh8c360ce_0
  - graphite2=1.3.14=h23475e2_0
  - grpc-cpp=1.36.4=hf89561c_1
  - grpcio=1.36.1=py38hdd6454d_0
  - gsl=2.6=he838d99_2
  - gxx_impl_linux-64=9.4.0=h03d3576_13
  - gxx_linux-64=9.4.0=h0316aca_7
  - h5py=2.10.0=nompi_py38h9915d05_106
  - harfbuzz=2.7.2=ha5b49bf_1
  - hdf5=1.10.6=nompi_h6a2412b_1114
  - hmmer=3.1b2=3
  - htslib=1.10.2=hd3b49d5_1
  - icu=67.1=he1b5a44_0
  - idba=1.1.3=1
  - idna=3.3=pyhd8ed1ab_0
  - importlib-metadata=6.0.0=pyha770c72_0
  - infernal=1.1.4=pl5321h031d066_4
  - intel-openmp=2019.4=243
  - joblib=0.17.0=py_0
  - jpeg=9e=h7f98852_0
  - k8=0.2.5=hd03093a_2
  - keras-preprocessing=1.1.2=pyhd8ed1ab_0
  - kernel-headers_linux-64=2.6.32=he073ed8_15
  - keyutils=1.6.1=h166bdaf_0
  - kiwisolver=1.3.2=py38h1fd1430_1
  - krb5=1.17.2=h926e7f8_0
  - lcms2=2.12=hddcbb42_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - libblas=3.9.0=1_h86c2bf4_netlib
  - libbrotlicommon=1.0.9=h7f98852_6
  - libbrotlidec=1.0.9=h7f98852_6
  - libbrotlienc=1.0.9=h7f98852_6
  - libcblas=3.9.0=5_h92ddd45_netlib
  - libcurl=7.76.1=hc4aaa36_1
  - libdeflate=1.6=h516909a_0
  - libedit=3.1.20210714=h7f8727e_0
  - libev=4.33=h516909a_1
  - libffi=3.4.2=h7f98852_5
  - libgcc=7.2.0=h69d50b8_2
  - libgcc-devel_linux-64=9.4.0=hd854feb_13
  - libgcc-ng=12.2.0=h65d4601_19
  - libgfortran-ng=11.2.0=h69a702a_13
  - libgfortran5=11.2.0=h5c6108e_13
  - libglib=2.70.2=h174f98d_4
  - libgomp=12.2.0=h65d4601_19
  - libiconv=1.16=h516909a_0
  - libidn2=2.3.2=h7f98852_0
  - liblapack=3.9.0=5_h92ddd45_netlib
  - libllvm10=10.0.1=he513fc3_3
  - libmklml=2019.0.5=h06a4308_0
  - libnghttp2=1.47.0=h727a467_0
  - libnsl=2.0.0=h7f98852_0
  - libopenblas=0.3.21=pthreads_h78a6416_3
  - libpng=1.6.37=h21135ba_2
  - libprotobuf=3.15.8=h780b84a_1
  - libsanitizer=9.4.0=h79bfe98_13
  - libssh2=1.10.0=ha56f1ee_2
  - libstdcxx-devel_linux-64=9.4.0=hd854feb_13
  - libstdcxx-ng=12.2.0=h46fd767_19
  - libtiff=4.2.0=hbd63e13_2
  - libunistring=0.9.10=h7f98852_0
  - libuuid=2.32.1=h7f98852_1000
  - libwebp=1.2.2=h55f646e_0
  - libwebp-base=1.2.2=h7f98852_1
  - libxcb=1.14=h7b6447c_0
  - libxml2=2.9.10=h68273f3_2
  - libzlib=1.2.13=h166bdaf_4
  - lightgbm=3.3.5=py38h8dc9893_0
  - llvm-openmp=8.0.1=hc9558a2_0
  - lz4-c=1.9.3=h9c3ff4c_1
  - make=4.3=hd18ef5c_1
  - markdown=3.4.1=pyhd8ed1ab_0
  - markupsafe=2.1.2=py38h1de0b5d_0
  - mash=2.3=he348c14_1
  - matplotlib-base=3.5.1=py38hf4fb855_0
  - maxbin2=2.2.7=he1b5a44_1
  - metabat2=2.15=h986a166_1
  - minimap2=2.17=h5bf99c6_4
  - mkl=2020.2=256
  - multidict=6.0.4=py38h1de0b5d_0
  - munkres=1.1.4=pyh9f0ad1d_0
  - ncurses=6.2=h58526e2_4
  - ninja=1.11.0=h924138e_0
  - nose=1.3.7=py38h32f6830_1004
  - numpy=1.19.5=py38h8246c76_3
  - oauthlib=3.2.2=pyhd8ed1ab_0
  - openblas=0.3.21=pthreads_h320a7e8_3
  - openmp=8.0.1=0
  - openssl=1.1.1u=hd590300_0
  - opt_einsum=3.3.0=pyhd8ed1ab_1
  - packaging=21.3=pyhd8ed1ab_0
  - pandas=1.4.1=py38h43a58ef_0
  - pango=1.42.4=h69149e4_5
  - pathlib2=2.3.7.post1=py38h578d9bd_0
  - pcre=8.45=h9c3ff4c_0
  - pcre2=10.36=h032f7d1_1
  - perl=5.32.1=2_h7f98852_perl5
  - perl-archive-tar=2.40=pl5321hdfd78af_0
  - perl-base=2.23=pl5321hdfd78af_2
  - perl-business-isbn=3.007=pl5321hdfd78af_0
  - perl-business-isbn-data=20210112.006=pl5321hdfd78af_0
  - perl-carp=1.38=pl5321hdfd78af_4
  - perl-common-sense=3.75=pl5321hdfd78af_0
  - perl-compress-raw-bzip2=2.201=pl5321h87f3376_1
  - perl-compress-raw-zlib=2.105=pl5321h87f3376_0
  - perl-constant=1.33=pl5321hdfd78af_2
  - perl-data-dumper=2.183=pl5321hec16e2b_1
  - perl-digest-hmac=1.04=pl5321hdfd78af_0
  - perl-digest-md5=2.58=pl5321hec16e2b_1
  - perl-encode=3.19=pl5321hec16e2b_1
  - perl-encode-locale=1.05=pl5321hdfd78af_7
  - perl-exporter=5.72=pl5321hdfd78af_2
  - perl-exporter-tiny=1.002002=pl5321hdfd78af_0
  - perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0
  - perl-file-listing=6.15=pl5321hdfd78af_0
  - perl-file-spec=3.48_01=pl5321hdfd78af_2
  - perl-html-parser=3.81=pl5321h4ac6f70_1
  - perl-html-tagset=3.20=pl5321hdfd78af_4
  - perl-http-cookies=6.10=pl5321hdfd78af_0
  - perl-http-daemon=6.16=pl5321hdfd78af_0
  - perl-http-date=6.05=pl5321hdfd78af_0
  - perl-http-message=6.36=pl5321hdfd78af_0
  - perl-http-negotiate=6.01=pl5321hdfd78af_4
  - perl-io-compress=2.201=pl5321hdbdd923_2
  - perl-io-html=1.004=pl5321hdfd78af_0
  - perl-io-socket-ssl=2.074=pl5321hdfd78af_0
  - perl-io-zlib=1.14=pl5321hdfd78af_0
  - perl-json=4.10=pl5321hdfd78af_0
  - perl-json-xs=2.34=pl5321h4ac6f70_6
  - perl-libwww-perl=6.39=pl5321hdfd78af_1
  - perl-list-moreutils=0.430=pl5321hdfd78af_0
  - perl-list-moreutils-xs=0.430=pl5321h031d066_2
  - perl-lwp-mediatypes=6.04=pl5321hdfd78af_1
  - perl-lwp-simple=6.39=pl5321h9ee0642_5
  - perl-mime-base64=3.16=pl5321hec16e2b_2
  - perl-net-http=6.22=pl5321hdfd78af_0
  - perl-net-ssleay=1.92=pl5321h0e0aaa8_1
  - perl-ntlm=1.09=pl5321hdfd78af_5
  - perl-parent=0.236=pl5321hdfd78af_2
  - perl-pathtools=3.75=pl5321hec16e2b_3
  - perl-scalar-list-utils=1.62=pl5321hec16e2b_1
  - perl-socket=2.027=pl5321h031d066_4
  - perl-time-local=1.35=pl5321hdfd78af_0
  - perl-timedate=2.33=pl5321hdfd78af_2
  - perl-try-tiny=0.31=pl5321hdfd78af_1
  - perl-types-serialiser=1.01=pl5321hdfd78af_0
  - perl-uri=5.12=pl5321hdfd78af_0
  - perl-url-encode=0.03=pl5321h9ee0642_0
  - perl-www-robotrules=6.02=pl5321hdfd78af_4
  - perl-xsloader=0.24=pl5321hd8ed1ab_0
  - pillow=9.0.1=py38h22f2fdc_0
  - pip=22.0.4=pyhd8ed1ab_0
  - pixman=0.40.0=h36c2ea0_0
  - pplacer=1.1.alpha19=h9ee0642_2
  - prodigal=2.6.3=hec16e2b_4
  - protobuf=3.15.8=py38h709712a_0
  - pullseq=1.0.2=hbd632db_7
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.7=py_0
  - pycparser=2.21=pyhd8ed1ab_0
  - pyjwt=2.6.0=pyhd8ed1ab_0
  - pyopenssl=22.0.0=pyhd8ed1ab_0
  - pyparsing=3.0.7=pyhd8ed1ab_0
  - pyrodigal=2.1.0=py38he5da3d1_3
  - pysam=0.16.0.1=py38hbdc2ae9_1
  - pysocks=1.7.1=py38h578d9bd_4
  - python=3.8.12=hb7a2778_2_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python-flatbuffers=1.12=pyhd8ed1ab_1
  - python-tzdata=2021.5=pyhd8ed1ab_0
  - python_abi=3.8=2_cp38
  - pytorch=1.7.1=cpu_py38h6a09485_0
  - pytz=2021.3=pyhd8ed1ab_0
  - pytz-deprecation-shim=0.1.0.post0=py38h578d9bd_1
  - pyu2f=0.1.5=pyhd8ed1ab_0
  - r-assertthat=0.2.1=r40hc72bb7e_2
  - r-backports=1.4.1=r40hcfec24a_0
  - r-base=4.0.3=ha43b4e8_3
  - r-bitops=1.0_7=r40hcfec24a_0
  - r-brio=1.1.3=r40hcfec24a_0
  - r-callr=3.7.0=r40hc72bb7e_0
  - r-catools=1.18.2=r40h03ef668_0
  - r-cli=3.2.0=r40h03ef668_0
  - r-codetools=0.2_18=r40hc72bb7e_0
  - r-colorspace=2.0_3=r40h06615bd_0
  - r-crayon=1.5.0=r40hc72bb7e_0
  - r-data.table=1.14.2=r40hcfec24a_0
  - r-desc=1.4.0=r40hc72bb7e_0
  - r-diffobj=0.3.5=r40hcfec24a_0
  - r-digest=0.6.29=r40h03ef668_0
  - r-domc=1.3.8=r40ha770c72_0
  - r-ellipsis=0.3.2=r40hcfec24a_0
  - r-evaluate=0.15=r40hc72bb7e_0
  - r-fansi=1.0.2=r40hcfec24a_0
  - r-farver=2.1.0=r40h03ef668_0
  - r-foreach=1.5.2=r40hc72bb7e_0
  - r-ggplot2=3.3.5=r40hc72bb7e_0
  - r-glue=1.6.2=r40h06615bd_0
  - r-gplots=3.1.1=r40hc72bb7e_0
  - r-gtable=0.3.0=r40hc72bb7e_3
  - r-gtools=3.9.2=r40hcfec24a_0
  - r-isoband=0.2.5=r40h03ef668_0
  - r-iterators=1.0.14=r40hc72bb7e_0
  - r-jsonlite=1.8.0=r40h06615bd_0
  - r-kernsmooth=2.23_20=r40h742201e_0
  - r-labeling=0.4.2=r40hc72bb7e_1
  - r-lattice=0.20_45=r40hcfec24a_0
  - r-lifecycle=1.0.1=r40hc72bb7e_0
  - r-magrittr=2.0.2=r40hcfec24a_0
  - r-mass=7.3_55=r40hcfec24a_0
  - r-matrix=1.4_0=r40he454529_0
  - r-mgcv=1.8_39=r40h0154571_0
  - r-munsell=0.5.0=r40hc72bb7e_1004
  - r-nlme=3.1_155=r40h859d828_0
  - r-pillar=1.7.0=r40hc72bb7e_0
  - r-pkgconfig=2.0.3=r40hc72bb7e_1
  - r-pkgload=1.2.4=r40h03ef668_0
  - r-praise=1.0.0=r40hc72bb7e_1005
  - r-processx=3.5.2=r40hcfec24a_0
  - r-ps=1.6.0=r40hcfec24a_0
  - r-r6=2.5.1=r40hc72bb7e_0
  - r-rcolorbrewer=1.1_2=r40h785f33e_1003
  - r-rcpp=1.0.8.2=r40h7525677_0
  - r-rematch2=2.1.2=r40hc72bb7e_1
  - r-rlang=0.4.12=r40hcfec24a_0
  - r-rprojroot=2.0.2=r40hc72bb7e_0
  - r-rstudioapi=0.13=r40hc72bb7e_0
  - r-scales=1.1.1=r40hc72bb7e_0
  - r-testthat=3.1.2=r40h03ef668_0
  - r-tibble=3.1.6=r40hcfec24a_0
  - r-utf8=1.2.2=r40hcfec24a_0
  - r-vctrs=0.3.8=r40hcfec24a_1
  - r-viridislite=0.4.0=r40hc72bb7e_0
  - r-waldo=0.3.1=r40hc72bb7e_0
  - r-withr=2.5.0=r40hc72bb7e_0
  - re2=2021.04.01=h9c3ff4c_0
  - readline=8.1=h46c0cb4_0
  - requests=2.27.1=pyhd8ed1ab_0
  - requests-oauthlib=1.3.1=pyhd8ed1ab_0
  - rsa=4.9=pyhd8ed1ab_0
  - ruby=2.5.1=haf1161a_0
  - samtools=1.10=h2e538c0_3
  - scandir=1.10.0=py38h497a2fe_4
  - scikit-learn=0.23.2=py38h5d63f67_3
  - scipy=1.8.0=py38h56a6a73_1
  - sed=4.8=he412f7d_0
  - seqkit=2.4.0=h9ee0642_0
  - setuptools=60.9.3=py38h578d9bd_0
  - skorch=0.9.0=pyh7b7c402_0
  - snappy=1.1.9=hbd366e4_2
  - soothsayer_utils=2022.6.24=py_0
  - sqlite=3.38.0=hc218d9a_0
  - starcode=1.4=hec16e2b_2
  - subread=2.0.3=h7132678_1
  - sysroot_linux-64=2.12=he073ed8_15
  - tar=1.34=ha1f6473_0
  - tbb=2020.3=hfd86e86_0
  - tensorboard=2.4.1=pyhd8ed1ab_1
  - tensorboard-plugin-wit=1.8.1=pyhd8ed1ab_0
  - tensorflow=2.4.0=py38h578d9bd_0
  - tensorflow-base=2.4.0=py38h01d9eeb_0
  - tensorflow-estimator=2.4.0=pyh9656e83_0
  - threadpoolctl=3.1.0=pyh8a188c0_0
  - tk=8.6.12=h27826a3_0
  - tktable=2.10=hb7b940f_3
  - tqdm=4.54.1=pyhd8ed1ab_1
  - trnascan-se=2.0.12=pl5321h031d066_0
  - tzdata=2021e=he74cb21_0
  - tzlocal=4.1=py38h578d9bd_1
  - unicodedata2=14.0.0=py38h497a2fe_0
  - unzip=6.0=h7f98852_3
  - urllib3=1.26.8=pyhd8ed1ab_1
  - werkzeug=2.2.2=pyhd8ed1ab_0
  - wget=1.20.3=ha56f1ee_1
  - wheel=0.37.1=pyhd8ed1ab_0
  - xorg-kbproto=1.0.7=h7f98852_1002
  - xorg-libice=1.0.10=h7f98852_0
  - xorg-libsm=1.2.3=hd9c2040_1000
  - xorg-libx11=1.7.2=h7f98852_0
  - xorg-libxext=1.3.4=h7f98852_1
  - xorg-libxrender=0.9.10=h7f98852_1003
  - xorg-libxt=1.2.1=h7f98852_2
  - xorg-renderproto=0.11.1=h7f98852_1002
  - xorg-xextproto=7.3.0=h7f98852_1002
  - xorg-xproto=7.0.31=h7f98852_1007
  - xz=5.2.5=h516909a_1
  - yaml=0.1.7=h14c3975_1001
  - yarl=1.8.2=py38h0a891b7_0
  - zipp=3.11.0=pyhd8ed1ab_0
  - zlib=1.2.13=h166bdaf_4
  - zstd=1.4.9=ha95c52a_0
  - pip:
      - absl-py==0.15.0
      - checkm2==1.0.1
      - llvmlite==0.38.0
      - numba==0.55.1
      - six==1.15.0
      - tabulate==0.8.9
      - termcolor==1.1.0
      - tiara==1.0.2
      - typing-extensions==3.7.4.3
      - wrapt==1.12.1
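
Because the discussion hinges on which DAS_Tool build the environment pins, it can be useful to pull a single pin out of an exported environment file. This is a minimal sketch using plain string handling (no YAML library), assuming conda-style `- name=version=build` entries as in the listing above. `conda_pin` is a hypothetical helper, not part of conda or VEBA.

```python
def conda_pin(yml_text, package):
    """Return (version, build) for a '- name=version=build' entry, else None."""
    for line in yml_text.splitlines():
        entry = line.strip()
        if not entry.startswith("- "):
            continue
        parts = entry[2:].strip().split("=")
        # pip entries use 'name==version', which splits with an empty second
        # field; requiring a non-empty version skips them.
        if parts[0] == package and len(parts) >= 2 and parts[1]:
            build = parts[2] if len(parts) > 2 else None
            return parts[1], build
    return None
```

For the listing above, `conda_pin(text, "das_tool")` gives `('1.1.2', 'r40hdfd78af_2')`, which can then be compared with the version of a system-wide install.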

I installed the module like this:

# Create env
mamba env create -n test_env -f test.yml

# Get files (1.4.1 should be fine for the prok binning module)
VERSION="1.4.2"
wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz
tar -xvf v${VERSION}.tar.gz

# Copy the files into bin
cp -rf veba-1.4.2/src/*.py /expanse/projects/jcl110/miniconda3/envs/test_env/bin/
cp -rf veba-1.4.2/src/scripts /expanse/projects/jcl110/miniconda3/envs/test_env/bin/
ln -sf /expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/* /expanse/projects/jcl110/miniconda3/envs/test_env/bin/

Here's the test files:

# URLs are quoted so the shell does not treat "?" as a glob character
wget "https://zenodo.org/records/10094990/files/S1_scaffolds.fasta.gz?download=1" -O S1_scaffolds.fasta.gz
gzip -d S1_scaffolds.fasta.gz
wget "https://zenodo.org/records/10094990/files/S1_mapped.sorted.bam?download=1" -O S1_mapped.sorted.bam

I just ran it to completion using 1 thread and 20GB of memory:

(base) [jespinoz@exp-15-54 S1]$ conda activate test_env
(test_env) [jespinoz@exp-15-54 S1]$ binning-prokaryotic.py -f S1_scaffolds.fasta -b S1_mapped.sorted.bam -n S1  -m 1500 -I 1 --skip_maxbin2 --veba_database /expanse/projects/jcl110/db/veba/VDB_v6/
======================
binning-prokaryotic.py
======================
--------------
Configuration:
--------------
........
Name: S1
........
Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)  [GCC 9.4.0]
Python path: /expanse/projects/jcl110/miniconda3/envs/test_env/bin/python
GenoPype version: 2023.5.15
Script version: 2023.11.30
VEBA Database: /expanse/projects/jcl110/db/veba/VDB_v6/
Moment: 2024-01-16 18:54:31
Directory: /expanse/projects/jcl110/VEBA_v2_CaseStudies/Containers/S1
/tmp
Commands:
['/expanse/projects/jcl110/miniconda3/envs/test_env/bin/binning-prokaryotic.py', '-f', 'S1_scaffolds.fasta', '-b', 'S1_mapped.sorted.bam', '-n', 'S1', '-m', '1500', '-I', '1', '--skip_maxbin2', '--veba_database', '/expanse/projects/jcl110/db/veba/VDB_v6/']
------------------------------------------------------------------
Adding executables to path from the following source: CONDA_PREFIX
------------------------------------------------------------------
DAS_Tool --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/DAS_Tool
append_geneid_to_barrnap_gff.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/append_geneid_to_barrnap_gff.py'
append_geneid_to_prodigal_gff.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/append_geneid_to_prodigal_gff.py'
barrnap --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/barrnap
binning_wrapper.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/binning_wrapper.py'
check_scaffolds_to_bins.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/check_scaffolds_to_bins.py'
checkm2 --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/checkm2
compile_gff.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/compile_gff.py'
concatenate_dataframes.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/concatenate_dataframes.py'
concoct --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/concoct
concoct_coverage_table.py --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/concoct_coverage_table.py
consensus_domain_classification.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/consensus_domain_classification.py'
coverm --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/coverm
cut_up_fasta.py --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/cut_up_fasta.py
extract_fasta_bins.py --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/extract_fasta_bins.py
featureCounts --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/featureCounts
filter_checkm2_results.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/filter_checkm2_results.py'
merge_cutup_clustering.py --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/merge_cutup_clustering.py
metabat2 --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/metabat2
partition_gene_models.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/partition_gene_models.py'
pyrodigal --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/pyrodigal
scaffolds_to_bins.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/scaffolds_to_bins.py'
seqkit --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit
subset_table.py --> '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/subset_table.py'
tRNAscan-SE --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/tRNAscan-SE
tiara --> /expanse/projects/jcl110/miniconda3/envs/test_env/bin/tiara

===========================
. .. ... Compiling ... .. .
===========================
Step: 1, 1__coverage | log_prefix = 1__coverage | Calculating coverage for assembly via CoverM
Step: 2, 2__pyrodigal | log_prefix = 2__pyrodigal | Gene calls via Pyrodigal
Step: 3, 3__binning_metabat2 | log_prefix = 3__binning_metabat2 | Binning via MetaBat2 [Iteration=1]
Step: 4, 4__binning_maxbin2-107 | log_prefix = 4__binning_maxbin2-107 | [Skipping] Binning via MaxBin2 [Marker Set=107] [Iteration=1]
Step: 5, 5__binning_maxbin2-40 | log_prefix = 5__binning_maxbin2-40 | [Skipping] Binning via MaxBin2 [Marker Set=40] [Iteration=1]
Step: 6, 6__binning_concoct | log_prefix = 6__binning_concoct | Binning via CONCOCT [Iteration=1]
Step: 7, 7__dastool | log_prefix = 7__dastool | Evaluation via DAS_Tool [Iteration=1]
Step: 8, 8__checkm2 | log_prefix = 8__checkm2 | Evaluation via CheckM2 [Iteration=1]
Step: 9, 9__barrnap | log_prefix = 9__barrnap | Detecting rRNA genes
Step: 10, 10__trnascan-se | log_prefix = 10__trnascan-se | Detecting tRNA genes
Step: 11, 11__featurecounts | log_prefix = 11__featurecounts | Counting reads
Step: 12, 12__consolidate | log_prefix = 12__consolidate | Consolidate output files
______________________________________________
. .. ... binning-prokaryotic.py || S1 ... .. .
______________________________________________

Executing pipeline:   0%|          | 0/12 [00:00<?, ?it/s]
===============
. 1__coverage .
===============
Input: ['S1_mapped.sorted.bam']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_metabat.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_noheader.tsv']

Command:
cat S1_scaffolds.fasta | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit seq -m 1500 > veba_output/binning/prokaryotic/S1/output/unbinned.fasta && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/coverm contig --threads 1 --methods metabat --bam-files S1_mapped.sorted.bam > veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_metabat.tsv &&
    python -c "import pandas as pd; df = pd.read_csv('veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_metabat.tsv', sep='\t', index_col=0); df.loc[:,df.columns.map(lambda x: x.startswith('mapped.sorted.bam') and (not '-var' in x))].to_csv('veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_noheader.tsv', sep='\t', header=None)"

Validating the following input files:
[=] File exists (185 MB): S1_mapped.sorted.bam

Loading. .. ... .....

Validating the following output files:
[=] File exists (1 MB): veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_metabat.tsv
[=] File exists (1 MB): veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_noheader.tsv

Duration: 00:00:00

================
. 2__pyrodigal .
================
Input: ['S1_scaffolds.fasta']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff', 'veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.faa', 'veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.ffn']

Command:
cat S1_scaffolds.fasta | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit seq -m 1500 > veba_output/binning/prokaryotic/S1/tmp/tmp.fasta && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/pyrodigal -p meta -i veba_output/binning/prokaryotic/S1/tmp/tmp.fasta -g 11 -f gff -d veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.ffn -a veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.faa --min-gene 90 --min-edge-gene 60 --max-overlap 60 > veba_output/binning/prokaryotic/S1/tmp/tmp.gff && cat veba_output/binning/prokaryotic/S1/tmp/tmp.gff | '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/append_geneid_to_prodigal_gff.py' -a gene_id > veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff && rm veba_output/binning/prokaryotic/S1/tmp/tmp.*

Validating the following input files:
[=] File exists (39 MB): S1_scaffolds.fasta

Loading. .. ... .....

Validating the following output files:
[=] File exists (9 MB): veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff
[=] File exists (10 MB): veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.faa
[=] File exists (24 MB): veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.ffn

Duration: 00:00:00

=======================
. 3__binning_metabat2 .
=======================
Input: ['S1_scaffolds.fasta', 'veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_metabat.tsv']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/3__binning_metabat2/scaffolds_to_bins.tsv']

Command:
'/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/binning_wrapper.py' -a metabat2 -f S1_scaffolds.fasta -c veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_metabat.tsv -o veba_output/binning/prokaryotic/S1/intermediate/3__binning_metabat2 -m 1500 -s 150000 --n_jobs 1 --random_state 1 --bin_prefix S1__METABAT2__P.1__ --remove_bins --remove_intermediate_files

Loading. .. ... .....

Duration: 00:00:00

==========================
. 4__binning_maxbin2-107 .
==========================
Input: ['S1_scaffolds.fasta', 'veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_noheader.tsv']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/4__binning_maxbin2-107/scaffolds_to_bins.tsv']

Command:
echo '' > veba_output/binning/prokaryotic/S1/intermediate/4__binning_maxbin2-107/scaffolds_to_bins.tsv && echo 'Skipping MaxBin2'

Loading. .. ... .....

Duration: 00:00:00

=========================
. 5__binning_maxbin2-40 .
=========================
Input: ['S1_scaffolds.fasta', 'veba_output/binning/prokaryotic/S1/intermediate/1__coverage/coverage_noheader.tsv']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/5__binning_maxbin2-40/scaffolds_to_bins.tsv']

Command:
echo '' > veba_output/binning/prokaryotic/S1/intermediate/5__binning_maxbin2-40/scaffolds_to_bins.tsv && echo 'Skipping MaxBin2'

Loading. .. ... .....

Duration: 00:00:00

======================
. 6__binning_concoct .
======================
Input: ['S1_scaffolds.fasta', 'S1_mapped.sorted.bam']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/6__binning_concoct/scaffolds_to_bins.tsv']

Command:
'/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/binning_wrapper.py' --concoct_fragment_length 10000 --concoct_overlap_length 0 -a concoct -f S1_scaffolds.fasta -b S1_mapped.sorted.bam -o veba_output/binning/prokaryotic/S1/intermediate/6__binning_concoct -m 1500 -s 150000 --n_jobs 1 --random_state 1 --bin_prefix S1__CONCOCT__P.1__ --remove_bins --remove_intermediate_files

Loading. .. ... .....

Duration: 00:00:00

==============
. 7__dastool .
==============
Input: ['veba_output/binning/prokaryotic/S1/intermediate/3__binning_metabat2/scaffolds_to_bins.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/4__binning_maxbin2-107/scaffolds_to_bins.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/5__binning_maxbin2-40/scaffolds_to_bins.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/6__binning_concoct/scaffolds_to_bins.tsv', 'S1_scaffolds.fasta']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins', 'veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_summary.txt', 'veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_scaffolds2bin.no_eukaryota.txt']

Command:
S2B=$('/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/check_scaffolds_to_bins.py' -i veba_output/binning/prokaryotic/S1/intermediate/3__binning_metabat2/scaffolds_to_bins.tsv,veba_output/binning/prokaryotic/S1/intermediate/4__binning_maxbin2-107/scaffolds_to_bins.tsv,veba_output/binning/prokaryotic/S1/intermediate/5__binning_maxbin2-40/scaffolds_to_bins.tsv,veba_output/binning/prokaryotic/S1/intermediate/6__binning_concoct/scaffolds_to_bins.tsv -n metabat2,maxbin2-107,maxbin2-40,concoct) && IFS=" " read -r -a S2B_ARRAY <<< "$S2B" && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/DAS_Tool --bins ${S2B_ARRAY[0]} --contigs S1_scaffolds.fasta --outputbasename veba_output/binning/prokaryotic/S1/intermediate/7__dastool/_ --labels ${S2B_ARRAY[1]} --search_engine diamond --score_threshold 0.1 --write_bins 1 --create_plots 0 --threads 1 --proteins veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.faa --debug && cat veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/*.fa | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit seq -m 3000 > veba_output/binning/prokaryotic/S1/tmp/scaffolds.binned.gte3000.fasta && mkdir -p veba_output/binning/prokaryotic/S1/intermediate/7__dastool/consensus_domain_classification && mkdir -p veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/eukaryota && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/tiara -i veba_output/binning/prokaryotic/S1/tmp/scaffolds.binned.gte3000.fasta -o veba_output/binning/prokaryotic/S1/intermediate/7__dastool/consensus_domain_classification/tiara_output.tsv --probabilities -m 3000 -t 1 && '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/consensus_domain_classification.py' -i veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_scaffolds2bin.txt -t veba_output/binning/prokaryotic/S1/intermediate/7__dastool/consensus_domain_classification/tiara_output.tsv -o 
veba_output/binning/prokaryotic/S1/intermediate/7__dastool/consensus_domain_classification --logit_transform softmax && for ID_GENOME in $(cat veba_output/binning/prokaryotic/S1/intermediate/7__dastool/consensus_domain_classification/eukaryota.list); do mv veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/${ID_GENOME}.* veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/eukaryota; done && cat veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/eukaryota/*.fa | grep "^>" | cut -f1 -d " " | cut -c2- > veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/eukaryota/eukaryota.scaffolds.list && cat veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_scaffolds2bin.txt | grep -v -f veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/eukaryota/eukaryota.scaffolds.list > veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_scaffolds2bin.no_eukaryota.txt && cut -f1 veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_scaffolds2bin.no_eukaryota.txt > veba_output/binning/prokaryotic/S1/intermediate/7__dastool/binned.list && '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/partition_gene_models.py' -i veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_scaffolds2bin.no_eukaryota.txt -g veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff -d veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.ffn -a veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.faa -o veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins --use_mag_as_description && rm -rf veba_output/binning/prokaryotic/S1/intermediate/7__dastool/_.seqlength veba_output/binning/prokaryotic/S1/tmp/scaffolds.binned.gte3000.fasta

Loading. .. ... .....

Duration: 00:00:00

==============
. 8__checkm2 .
==============
Input: ['veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins', 'S1_scaffolds.fasta']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/checkm2_results.filtered.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/*.list', 'veba_output/binning/prokaryotic/S1/tmp/unbinned_1.fasta']

Command:
mkdir -p veba_output/binning/prokaryotic/S1/tmp/checkm2 && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/checkm2 predict -i veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins/*.faa --genes -o veba_output/binning/prokaryotic/S1/intermediate/8__checkm2 -t 1 --force -x faa --tmpdir veba_output/binning/prokaryotic/S1/tmp/checkm2 --database_path /expanse/projects/jcl110/db/veba/VDB_v6/Classify/CheckM2/uniref100.KO.1.dmnd && gzip veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/diamond_output/*.tsv && '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/filter_checkm2_results.py' -i veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/quality_report.tsv -b veba_output/binning/prokaryotic/S1/intermediate/7__dastool/__DASTool_bins -o veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered -f S1_scaffolds.fasta -m 1500 --completeness 50.0 --contamination 10.0 -x fa && '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/scaffolds_to_bins.py' -x fa -i veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/genomes > veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/scaffolds_to_bins.tsv && cat S1_scaffolds.fasta | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit seq -m 1500 | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit grep --pattern-file veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/unbinned.list > veba_output/binning/prokaryotic/S1/tmp/unbinned_1.fasta && rm -rf veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/protein_files

Running. .. ... .....

Log files:
veba_output/binning/prokaryotic/S1/log/8__checkm2.*

Duration: 00:12:16

Executing pipeline:  67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                  | 8/12 [12:16<06:08, 92.02s/it]==============
. 9__barrnap .
==============
Input: ['veba_output/binning/prokaryotic/S1/intermediate/*__dastool/consensus_domain_classification/predictions.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/*']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/9__barrnap/*.rRNA', 'veba_output/binning/prokaryotic/S1/intermediate/9__barrnap/*.rRNA.gff']

Command:
cat veba_output/binning/prokaryotic/S1/intermediate/*__dastool/consensus_domain_classification/predictions.tsv > veba_output/binning/prokaryotic/S1/tmp/genomes_to_domain.tsv
OUTPUT_DIRECTORY=veba_output/binning/prokaryotic/S1/intermediate/9__barrnap
FP=veba_output/binning/prokaryotic/S1/tmp/genomes_to_domain.tsv
for DOMAIN in $(cut -f2 $FP | sort -u);
do
    DOMAIN_ABBREVIATION=$(echo $DOMAIN | python -c 'import sys; print(sys.stdin.read().lower()[:3])')

    # Get MAGs for each domain (not all will have passed QC)
    for ID in $(cat $FP | grep $DOMAIN | cut -f1)
    do
        GENOME_FASTA=$(ls veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/$ID.fa) || GENOME_FASTA=""
        if [ -e "$GENOME_FASTA" ]; then
            >$OUTPUT_DIRECTORY/$ID.rRNA
            >$OUTPUT_DIRECTORY/$ID.rRNA.gff
            /expanse/projects/jcl110/miniconda3/envs/test_env/bin/barrnap --kingdom $DOMAIN_ABBREVIATION --threads 1 --lencutoff 0.8 --reject 0.25 --evalue 1e-06 --outseq $OUTPUT_DIRECTORY/$ID.rRNA $GENOME_FASTA  | '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/append_geneid_to_barrnap_gff.py' > $OUTPUT_DIRECTORY/$ID.rRNA.gff
            rm $GENOME_FASTA.fai
        fi
    done
done

rm -f veba_output/binning/prokaryotic/S1/tmp/genomes_to_domain.tsv

Running. .. ... .....

Log files:
veba_output/binning/prokaryotic/S1/log/9__barrnap.*

Duration: 00:12:18

Executing pipeline:  75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                 | 9/12 [12:18<03:15, 65.14s/it]===================
. 10__trnascan-se .
===================
Input: ['veba_output/binning/prokaryotic/S1/intermediate/*__dastool/consensus_domain_classification/predictions.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/*']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/10__trnascan-se/*.tRNA', 'veba_output/binning/prokaryotic/S1/intermediate/10__trnascan-se/*.tRNA.gff', 'veba_output/binning/prokaryotic/S1/intermediate/10__trnascan-se/*.tRNA.struct']

Command:
cat veba_output/binning/prokaryotic/S1/intermediate/*__dastool/consensus_domain_classification/predictions.tsv > veba_output/binning/prokaryotic/S1/tmp/genomes_to_domain.tsv
OUTPUT_DIRECTORY=veba_output/binning/prokaryotic/S1/intermediate/10__trnascan-se
FP=veba_output/binning/prokaryotic/S1/tmp/genomes_to_domain.tsv
for DOMAIN in $(cut -f2 $FP | sort -u);
do
    DOMAIN_ABBREVIATION=$(echo $DOMAIN | python -c 'import sys; print(sys.stdin.read().upper()[:1])')

    # Get MAGs for each domain (not all will have passed QC)
    for ID in $(cat $FP | grep $DOMAIN | cut -f1)
    do
        GENOME_FASTA=$(ls veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/$ID.fa) || GENOME_FASTA=""
        if [ -e "$GENOME_FASTA" ]; then
            >$OUTPUT_DIRECTORY/$ID.tRNA
            >$OUTPUT_DIRECTORY/$ID.tRNA.gff
            >$OUTPUT_DIRECTORY/$ID.tRNA.struct
            >$OUTPUT_DIRECTORY/$ID.tRNA.txt

            TRNA_FASTA=$OUTPUT_DIRECTORY/$ID.tRNA

            if [[ -s "$TRNA_FASTA" ]];
                then
                    echo "[Skipping] [tRNAscan-SE] $GENOME_FASTA because tRNA fasta exists and is not empty"
                else
                    echo "[Running] [tRNAscan-SE] $GENOME_FASTA"
                    /expanse/projects/jcl110/miniconda3/envs/test_env/bin/tRNAscan-SE -$DOMAIN_ABBREVIATION --forceow --progress --threads 1 --fasta $OUTPUT_DIRECTORY/$ID.tRNA --gff $OUTPUT_DIRECTORY/$ID.tRNA.gff --struct $OUTPUT_DIRECTORY/$ID.tRNA.struct  $GENOME_FASTA > $OUTPUT_DIRECTORY/$ID.tRNA.txt
            fi
        fi
    done
done

rm -f veba_output/binning/prokaryotic/S1/tmp/genomes_to_domain.tsv

Running. .. ... .....

Log files:
veba_output/binning/prokaryotic/S1/log/10__trnascan-se.*

Duration: 00:14:21

Executing pipeline:  83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                | 10/12 [14:21<02:44, 82.45s/it]=====================
. 11__featurecounts .
=====================
Input: ['S1_scaffolds.fasta', 'veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff', 'S1_mapped.sorted.bam']
Output: ['veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv.gz']

Command:
mkdir -p veba_output/binning/prokaryotic/S1/tmp/featurecounts && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/featureCounts -G S1_scaffolds.fasta -a veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff -o veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv -F GTF --tmpDir veba_output/binning/prokaryotic/S1/tmp/featurecounts -T 1 -g gene_id -t CDS -p --countReadPairs S1_mapped.sorted.bam && gzip -f veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv

Validating the following input files:
[=] File exists (39 MB): S1_scaffolds.fasta
[=] File exists (9 MB): veba_output/binning/prokaryotic/S1/intermediate/2__pyrodigal/gene_models.gff
[=] File exists (185 MB): S1_mapped.sorted.bam

Running. .. ... .....

Log files:
veba_output/binning/prokaryotic/S1/log/11__featurecounts.*

Validating the following output files:
[=] File exists (365923 bytes): veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv.gz

Duration: 00:14:27

Executing pipeline:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                | 11/12 [14:27<00:59, 59.39s/it]===================
. 12__consolidate .
===================
Input: ['veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/checkm2_results.filtered.tsv', 'veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/*', 'veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv.gz']
Output: ['veba_output/binning/prokaryotic/S1/output/scaffolds_to_bins.tsv', 'veba_output/binning/prokaryotic/S1/output/bins.list', 'veba_output/binning/prokaryotic/S1/output/binned.list', 'veba_output/binning/prokaryotic/S1/output/unbinned.fasta', 'veba_output/binning/prokaryotic/S1/output/genomes', 'veba_output/binning/prokaryotic/S1/output/checkm2_results.filtered.tsv', 'veba_output/binning/prokaryotic/S1/output/featurecounts.orfs.tsv.gz', 'veba_output/binning/prokaryotic/S1/output/genome_statistics.tsv', 'veba_output/binning/prokaryotic/S1/output/gene_statistics.cds.tsv', 'veba_output/binning/prokaryotic/S1/output/gene_statistics.rRNA.tsv', 'veba_output/binning/prokaryotic/S1/output/gene_statistics.tRNA.tsv']

Command:

rm -rf veba_output/binning/prokaryotic/S1/output/*
mkdir -p veba_output/binning/prokaryotic/S1/output/genomes
S2B=$(ls veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/scaffolds_to_bins.tsv) || (echo 'No genomes have been detected' && exit 1)

 cat veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/scaffolds_to_bins.tsv > veba_output/binning/prokaryotic/S1/output/scaffolds_to_bins.tsv && cat veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/bins.list > veba_output/binning/prokaryotic/S1/output/bins.list && cat veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/binned.list > veba_output/binning/prokaryotic/S1/output/binned.list && '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/concatenate_dataframes.py' -a 0 veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/checkm2_results.filtered.tsv > veba_output/binning/prokaryotic/S1/output/checkm2_results.filtered.tsv && DST=veba_output/binning/prokaryotic/S1/output/genomes; for SRC in veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/*; do SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST; done

DIR_RRNA=veba_output/binning/prokaryotic/S1/intermediate/9__barrnap
DIR_TRNA=veba_output/binning/prokaryotic/S1/intermediate/10__trnascan-se
OUTPUT_DIRECTORY=veba_output/binning/prokaryotic/S1/output/genomes
mkdir -p $OUTPUT_DIRECTORY

for GENOME_FASTA in veba_output/binning/prokaryotic/S1/intermediate/*__checkm2/filtered/genomes/*.fa;
do
    ID=$(basename $GENOME_FASTA .fa)
    DIR_GENOME=$(dirname $GENOME_FASTA)
    GFF_CDS=$DIR_GENOME/$ID.gff
    GFF_RRNA=$DIR_RRNA/$ID.rRNA.gff
    GFF_TRNA=$DIR_TRNA/$ID.tRNA.gff
    GFF_OUTPUT=$OUTPUT_DIRECTORY/$ID.gff
    >$GFF_OUTPUT.tmp
    '/expanse/projects/jcl110/miniconda3/envs/test_env/bin/scripts/compile_gff.py' -f $GENOME_FASTA -o $GFF_OUTPUT.tmp -n $ID -c $GFF_CDS -r $GFF_RRNA -t $GFF_TRNA -d Prokaryotic
    mv $GFF_OUTPUT.tmp $GFF_OUTPUT
done

 DST=veba_output/binning/prokaryotic/S1/output/genomes; for SRC in veba_output/binning/prokaryotic/S1/intermediate/9__barrnap/*.rRNA; do SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST; done && DST=veba_output/binning/prokaryotic/S1/output/genomes; for SRC in veba_output/binning/prokaryotic/S1/intermediate/10__trnascan-se/*.tRNA; do SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST; done && SRC=veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv.gz; DST=veba_output/binning/prokaryotic/S1/output; SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit stats -a -b -T -j 1 veba_output/binning/prokaryotic/S1/output/genomes/*.fa | python -c 'import sys, pandas as pd; df = pd.read_csv(sys.stdin, sep="   ", index_col=0); df.index = df.index.map(lambda x: x[:-3]); df.to_csv(sys.stdout, sep=" ")'> veba_output/binning/prokaryotic/S1/output/genome_statistics.tsv && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit stats -a -b -T -j 1 veba_output/binning/prokaryotic/S1/output/genomes/*.ffn | python -c 'import sys, pandas as pd; df = pd.read_csv(sys.stdin, sep="   ", index_col=0); df.index = df.index.map(lambda x: x[:-4]); df.to_csv(sys.stdout, sep=" ")'> veba_output/binning/prokaryotic/S1/output/gene_statistics.cds.tsv && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit stats -a -b -T -j 1 veba_output/binning/prokaryotic/S1/output/genomes/*.rRNA | python -c 'import sys, pandas as pd; df = pd.read_csv(sys.stdin, sep="    ", index_col=0); df.index = df.index.map(lambda x: x[:-5]); df.to_csv(sys.stdout, sep=" ")'> veba_output/binning/prokaryotic/S1/output/gene_statistics.rRNA.tsv && /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit stats -a -b -T -j 1 veba_output/binning/prokaryotic/S1/output/genomes/*.tRNA | python -c 'import sys, pandas as pd; df = pd.read_csv(sys.stdin, sep="   ", index_col=0); df.index = df.index.map(lambda 
x: x[:-5]); df.to_csv(sys.stdout, sep=" ")'> veba_output/binning/prokaryotic/S1/output/gene_statistics.tRNA.tsv && cat S1_scaffolds.fasta | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit grep -v -f veba_output/binning/prokaryotic/S1/output/binned.list -j 1 | /expanse/projects/jcl110/miniconda3/envs/test_env/bin/seqkit seq -j 1 -m 1500 > veba_output/binning/prokaryotic/S1/output/unbinned.fasta && rm -rf veba_output/binning/prokaryotic/S1/tmp/*

Validating the following input files:
[=] File exists (146 bytes): veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/checkm2_results.filtered.tsv
[=] File exists (2 MB): veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/genomes/S1__METABAT2__P.1__bin.1.ffn
[=] File exists (2 MB): veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/genomes/S1__METABAT2__P.1__bin.1.fa
[=] File exists (1 MB): veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/genomes/S1__METABAT2__P.1__bin.1.faa
[=] File exists (1017224 bytes): veba_output/binning/prokaryotic/S1/intermediate/8__checkm2/filtered/genomes/S1__METABAT2__P.1__bin.1.gff
[=] File exists (365923 bytes): veba_output/binning/prokaryotic/S1/intermediate/11__featurecounts/featurecounts.orfs.tsv.gz

Running. .. ... .....

Log files:
veba_output/binning/prokaryotic/S1/log/12__consolidate.*

Validating the following output files:
[=] File exists (10706 bytes): veba_output/binning/prokaryotic/S1/output/scaffolds_to_bins.tsv
[=] File exists (25 bytes): veba_output/binning/prokaryotic/S1/output/bins.list
[=] File exists (6506 bytes): veba_output/binning/prokaryotic/S1/output/binned.list
[=] File exists (24 MB): veba_output/binning/prokaryotic/S1/output/unbinned.fasta
[=] Directory exists (6 MB): veba_output/binning/prokaryotic/S1/output/genomes
[=] File exists (146 bytes): veba_output/binning/prokaryotic/S1/output/checkm2_results.filtered.tsv
[=] File exists (365923 bytes): veba_output/binning/prokaryotic/S1/output/featurecounts.orfs.tsv.gz
[=] File exists (209 bytes): veba_output/binning/prokaryotic/S1/output/genome_statistics.tsv
[=] File exists (201 bytes): veba_output/binning/prokaryotic/S1/output/gene_statistics.cds.tsv
[=] File exists (166 bytes): veba_output/binning/prokaryotic/S1/output/gene_statistics.rRNA.tsv
[=] File exists (187 bytes): veba_output/binning/prokaryotic/S1/output/gene_statistics.tRNA.tsv

Duration: 00:14:30

Executing pipeline: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [14:30<00:00, 72.53s/it]

........................
Total duration: 00:14:30
........................

I'm running it with Docker on my macOS system right now:

# Directories
VEBA_DATABASE=~/VDB-test
LOCAL_WORKING_DIRECTORY=$(pwd)
LOCAL_WORKING_DIRECTORY=$(realpath -m ${LOCAL_WORKING_DIRECTORY})
LOCAL_OUTPUT_PARENT_DIRECTORY=./
LOCAL_OUTPUT_PARENT_DIRECTORY=$(realpath -m ${LOCAL_OUTPUT_PARENT_DIRECTORY})
LOCAL_DATABASE_DIRECTORY=${VEBA_DATABASE} # /path/to/VEBA_DATABASE/
LOCAL_DATABASE_DIRECTORY=$(realpath -m ${LOCAL_DATABASE_DIRECTORY})

CONTAINER_INPUT_DIRECTORY=/volumes/input/
CONTAINER_OUTPUT_DIRECTORY=/volumes/output/
CONTAINER_DATABASE_DIRECTORY=/volumes/database/

# Parameters
ID=S1
FASTA=S1_scaffolds.fasta
BAM=S1_mapped.sorted.bam
NAME=binning-prokaryotic__S1
RELATIVE_OUTPUT_DIRECTORY=veba_output/binning/prokaryotic

# Command
CMD="binning-prokaryotic.py -f ${CONTAINER_INPUT_DIRECTORY}/${FASTA} -b ${CONTAINER_INPUT_DIRECTORY}/${BAM} -n ${ID} -o ${CONTAINER_OUTPUT_DIRECTORY}/${RELATIVE_OUTPUT_DIRECTORY} --veba_database ${CONTAINER_DATABASE_DIRECTORY} --skip_maxbin2"

# Docker
# Version
VERSION=1.4.1

# Image
DOCKER_IMAGE="jolespin/veba_binning-prokaryotic:${VERSION}"

# Run
docker run \
    --name ${NAME} \
    --rm \
    --volume ${LOCAL_WORKING_DIRECTORY}:${CONTAINER_INPUT_DIRECTORY}:ro \
    --volume ${LOCAL_OUTPUT_PARENT_DIRECTORY}:${CONTAINER_OUTPUT_DIRECTORY}:rw \
    --volume ${LOCAL_DATABASE_DIRECTORY}:${CONTAINER_DATABASE_DIRECTORY}:ro \
    ${DOCKER_IMAGE} \
    ${CMD}

It's running right now and I'll update once it finishes locally (it takes a little longer there), but I've confirmed it got past the DAS Tool step successfully.

Note there aren't any changes in the binning-prokaryotic module between v1.4.1 -> v1.4.2 (which is why I didn't update the docker container for it): https://github.com/jolespin/veba/pull/41/commits/8502b7a81c976eeec466f5b6373894de342a75d0

One bit I can try is to slim down the prokaryotic binning module to allow for an updated VEBA-binning-prokaryotic_env with an updated DAS Tool version.

abissett commented 6 months ago

"Here you're using 1.2.0 /scratch3/bis068/conda_envs/vebaEnv1.2.0/ ":

I only used 1.2.0 to run check_scaffolds_to_bins.py. In that example I then module-loaded dastool 1.3.0 on our HPC and ran it directly on the HPC (not from within the VEBA container or env). I used 1.2.0 because I had it handy from a previous conda install and was too lazy to install a more current version, or to get the .py from git or the container, figuring the version didn't really matter for this step of creating the array.
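
For context, the array step mentioned here is visible in the logged `7__dastool` command: the output of check_scaffolds_to_bins.py is captured as a single space-separated string and split into a bins field and a labels field before being handed to DAS_Tool. A minimal sketch of that split, assuming the script prints exactly two space-separated fields. The sample string is made up for illustration, and while the pipeline itself uses `read -r -a` into a bash array, this version uses plain parameter expansion so it runs in any POSIX shell:

```shell
# Hypothetical stand-in for the output of check_scaffolds_to_bins.py:
# "<comma-separated scaffolds_to_bins tables> <comma-separated labels>"
S2B="metabat2/scaffolds_to_bins.tsv,concoct/scaffolds_to_bins.tsv metabat2,concoct"

# Split on the first space: field 1 goes to --bins, field 2 to --labels
BINS=${S2B%% *}
LABELS=${S2B#* }

# Print the resulting call instead of running it
echo "DAS_Tool --bins ${BINS} --labels ${LABELS}"
```

If the printed `--bins` and `--labels` values look right, the failure is more likely in DAS_Tool itself than in this preparation step.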

I'm wanting to run the container versions if possible; it's much easier for me on the HPC here than setting up local conda envs. I'm continually running into conflicts with all the conda envs / installs etc. on the HPC, and the containers are much cleaner. I need to run them as singularity here, since our HPC doesn't support docker.

To build the singularity containers: singularity pull docker://jolespin/veba_binning-prokaryotic:1.4.1

The containers for the previous modules worked as expected (preprocess, assemble, virbin), all built and deployed the same way.

abissett commented 6 months ago

Sorry, that version of dastool above should be 1.1.3, not 1.3.0!

jolespin commented 6 months ago

I'm wanting to run the container versions if possible, it's much easier for me on the HPC here than setting local conda envs. I'm continually running into conflicts with all the conda envs / installs etc on hpc and the containers are much cleaner. I need to run them as singularity here, our HPC doesn't support docker...

Yea, neither does ours. I typically test locally with Docker. I haven't been able to get singularity installed and running correctly on our HPC unfortunately.

Do you install singularity with conda/mamba?

Also, how can you specify the volume mount points with singularity? I can try getting a working walkthrough for you.

abissett commented 6 months ago

In the above examples I'm running singularity as an HPC module (as in "module load singularity"), so it's installed by the sys admin, not by me in this case. I've previously installed from source as per https://docs.sylabs.io/guides/3.0/user-guide/installation.html, but it's preferred for me to use the module version.

I haven't tried running your container interactively, only via "singularity run etc" as above. I think the bind/mount points are set/enabled/allowed by your admin in this case. You can use "singularity shell" to use the container interactively, and you can specify the mount points as described here: https://docs.sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html.
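
As a rough sketch of what specifying mount points looks like: Singularity's `--bind` option takes comma-separated `source:destination` pairs. The helper below only assembles and prints the command so it can be inspected before running. The `/volumes/...` destinations mirror the container directories in the Docker example earlier in the thread; the function name and host paths are placeholders, not part of VEBA.

```shell
# join_binds is a hypothetical helper: joins "src:dest" pairs with commas,
# which is the format singularity's --bind option expects.
join_binds() {
    out=""
    for pair in "$@"; do
        out="${out:+${out},}${pair}"
    done
    printf '%s\n' "$out"
}

# Placeholder host paths; replace with real directories
BINDS=$(join_binds \
    "$(pwd):/volumes/input" \
    "$(pwd):/volumes/output" \
    "/path/to/VEBA_DATABASE:/volumes/database")

# Print rather than execute, so the sketch runs without singularity installed
echo "singularity shell --bind ${BINDS} veba_binning-prokaryotic_1.4.1.sif"
```

Whether user-specified binds are honoured can depend on how the admin has configured Singularity, as noted above.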

I'll take a look at running it interactively later today. Is there a specific thing you're looking for if I do?

jolespin commented 6 months ago

Ok I believe our HPC has the same usage for singularity

https://www.sdsc.edu/support/user_guides/expanse.html#modules

I'm pushing a new update Feb 1st so I'll give this a try when I do. Apologies for the inconvenience. These containers are supposed to solve these types of issues.

abissett commented 5 months ago

Any progress here? I downloaded the container for 1.5.0 (veba_binning-prokaryotic_1.5.0) and ran it, but encountered the same error with the dastools step.

Here are the last few lines of the dastool log, indicating the same issue.......

processing query block 1, reference block 1/1, shape 2/2, index chunk 4/4.
Building reference seed array... [0.093s]
Building query seed array... [0.008s]
Computing hash join... [0.072s]
Building seed filter... [0.002s]
Searching alignments... [0.004s]
Deallocating buffers... [0.006s]
Clearing query masking... [0.001s]
Computing alignments... [2.095s]
Deallocating reference... [0s]
Loading reference sequences... [0s]
Deallocating buffers... [0s]
Deallocating queries... [0s]
Loading query sequences... [0s]
Closing the input file... [0s]
Closing the output file... [0s]
Closing the database file... [0s]
Deallocating taxonomy... [0s]
Total time = 4.266s
Reported 10594 pairwise alignments, 10594 HSPs.
10594 queries aligned.
The host system is detected to have 540 GB of RAM. It is recommended to increase the block size for better performance using these parameters : -b12 -c1
starting annotations of single copy cogs...
successfully finished calculating contig lengths.
ERROR: DAS_Tool R-package (version 1.1.2) is not installed
Please install the current version of DAS_Tool using:
$ cd DAS_Tool_installation_directory
$ R CMD INSTALL package/DASTool_1.1.2.tar.gz
Or read the documentation for more detailed instructions
Classification took 0.047435760498046875 seconds.
0.0 sequences per second.
0.0 base pairs per second.
Classification done.
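
The important line in that log is the `ERROR: DAS_Tool R-package (version 1.1.2) is not installed` message, which is easy to miss among the DIAMOND output. A small sketch for pulling such lines out of the step logs; the function name is made up, and the demo runs against a throwaway file rather than a real VEBA log:

```shell
# show_errors is a hypothetical helper: print every line containing "ERROR"
# from the given log files, prefixed with file name and line number.
show_errors() {
    grep -n -H "ERROR" "$@" || echo "No ERROR lines found"
}

# Demo on a temporary file so the sketch runs anywhere
LOG=$(mktemp)
printf '%s\n' \
    "successfully finished calculating contig lengths." \
    "ERROR: DAS_Tool R-package (version 1.1.2) is not installed" > "$LOG"
show_errors "$LOG"
rm -f "$LOG"
```

On a real run this would be pointed at the step logs, e.g. `show_errors ${OUT_DIR}/${ID}/log/7__dastool.*`, assuming the same `log/` layout shown in the pipeline output above.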

jolespin commented 5 months ago

Looking into this right now. I did some setup changes on the repository so I'll double check that the docker container is working as expected.

Here's a test to pull and check for DAS_Tool:

(base) jespinozlt2-osx:~ jespinoz$ docker pull jolespin/veba_binning-prokaryotic:1.5.0
1.5.0: Pulling from jolespin/veba_binning-prokaryotic
Digest: sha256:42664a94adea901acf6ab0dda7543c1542aa58d7ed52b9f1acb668fcf615a727
Status: Image is up to date for jolespin/veba_binning-prokaryotic:1.5.0
docker.io/jolespin/veba_binning-prokaryotic:1.5.0

What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview jolespin/veba_binning-prokaryotic:1.5.0
(base) jespinozlt2-osx:~ jespinoz$ docker run --name VEBA-binning-prokaryotic --rm -it jolespin/veba_binning-prokaryotic:1.5.0  bash
(base) mambauser@da92f02d68c6:/tmp$ DAS_Tool -h

DAS Tool version 1.1.2

Usage: DAS_Tool -i methodA.scaffolds2bin,...,methodN.scaffolds2bin
                -l methodA,...,methodN -c contigs.fa -o myOutput

   -i, --bins                 Comma separated list of tab separated scaffolds to bin tables.
   -c, --contigs              Contigs in fasta format.
   -o, --outputbasename       Basename of output files.
   -l, --labels               Comma separated list of binning prediction names. (optional)
   --search_engine            Engine used for single copy gene identification [blast/diamond/usearch].
                              (default: usearch)
   --write_bin_evals          Write evaluation for each input bin set [0/1]. (default: 1)
   --create_plots             Create binning performance plots [0/1]. (default: 1)
   --write_bins               Export bins as fasta files  [0/1]. (default: 0)
   --write_unbinned           Report unbinned contigs. To export as fasta file also set write_bins==1 [0/1]. (default: 0)
   --proteins                 Predicted proteins in prodigal fasta format (>scaffoldID_geneNo).
                              Gene prediction step will be skipped if given. (optional)
   -t, --threads              Number of threads to use. (default: 1)
   --score_threshold          Score threshold until selection algorithm will keep selecting bins [0..1].
                              (default: 0.5)
   --duplicate_penalty        Penalty for duplicate single copy genes per bin (weight b).
                              Only change if you know what you're doing. [0..3]
                              (default: 0.6)
   --megabin_penalty          Penalty for megabins (weight c). Only change if you know what you're doing. [0..3]
                              (default: 0.5)
   --db_directory             Directory of single copy gene database. (default: install_dir/db)
   --resume                   Use existing predicted single copy gene files from a previous run [0/1]. (default: 0)
   --debug                    Write debug information to log file.
   -v, --version              Print version number and exit.
   -h, --help                 Show this message.

Example 1: Run DAS Tool on binning predictions of MetaBAT, MaxBin, CONCOCT and tetraESOMs. Output files will start with the prefix DASToolRun1:
   DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv -l concoct,maxbin,metabat,tetraESOM -c sample_data/sample.human.gut_contigs.fa -o sample_output/DASToolRun1

Example 2:  Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step. Set the number of threads to 2 and score threshold to 0.1. Output files will start with the prefix DASToolRun2:
   DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv -l concoct,maxbin,metabat,tetraESOM -c sample_data/sample.human.gut_contigs.fa -o sample_output/DASToolRun2 --threads 2 --score_threshold 0.6 --proteins sample_output/DASToolRun1_proteins.faa

Please cite: Sieber et al., 2018, Nature Microbiology (https://doi.org/10.1038/s41564-018-0171-1).

(base) mambauser@da92f02d68c6:/tmp$ DAS_Tool --version

DAS Tool version 1.1.2

I just ran the new container on my local machine and it worked:

# Directories
VEBA_DATABASE=~/VDB-test
LOCAL_WORKING_DIRECTORY=$(pwd)
LOCAL_WORKING_DIRECTORY=$(realpath -m ${LOCAL_WORKING_DIRECTORY})
LOCAL_OUTPUT_PARENT_DIRECTORY=./
LOCAL_OUTPUT_PARENT_DIRECTORY=$(realpath -m ${LOCAL_OUTPUT_PARENT_DIRECTORY})
LOCAL_DATABASE_DIRECTORY=${VEBA_DATABASE} # /path/to/VEBA_DATABASE/
LOCAL_DATABASE_DIRECTORY=$(realpath -m ${LOCAL_DATABASE_DIRECTORY})

CONTAINER_INPUT_DIRECTORY=/volumes/input/
CONTAINER_OUTPUT_DIRECTORY=/volumes/output/
CONTAINER_DATABASE_DIRECTORY=/volumes/database/

# Parameters
ID=S1
FASTA=S1_scaffolds.fasta
BAM=S1_mapped.sorted.bam
NAME=binning-prokaryotic__S1
RELATIVE_OUTPUT_DIRECTORY=veba_output/binning/prokaryotic

# Command
CMD="binning-prokaryotic.py -f ${CONTAINER_INPUT_DIRECTORY}/${FASTA} -b ${CONTAINER_INPUT_DIRECTORY}/${BAM} -n ${ID} -o ${CONTAINER_OUTPUT_DIRECTORY}/${RELATIVE_OUTPUT_DIRECTORY} --veba_database ${CONTAINER_DATABASE_DIRECTORY} --skip_maxbin2"

# Docker
# Version
VERSION=1.5.0

# Image
DOCKER_IMAGE="jolespin/veba_binning-prokaryotic:${VERSION}"

# Run
docker run \
    --name ${NAME} \
    --rm \
    --volume ${LOCAL_WORKING_DIRECTORY}:${CONTAINER_INPUT_DIRECTORY}:ro \
    --volume ${LOCAL_OUTPUT_PARENT_DIRECTORY}:${CONTAINER_OUTPUT_DIRECTORY}:rw \
    --volume ${LOCAL_DATABASE_DIRECTORY}:${CONTAINER_DATABASE_DIRECTORY}:ro \
    ${DOCKER_IMAGE} \
    ${CMD}

....

Validating the following output files:
[=] File exists (10706 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/scaffolds_to_bins.tsv
[=] File exists (25 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/bins.list
[=] File exists (6506 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/binned.list
[=] File exists (24 MB): /volumes/output//veba_output/binning/prokaryotic/S1/output/unbinned.fasta
[=] Directory exists (6 MB): /volumes/output//veba_output/binning/prokaryotic/S1/output/genomes
[=] File exists (146 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/checkm2_results.filtered.tsv
[=] File exists (365909 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/featurecounts.orfs.tsv.gz
[=] File exists (209 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/genome_statistics.tsv
[=] File exists (201 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/gene_statistics.cds.tsv
[=] File exists (166 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/gene_statistics.rRNA.tsv
[=] File exists (187 bytes): /volumes/output//veba_output/binning/prokaryotic/S1/output/gene_statistics.tRNA.tsv

Duration: 01:05:12

........................
Total duration: 01:05:12
........................

Can you try running the Docker container on your local machine?

The binning is pretty low resource so it shouldn't be too hard on your system.

Tomorrow I'll try to contact our server company about using Singularity so I can test.

jolespin commented 5 months ago

I'm trying this right now but having issues with singularity loading the correct PATH within the container.

Hopefully I can get some help on this: https://stackoverflow.com/questions/77958891/singularity-exec-is-not-recognizing-executables-in-container-path-converted-mi
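
A quick way to narrow down this kind of PATH problem is to check which executables actually resolve. The sketch below is only an illustration: `check_on_path` is a hypothetical helper, not part of VEBA or Singularity, and to inspect the container rather than the host it would have to be run inside the string passed to `singularity exec ... bash -c`.

```shell
# Hypothetical helper: report which of the named executables resolve on the
# current PATH. Returns non-zero if at least one of them is missing.
check_on_path() {
    missing=0
    for exe in "$@"; do
        if command -v "${exe}" >/dev/null 2>&1; then
            echo "found:   ${exe} -> $(command -v "${exe}")"
        else
            echo "missing: ${exe}"
            missing=1
        fi
    done
    return "${missing}"
}

# Run on the host, this only inspects the host PATH.
check_on_path bash binning-prokaryotic.py || echo "PATH is incomplete"
```

If the workflow script shows up as missing inside the container, the PATH there is likely not what the image intended.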

Apologies for the delay. I'll try to get this resolved ASAP. In the meantime, I have a workaround that just worked for me:

declare -xr SINGULARITY_MODULE='singularitypro/3.9'

module purge
module load "${SINGULARITY_MODULE}"

# Local directories
VEBA_DATABASE=/expanse/projects/jcl110/db/veba/VDB_v6/
LOCAL_WORKING_DIRECTORY=$(pwd)
LOCAL_WORKING_DIRECTORY=$(realpath -m ${LOCAL_WORKING_DIRECTORY})
LOCAL_DATABASE_DIRECTORY=${VEBA_DATABASE} # /path/to/VEBA_DATABASE/
LOCAL_DATABASE_DIRECTORY=$(realpath -m ${LOCAL_DATABASE_DIRECTORY})

# Container directories
CONTAINER_INPUT_DIRECTORY=/volumes/input/
CONTAINER_OUTPUT_DIRECTORY=/volumes/output/
CONTAINER_DATABASE_DIRECTORY=/volumes/database/

FASTA=${CONTAINER_INPUT_DIRECTORY}/veba_output/assembly/S1/output/scaffolds.fasta
BAM=${CONTAINER_INPUT_DIRECTORY}/veba_output/assembly/S1/output/mapped.sorted.bam
OUTPUT_DIRECTORY=${CONTAINER_OUTPUT_DIRECTORY}/test_output/
NAME="S1"

SINGULARITY_IMAGE="containers/veba_binning-prokaryotic__1.5.0.sif"
singularity exec \
    --bind ${LOCAL_WORKING_DIRECTORY}:${CONTAINER_INPUT_DIRECTORY},${LOCAL_WORKING_DIRECTORY}:${CONTAINER_OUTPUT_DIRECTORY},${LOCAL_DATABASE_DIRECTORY}:${CONTAINER_DATABASE_DIRECTORY} \
     ${SINGULARITY_IMAGE} \
    bash -c \
"export PATH=/usr/bin/:/opt/conda/bin/; export CONDA_PREFIX=/opt/conda/; binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${NAME} -o ${OUTPUT_DIRECTORY} --veba_database ${CONTAINER_DATABASE_DIRECTORY} --skip_maxbin2"

...


Validating the following output files:
[=] File exists (52907 bytes): /volumes/output//test_output/S1/output/scaffolds_to_bins.tsv
[=] File exists (46 bytes): /volumes/output//test_output/S1/output/bins.list
[=] File exists (33923 bytes): /volumes/output//test_output/S1/output/binned.list
[=] File exists (22 MB): /volumes/output//test_output/S1/output/unbinned.fasta
[=] Directory exists (13 MB): /volumes/output//test_output/S1/output/genomes
[=] File exists (214 bytes): /volumes/output//test_output/S1/output/checkm2_results.filtered.tsv
[=] File exists (365908 bytes): /volumes/output//test_output/S1/output/featurecounts.orfs.tsv.gz
[=] File exists (312 bytes): /volumes/output//test_output/S1/output/genome_statistics.tsv
[=] File exists (299 bytes): /volumes/output//test_output/S1/output/gene_statistics.cds.tsv
[=] File exists (229 bytes): /volumes/output//test_output/S1/output/gene_statistics.rRNA.tsv
[=] File exists (229 bytes): /volumes/output//test_output/S1/output/gene_statistics.tRNA.tsv

Duration: 00:28:58

Executing pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [28:58<00:00, 72.42s/it]

........................
Total duration: 00:28:58
........................

I built the singularity container like this:

singularity pull containers/veba_binning-prokaryotic__1.5.0.sif docker://jolespin/veba_binning-prokaryotic:1.5.0
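
The `--bind` argument in the workaround above is a comma-separated list of `host:container` pairs, which gets hard to read on one line. A minimal sketch of building it from separate pairs follows; `join_binds` is a hypothetical helper and the host paths are placeholders, not paths from this thread.

```shell
# Hypothetical helper: join "host:container" pairs into the single
# comma-separated string expected by `singularity exec --bind`.
join_binds() {
    result=""
    for pair in "$@"; do
        if [ -z "${result}" ]; then
            result="${pair}"
        else
            result="${result},${pair}"
        fi
    done
    printf '%s\n' "${result}"
}

# Placeholder host paths; the container paths match the layout used above.
BINDS=$(join_binds \
    "/data/project:/volumes/input/" \
    "/data/project:/volumes/output/" \
    "/data/db:/volumes/database/")
echo "--bind ${BINDS}"
```

The resulting string can be passed as `--bind "${BINDS}"` in place of the long inline argument.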
abissett commented 5 months ago

Hi Josh, I pulled the container as above (also using "apptainer pull" rather than "singularity") and then ran the binning workflow on the S1 example and a real metagenomic sample. Both completed with the new container. I didn't run the workflow interactively from within the container as you did above.

In case it's useful, I simply ran the workflow on our HPC via the Slurm scheduler, as below.

Thanks for updating, it seems to have fixed the issue!

As an aside (maybe I should add it to a new "issue"?), I did run into an error with CheckM2 not liking paths longer than some number of characters (OSError: AF_UNIX path too long). Simply shortening the paths removed the error. I don't think it's an issue when running with the default path names from the examples, but it could be when testing things and making output paths longer (as in my case).

#!/bin/bash

#SBATCH --job-name=ProcBin
#SBATCH --time 72:02:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task 1
#SBATCH --mem=256gb

export OMP_NUM_THREADS=32
export VEBA_DATABASE=<PATH_TO_veba/db>

N_JOBS=32
N_ITER=4

in=S1

ID=${in}

OUT_DIR=veba_output/binning/prok

FASTA=veba_output/binning/viral/${ID}/output/unbinned.fasta
BAM=veba_output/assembly/${ID}/output/mapped.sorted.bam

module load apptainer

CMD="apptainer run veba_binning-prokaryotic_1.5.0.sif binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -o ${OUT_DIR} -m 1500 -I ${N_ITER} --skip_maxbin2"

echo ${CMD}

${CMD}
jolespin commented 5 months ago

This is EXTREMELY useful information. Thank you! I'm still learning about Singularity and had never heard of Apptainer, but this seems like a much more straightforward implementation.

As an aside (maybe I should add it to a new "issue"?), I did run into an error with CheckM2 not liking paths longer than some number of characters (OSError: AF_UNIX path too long). Simply shortening the paths removed the error. I don't think it's an issue when running with the default path names from the examples, but it could be when testing things and making output paths longer (as in my case).

I've encountered this issue too. One workaround I've used is specifying the temporary directory:

https://github.com/jolespin/veba/blob/c27c0d639a31246ca05613a4f79858416fdbe6b0/bin/binning-prokaryotic.py#L363

Give this a try here:

--tmpdir ./tmp/
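
As a rough illustration of why a shorter temporary directory helps: on Linux, a Unix socket path has to fit in a 108-byte field (including the terminating NUL). The sketch below checks whether a candidate directory leaves enough room, assuming the socket is created somewhere under that directory. The helper name, the 40-character allowance for the socket file name, and `/tmp/veba_tmp` are all assumptions for illustration.

```shell
# Hypothetical helper: succeed only if a socket created inside the given
# directory would plausibly fit in the 108-byte Unix socket path field.
tmpdir_is_short_enough() {
    # Make the path absolute without resolving symlinks.
    case "$1" in
        /*) dir="$1" ;;
        *)  dir="${PWD}/$1" ;;
    esac
    margin=40    # assumed room for the socket file name below the directory
    limit=107    # 108 bytes minus the terminating NUL
    [ $(( ${#dir} + 1 + margin )) -le "${limit}" ]
}

TMP_DIRECTORY=/tmp/veba_tmp    # hypothetical short location
if tmpdir_is_short_enough "${TMP_DIRECTORY}"; then
    echo "OK: ${TMP_DIRECTORY}"
else
    echo "Too long: ${TMP_DIRECTORY}"
fi
```

A directory that passes the check can then be given to `--tmpdir`; note that a relative path such as `./tmp/` is checked against its absolute form here, which is the more conservative choice.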

Can you create a new issue for this? I feel like other people have encountered this before too (I certainly have).