rhysnewell / rosella

Metagenomic Binning Algorithm
BSD 3-Clause "New" or "Revised" License

Flight failing when generating combinations of tids: empty tid list due to contig names not matching in assembly and coverage file #37

Closed gabrieleghiotto closed 12 months ago

gabrieleghiotto commented 1 year ago

Hi, I am trying to use rosella in my binning workflow, but I have a problem with flight. The error is shown below. I have also attached the conda environment, which I created using the updated YAML (24/05/23).

[2023-05-25T08:34:25Z INFO rosella] rosella version 0.4.2
[2023-05-25T08:34:25Z INFO rosella] Using min-covered-fraction 0%
[2023-05-25T08:34:25Z INFO rosella] Using min-read-aligned-percent 0%
[00:00:03] ████████████████████████████████████████ 338614/338614 Read results from previous run. If this is not desired please rerun with --force... ETA: [0s]
[00:00:07] ⠠ Calculating UMAP embeddings and clustering... 3/6
[2023-05-25T08:34:37Z ERROR bird_tool_utils::command] Error when running flight process. Exitstatus was : ExitStatus(unix_wait_status(256))
thread 'main' panicked at 'Failed to grab stderr from failed flight process', /home/conda/.cargo/registry/src/github.com-1ecc6299db9ec823/bird_tool_utils-0.3.0/src/command.rs:17:14

env.txt

rhysnewell commented 1 year ago

Hi @gabrieleghiotto, apologies for the delay. I've been swamped but had some inspiration this weekend. v0.5.0 should fix this issue if you create a new conda environment and install from the bioconda channel. If you still get an error from flight, please open a new issue and post the error log that gets produced. The flight error should now be exposed to users.

gabrieleghiotto commented 1 year ago

Dear Rhys, I finally managed to install it. However, when I try to use it to perform the binning (rosella recover), I get the following error. Any clues?

rhysnewell commented 1 year ago

Hi @gabrieleghiotto, I think you may have forgotten to attach the error you are receiving.

gabrieleghiotto commented 1 year ago

COMMANDS:
conda activate coverm
coverm contig -m metabat --bam-files ../*.sorted.bam -o coverm.cov --threads 10
conda activate rosella
rosella recover -i coverm.cov -r ../../assembly_spades/r1_assembly.fa --output-directory rosella_out/ --threads 10

OUTPUT:
[2023-11-06T08:24:46Z INFO rosella] rosella version 0.4.2
[2023-11-06T08:24:46Z INFO rosella] Using min-covered-fraction 0%
[2023-11-06T08:24:46Z INFO rosella] Using min-read-aligned-percent 0%
[2023-11-06T08:24:46Z INFO rosella::utils] Generating reference index
thread 'main' panicked at 'Unable to generate index: Failed to read fasta index from "../../assembly_spades/r1_assembly.fa.fai"

Caused by:
    0: No such file or directory (os error 2)
    1: No such file or directory (os error 2)', src/utils.rs:734:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

rhysnewell commented 1 year ago

Thanks for posting the error. A couple of things: I would highly recommend upgrading to v0.5.0. It is faster, more stable, and should produce slightly better results. You'll need a fresh environment, but that should be easily done via:

conda create -n rosella_v0.5.0 rosella=0.5.0

If you do update, your new rosella command would just have to change to the following:

rosella recover -C coverm.cov -r ../../assembly_spades/r1_assembly.fa -o rosella_out/ -t 10

The error you posted seems to indicate that rosella is unable to generate the fasta index for the provided assembly. The things you would need to check are:

  1. Does the assembly exist at the location specified? I get this wrong all the time, so it is always worth double checking.
  2. Do you have write permission in the folder where the fasta file is stored? Rosella won't be able to generate the fasta index if your user does not have permission to write into the assembly folder.
  3. Check that the fasta file isn't corrupted somehow.

If none of those turns out to be the problem, then try running samtools faidx ../../assembly_spades/r1_assembly.fa and see what samtools has to say about it.
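The three checks above can also be scripted. This is only a minimal sketch, not something rosella ships: the function name check_assembly is hypothetical, and the "corruption" test is just a cheap check that the first line looks like a FASTA header.

```python
import os
from pathlib import Path


def check_assembly(fasta_path):
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    fasta = Path(fasta_path)

    # 1. Does the assembly exist at the location specified?
    if not fasta.is_file():
        problems.append(f"assembly not found: {fasta}")
        return problems  # the remaining checks need the file to exist

    # 2. Is the containing folder writable, so an index file can be created next to it?
    folder = fasta.resolve().parent
    if not os.access(folder, os.W_OK):
        problems.append(f"no write permission in {folder}")

    # 3. Cheap sanity check: a plain-text FASTA file starts with a '>' header line.
    with open(fasta, "r", errors="replace") as handle:
        first_line = handle.readline()
    if not first_line.startswith(">"):
        problems.append("first line is not a FASTA header (file may be corrupted or compressed)")

    return problems


if __name__ == "__main__":
    for problem in check_assembly("../../assembly_spades/r1_assembly.fa"):
        print("PROBLEM:", problem)
```

An empty result only means these basic checks passed. A truncated or partly corrupted file can still start with a valid header, so samtools faidx remains the more thorough test.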

gabrieleghiotto commented 1 year ago

The initial problem was solved, but a new one emerged. As far as I understand, the software managed to produce the initial set of bins but failed in the refinement step. The same error was repeated for all 175 bins, and no bins were present in the refined_bins output folder.

[2023-11-06T12:25:51Z ERROR rosella::refine::refinery] Traceback (most recent call last):
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/bin/flight", line 10, in
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     sys.exit(main())
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/lib/python3.10/site-packages/flight/flight.py", line 456, in main
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     args.func(args)
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/lib/python3.10/site-packages/flight/flight.py", line 579, in refine
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     rosella.perform_refining(args)
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/lib/python3.10/site-packages/flight/rosella/rosella.py", line 433, in perform_refining
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     self.slow_refine(plots, 0, 10, x_min, x_max, y_min, y_max, False,
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/lib/python3.10/site-packages/flight/rosella/rosella.py", line 136, in slow_refine
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     plots, n = self.validate_bins(plots, n,
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/lib/python3.10/site-packages/flight/rosella/validating.py", line 393, in validate_bins
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     metrics.get_averages(np.concatenate((contigs.iloc[:, 3:].values,
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]   File "/home/bioinfo/anaconda3/envs/rosella/lib/python3.10/site-packages/flight/metrics.py", line 927, in get_averages
[2023-11-06T12:25:51Z ERROR rosella::refine::refinery]     pairs = combinations(tids, 2)
[00:06:12] [████████████████████████████████████████] 175/175 Finished refining MAGs
[2023-11-06T12:25:51Z ERROR rosella] Recover Failed with error: Flight failed with exit code: exit status: 1

rhysnewell commented 1 year ago

Hi @gabrieleghiotto,

Interesting, looks like the final part of the error is clipped off. First could you show me what versions of packages conda installed for you (conda list while inside of your new rosella env). When you updated rosella, did you create a fresh environment (delete the old env and then call the new rosella as well) or just reuse the old environment (ran conda install rosella=0.5.0 in the pre-existing rosella env)?

Cheers, Rhys

gabrieleghiotto commented 12 months ago

When I performed the update I deleted the old environment. The new environment is the following:

# packages in environment at /home/bioinfo/anaconda3/envs/rosella:
#
# Name  Version  Build  Channel
_libgcc_mutex  0.1  conda_forge  conda-forge
_openmp_mutex  4.5  2_gnu  conda-forge
asttokens  2.4.1  pyhd8ed1ab_0  conda-forge
backports  1.0  pyhd8ed1ab_3  conda-forge
backports.functools_lru_cache  1.6.5  pyhd8ed1ab_0  conda-forge
biopython  1.81  py310h2372a71_1  conda-forge
brotli  1.1.0  hd590300_1  conda-forge
brotli-bin  1.1.0  hd590300_1  conda-forge
brotli-python  1.1.0  py310hc6cd4ac_1  conda-forge
bwa  0.7.17  he4a0461_11  bioconda
bzip2  1.0.8  h7f98852_4  conda-forge
c-ares  1.21.0  hd590300_0  conda-forge
ca-certificates  2023.7.22  hbcca054_0  conda-forge
cachecontrol  0.13.1  pyhd8ed1ab_0  conda-forge
cached-property  1.5.2  hd8ed1ab_1  conda-forge
cached_property  1.5.2  pyha770c72_1  conda-forge
certifi  2023.7.22  pyhd8ed1ab_0  conda-forge
charset-normalizer  3.3.2  pyhd8ed1ab_0  conda-forge
colorama  0.4.6  pyhd8ed1ab_0  conda-forge
contourpy  1.2.0  py310hd41b1e2_0  conda-forge
coverm  0.6.1  h07ea13f_6  bioconda
cycler  0.12.1  pyhd8ed1ab_0  conda-forge
cython  3.0.5  py310hc6cd4ac_0  conda-forge
dashing  1.0  h40c17d1_2  bioconda
decorator  5.1.1  pyhd8ed1ab_0  conda-forge
exceptiongroup  1.1.3  pyhd8ed1ab_0  conda-forge
executing  2.0.1  pyhd8ed1ab_0  conda-forge
fastani  1.34  h4dfc31f_0  bioconda
flight-genome  1.6.0  pyh7cba7a3_0  bioconda
fonttools  4.44.0  py310h2372a71_0  conda-forge
freetype  2.12.1  h267a509_2  conda-forge
gsl  2.7  he838d99_0  conda-forge
h5py  3.10.0  nompi_py310ha2ad45a_100  conda-forge
hdbscan  0.8.33  py310h1f7b6fc_4  conda-forge
hdf5  1.14.2  nompi_h4f84152_100  conda-forge
hdmedians  0.14.2  py310h1f7b6fc_4  conda-forge
htslib  1.18  h81da01d_0  bioconda
icu  73.2  h59595ed_0  conda-forge
idna  3.4  pyhd8ed1ab_0  conda-forge
imageio  2.31.5  pyh8c1a49c_0  conda-forge
iniconfig  2.0.0  pyhd8ed1ab_0  conda-forge
ipython  8.17.2  pyh41d4057_0  conda-forge
jedi  0.19.1  pyhd8ed1ab_0  conda-forge
joblib  1.3.0  pyhd8ed1ab_1  conda-forge
k8  0.2.5  hdcf5f25_4  bioconda
keyutils  1.6.1  h166bdaf_0  conda-forge
kiwisolver  1.4.5  py310hd41b1e2_1  conda-forge
krb5  1.21.2  h659d440_0  conda-forge
lcms2  2.15  hb7c19ff_3  conda-forge
ld_impl_linux-64  2.40  h41732ed_0  conda-forge
lerc  4.0.0  h27087fc_0  conda-forge
libaec  1.1.2  h59595ed_1  conda-forge
libblas  3.9.0  19_linux64_openblas  conda-forge
libbrotlicommon  1.1.0  hd590300_1  conda-forge
libbrotlidec  1.1.0  hd590300_1  conda-forge
libbrotlienc  1.1.0  hd590300_1  conda-forge
libcblas  3.9.0  19_linux64_openblas  conda-forge
libcurl  8.4.0  hca28451_0  conda-forge
libdeflate  1.19  hd590300_0  conda-forge
libedit  3.1.20191231  he28a2e2_2  conda-forge
libev  4.33  h516909a_1  conda-forge
libffi  3.4.2  h7f98852_5  conda-forge
libgcc-ng  13.2.0  h807b86a_2  conda-forge
libgfortran-ng  13.2.0  h69a702a_2  conda-forge
libgfortran5  13.2.0  ha4646dd_2  conda-forge
libgomp  13.2.0  h807b86a_2  conda-forge
libhwloc  2.9.3  default_h554bfaf_1009  conda-forge
libiconv  1.17  h166bdaf_0  conda-forge
libjpeg-turbo  3.0.0  hd590300_1  conda-forge
liblapack  3.9.0  19_linux64_openblas  conda-forge
libllvm14  14.0.6  hcd5def8_4  conda-forge
libnghttp2  1.55.1  h47da74e_0  conda-forge
libnsl  2.0.1  hd590300_0  conda-forge
libopenblas  0.3.24  pthreads_h413a1c8_0  conda-forge
libpng  1.6.39  h753d276_0  conda-forge
libsqlite  3.44.0  h2797004_0  conda-forge
libssh2  1.11.0  h0841786_0  conda-forge
libstdcxx-ng  13.2.0  h7e041cc_2  conda-forge
libtiff  4.6.0  ha9c0a0a_2  conda-forge
libuuid  2.38.1  h0b41bf4_0  conda-forge
libwebp-base  1.3.2  hd590300_0  conda-forge
libxcb  1.15  h0b41bf4_0  conda-forge
libxml2  2.11.5  h232c23b_1  conda-forge
libzlib  1.2.13  hd590300_5  conda-forge
llvmlite  0.40.1  py310h1b8f574_0  conda-forge
lockfile  0.12.2  py_1  conda-forge
matplotlib-base  3.8.1  py310h62c0568_0  conda-forge
matplotlib-inline  0.1.6  pyhd8ed1ab_0  conda-forge
minimap2  2.26  he4a0461_2  bioconda
msgpack-python  1.0.6  py310hd41b1e2_0  conda-forge
munkres  1.1.4  pyh9f0ad1d_0  conda-forge
natsort  8.4.0  pyhd8ed1ab_0  conda-forge
ncurses  6.4  h59595ed_2  conda-forge
numba  0.57.0  py310h0f6aa51_2  conda-forge
numpy  1.23.5  py310h53a5b5f_0  conda-forge
openblas  0.3.24  pthreads_h7a3da1a_0  conda-forge
openjpeg  2.5.0  h488ebb8_3  conda-forge
openssl  3.1.4  hd590300_0  conda-forge
packaging  23.2  pyhd8ed1ab_0  conda-forge
pandas  2.1.2  py310hcc13569_0  conda-forge
parso  0.8.3  pyhd8ed1ab_0  conda-forge
patsy  0.5.3  pyhd8ed1ab_0  conda-forge
pebble  5.0.3  pyhd8ed1ab_0  conda-forge
perl  5.32.1  4_hd590300_perl5  conda-forge
pexpect  4.8.0  pyh1a96a4e_2  conda-forge
pickleshare  0.7.5  py_1003  conda-forge
pillow  10.1.0  py310h01dd4db_0  conda-forge
pip  23.3.1  pyhd8ed1ab_0  conda-forge
platformdirs  3.11.0  pyhd8ed1ab_0  conda-forge
pluggy  1.3.0  pyhd8ed1ab_0  conda-forge
pooch  1.8.0  pyhd8ed1ab_0  conda-forge
prompt-toolkit  3.0.39  pyha770c72_0  conda-forge
prompt_toolkit  3.0.39  hd8ed1ab_0  conda-forge
pthread-stubs  0.4  h36c2ea0_1001  conda-forge
ptyprocess  0.7.0  pyhd3deb0d_0  conda-forge
pure_eval  0.2.2  pyhd8ed1ab_0  conda-forge
pygments  2.16.1  pyhd8ed1ab_0  conda-forge
pynndescent  0.5.10  pyh1a96a4e_0  conda-forge
pyparsing  3.1.1  pyhd8ed1ab_0  conda-forge
pysocks  1.7.1  pyha2e5f31_6  conda-forge
pytest  7.4.3  pyhd8ed1ab_0  conda-forge
python  3.10.0  h543edf9_3_cpython  conda-forge
python-dateutil  2.8.2  pyhd8ed1ab_0  conda-forge
python-tzdata  2023.3  pyhd8ed1ab_0  conda-forge
python_abi  3.10  4_cp310  conda-forge
pytz  2023.3.post1  pyhd8ed1ab_0  conda-forge
readline  8.2  h8228510_1  conda-forge
requests  2.31.0  pyhd8ed1ab_0  conda-forge
rosella  0.5.0  h8e1a5b0_0  bioconda
samtools  1.18  h50ea8bc_1  bioconda
scikit-bio  0.5.8  py310h0a54255_1  conda-forge
scikit-learn  1.1.0  py310hffb9edd_0  conda-forge
scipy  1.11.0  py310ha4c1d20_0  conda-forge
seaborn  0.13.0  hd8ed1ab_0  conda-forge
seaborn-base  0.13.0  pyhd8ed1ab_0  conda-forge
setuptools  68.2.2  pyhd8ed1ab_0  conda-forge
six  1.16.0  pyh6c4a22f_0  conda-forge
sqlite  3.44.0  h2c6b66d_0  conda-forge
stack_data  0.6.2  pyhd8ed1ab_0  conda-forge
starcode  1.4  h031d066_4  bioconda
statsmodels  0.14.0  py310h1f7b6fc_2  conda-forge
tbb  2021.10.0  h00ab1b0_2  conda-forge
threadpoolctl  3.2.0  pyha21a80b_0  conda-forge
tk  8.6.13  noxft_h4845f30_101  conda-forge
tomli  2.0.1  pyhd8ed1ab_0  conda-forge
tqdm  4.66.1  pyhd8ed1ab_0  conda-forge
traitlets  5.13.0  pyhd8ed1ab_0  conda-forge
typing-extensions  4.8.0  hd8ed1ab_0  conda-forge
typing_extensions  4.8.0  pyha770c72_0  conda-forge
tzdata  2023c  h71feb2d_0  conda-forge
umap-learn  0.5.4  py310hff52083_0  conda-forge
unicodedata2  15.1.0  py310h2372a71_0  conda-forge
urllib3  2.0.7  pyhd8ed1ab_0  conda-forge
wcwidth  0.2.9  pyhd8ed1ab_0  conda-forge
wheel  0.41.3  pyhd8ed1ab_0  conda-forge
xorg-libxau  1.0.11  hd590300_0  conda-forge
xorg-libxdmcp  1.1.3  h7f98852_0  conda-forge
xz  5.2.6  h166bdaf_0  conda-forge
zlib  1.2.13  hd590300_5  conda-forge
zstd  1.5.5  hfc55251_0  conda-forge

rhysnewell commented 12 months ago

Hmm, no incorrect version numbers are jumping out at me so this is perplexing. Would you be able to send me the coverm coverages file and assembly you are using for binning and I'll see if I can reproduce the error on my end?

If you can't send the data, then could you rerun your rosella command but add 2>err.out on to the end of it and put the results of that file here? We might be missing a key piece of information in the original error message that will help us solve this

gabrieleghiotto commented 12 months ago

These are the files:
coverm.cov https://drive.google.com/file/d/1TrMKrujsFrbX9YKRqQspr4P5-eJGPVeb/view?usp=drive_web
r1_assembly.fa https://drive.google.com/file/d/1sB2u9N25tisd9sX8Pz4HYLfZoBeIaXqg/view?usp=drive_web

rhysnewell commented 12 months ago

Hi @gabrieleghiotto,

Just wanted to check if these are the right files? The contig ids in the coverm.cov file do not match up with the assembly you provided. Additionally, there are 8179 contigs in the coverm.cov file with a min size of 56 base pairs, and then 198614 contigs in the assembly file. Rosella won't run as these files don't match up at all

gabrieleghiotto commented 12 months ago

In the coverm file I saw 198614 contigs. I re-ran it, and I am attaching the new file to this email: coverm_v2.cov https://drive.google.com/file/d/1gdSwu1BIlRR1PnSucofuvONfoc1FQxyr/view?usp=drive_web

The procedure I used was: assembly with SPAdes, renaming the contigs, generating the index with bowtie2-build, aligning the filtered reads to the assembly with bowtie2, then converting to BAM and sorting. I then launched coverm as follows:

coverm contig --methods metabat --bam-files *.sorted.bam -o coverm_v2.cov --threads 10

Should I try using the reads instead of the alignments?

rhysnewell commented 12 months ago

Okay, my mistake: I had two files called coverm.cov in my downloads folder, which caused an issue on my end.

But I have figured out what is wrong with your file. The contig names in your coverage file do not match up with the contig names in the assembly. The coverage file just has numeric values. I believe this is because you renamed the contigs when performing the mapping.

What you should do is pass the reads directly to rosella or to coverm. Both tools will produce the coverage file correctly for you if you do that. Alternatively, you could just not rename the contigs. Passing reads to rosella handles the calls to coverm and ensures everything is properly formatted for you, so I suggest that method.
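For anyone who wants to test for this kind of mismatch before binning, here is a rough sketch that compares the contig IDs in the two files. It assumes the coverage file is tab-separated with one header row and the contig name in the first column, and that FASTA IDs end at the first whitespace. The function names are made up for illustration and are not part of rosella or coverm.

```python
def fasta_ids(fasta_path):
    """Collect contig IDs: the text after '>' up to the first whitespace."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def coverage_ids(coverage_path):
    """Collect contig IDs from the first column of a tab-separated table with one header row."""
    ids = set()
    with open(coverage_path) as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            line = line.rstrip("\n")
            if line:
                ids.add(line.split("\t")[0])
    return ids


def compare_ids(fasta_path, coverage_path):
    """Return (only_in_assembly, only_in_coverage); IDs are compared as strings."""
    assembly = fasta_ids(fasta_path)
    coverage = coverage_ids(coverage_path)
    return assembly - coverage, coverage - assembly


if __name__ == "__main__":
    missing_cov, missing_asm = compare_ids("r1_assembly.fa", "coverm.cov")
    print(len(missing_cov), "contigs have no coverage row")
    print(len(missing_asm), "coverage rows have no contig in the assembly")
```

Two empty sets mean the names agree as text. That alone does not rule out problems in a tool that later converts numeric-looking names to another type.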

gabrieleghiotto commented 12 months ago

Thank you, I will try this approach. It is strange, however, since I used the assembly with the renamed contigs to create the index for the alignment. I will keep you updated.

rhysnewell commented 12 months ago

You are right, the contig IDs do match up, but I think renaming the contigs to sequential numeric values is causing issues somewhere. I'm trying to pinpoint where exactly, but I would strongly advise against giving contigs such simple names in general. The IDs given by the assembler are generally good enough.

gabrieleghiotto commented 12 months ago

Dear Rhys, thank you. This is a step of a metagenomic pipeline that we developed in our lab, in which we rename contigs and remove those with length lower than

  1. This avoids issues in the downstream analysis. But I guess I can try. Thanks again.

rhysnewell commented 12 months ago

Hi @gabrieleghiotto ,

I actually managed to pinpoint the issue. Python/pandas was reading in the coverage file and setting the type of the contigName column to be an integer whilst the code was operating as if it were a string. Contig IDs would be read in directly from the assembly as strings, and then comparisons to the coverage file would fail.
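To illustrate the failure mode, here is a toy reproduction in plain Python. It is not the flight code: parse_value is a hypothetical stand-in for a loader that infers column types, and the three coverage rows are made up.

```python
def parse_value(text):
    """Mimic a type-inferring loader: all-digit text becomes an int."""
    return int(text) if text.isdigit() else text


# Contig IDs read from a FASTA file are always strings.
assembly_ids = ["1", "2", "3"]
# Made-up coverage rows: contig name, length, mean depth.
coverage_rows = ["1\t1500\t12.4", "2\t2300\t8.1", "3\t900\t3.3"]

# Type-inferring load: the contig names silently become integers.
inferred = {parse_value(row.split("\t")[0]) for row in coverage_rows}
matched_inferred = [cid for cid in assembly_ids if cid in inferred]

# Keeping the column as text makes the comparison behave as intended.
as_text = {row.split("\t")[0] for row in coverage_rows}
matched_text = [cid for cid in assembly_ids if cid in as_text]

print(matched_inferred)  # [] because "1" != 1
print(matched_text)      # ['1', '2', '3']
```

With pandas, one way to keep such a column as text is to pass dtype={"contigName": str} to read_csv. Whether flight 1.6.1 fixes it exactly that way is not stated here.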

If you run pip install flight-genome==1.6.1 from within your rosella conda environment, it should install a version of flight that fixes your issue, and you will be able to run your binning as normal. I would still advise amending your in-house pipeline to generate more informative contig headers, perhaps containing sample ID and length information. Contig headers should aim to be unique, as that makes comparison across samples much easier.
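As a rough illustration of what more informative headers could look like, the sketch below rewrites a FASTA file so that each header carries a sample ID, a running index and the contig length. The naming scheme sample_contigN_lenL and the function name are hypothetical, not a convention rosella requires.

```python
def rename_contigs(fasta_in, fasta_out, sample_id, width=60):
    """Rewrite headers as <sample_id>_contig<N>_len<L> and return the number of contigs."""

    def write_record(out, index, seq_lines):
        sequence = "".join(seq_lines)
        out.write(f">{sample_id}_contig{index}_len{len(sequence)}\n")
        # Wrap the sequence at a fixed width for readability.
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")

    count = 0
    seq_lines = []
    with open(fasta_in) as src, open(fasta_out, "w") as out:
        for line in src:
            line = line.strip()
            if line.startswith(">"):
                if count:
                    write_record(out, count, seq_lines)  # flush the previous contig
                count += 1
                seq_lines = []
            elif line:
                seq_lines.append(line)
        if count:
            write_record(out, count, seq_lines)  # flush the last contig
    return count
```

Because the original header is discarded, it is worth keeping a mapping of old to new names if other tools have already seen the old ones.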

I've tested the fix on your assembly and it generated the results as expected, so I'll go ahead and close this issue now

gabrieleghiotto commented 12 months ago

Dear Rhys, after updating flight everything is working fine. I have a question, however: in the output folder there is a set of bins called rosella_bin_X.fa and another set called rosella_refined_0_X.fna, where X is the sequential number assigned by the software. Which is the final set of bins that I should consider? I also want to point out that the Documentation section lacks example commands, but I imagine it is a work in progress.

rhysnewell commented 12 months ago

All of the bins together make up the final set. Some bins were unchanged during the initial refining process, so they retain their original bin name.

Documentation is indeed a work in progress

gabrieleghiotto commented 12 months ago

Thanks, but the bin names are identical apart from the refined_0_ string. So, from what you are saying, rosella_refined_100.fna and rosella_bin_100.fna are not the same bin, right? I ask because with metabat1, metabat2 and vamb I obtained between 140 and 150 bins, whereas the rosella bins add up to 264, which seems a bit strange.

rhysnewell commented 12 months ago

You can check for duplicate contigs by running the following in the output directory:

grep -h '>' *.fna | sort | uniq -d

As long as that grep command is correct (I'm working from memory) and it only displays the contig name without the file path or file name, the pipeline will sort the names and display only those contigs that appear multiple times.
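The same check can be sketched in Python, which also reports the files in which a duplicated contig occurs. This is an illustrative helper, not part of rosella. It accepts several glob patterns because both .fa and .fna bin names come up in this thread.

```python
import glob
from collections import defaultdict


def duplicated_contigs(*patterns):
    """Map each contig header that occurs more than once to the files containing it."""
    # Collect matching paths once, even if the patterns overlap.
    paths = sorted({path for pattern in patterns for path in glob.glob(pattern)})
    seen = defaultdict(list)
    for path in paths:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    seen[line[1:].strip()].append(path)
    return {name: files for name, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    for name, files in duplicated_contigs("*.fna", "*.fa").items():
        print(name, "appears in", ", ".join(files))
```

No output means every contig header is unique across the matched bins, which is the same conclusion as an empty result from uniq -d.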

gabrieleghiotto commented 12 months ago

No contig is displayed, so I can confirm that they are all different, as you said. From a user perspective, I would suggest adding a renaming step at the end, otherwise the output looks confusing at first glance. Thanks again for all the effort and the thorough support.
