mdsufz / MuDoGeR

MuDoGeR makes the recovery of genomes from prokaryotes, viruses, and eukaryotes from metagenomes easy.
GNU General Public License v3.0

Collection of issues during testing #15

Closed mherold1 closed 2 years ago

mherold1 commented 2 years ago

Hi, thanks for making this available; it seems like a very useful pipeline. Over the last few days I tested it a bit, mainly for running viral classification, and wanted to share some issues I encountered. Since you released a new version earlier today, I'd like to mention that all of this relates to v1.0. Once my running tests have finished I will try to update. On that note, what would be the best way to update? Pull the repository and rerun the installation script?

installation

some tools did not install correctly and had to be fixed individually

khmer didn't install

conda activate /mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/khmer_env
pip install khmer==2.1.1

java missing in vcontact step

/bin/sh: 1: java: not found
ERROR:vcontact2: Error in contig clustering

installed openjdk on system

maxbin2 dependencies

Can't locate LWP/Simple.pm ....

conda activate /mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/metawrap_env
conda install -c conda-forge -c bioconda maxbin2=2.2.* perl=5.26
maxbin2                                  2.2.6-h14c3975_0 --> 2.2.7-h87f3376_4
# this solved it updating maxbin2 probably not needed
conda install -f -c conda-forge -c bioconda perl-lwp-simple

perl /mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/metawrap_env/bin/run_MaxBin.pl -h

prokka dependencies

Can't locate XML::simple ...

solved by updating the prokka_env conda environment: conda update --all

databases

GTDBTK_DATA_PATH not set

conda activate /mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/gtdbtk_env
conda env config vars set GTDBTK_DATA_PATH=./Databases/gtdbtk/release207_v2/
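
A quick sanity check can confirm the variable is usable once the environment has been reactivated (conda applies `config vars` only on the next activation). This is a minimal sketch, not part of MuDoGeR: the helper name `check_gtdbtk_path` is hypothetical, and it assumes an absolute path is preferable, since a relative one such as ./Databases/... only resolves from one working directory.

```shell
# Hypothetical helper: verify that GTDBTK_DATA_PATH is set, absolute, and exists.
check_gtdbtk_path() {
    p="${GTDBTK_DATA_PATH:-}"
    if [ -z "$p" ]; then
        echo "GTDBTK_DATA_PATH is not set" >&2
        return 1
    fi
    case "$p" in
        /*) ;;  # absolute path, nothing to do
        *)  echo "GTDBTK_DATA_PATH is relative: $p" >&2
            return 2 ;;
    esac
    if [ ! -d "$p" ]; then
        echo "GTDBTK_DATA_PATH does not exist: $p" >&2
        return 3
    fi
    echo "GTDBTK_DATA_PATH looks usable: $p"
}
```

Running `check_gtdbtk_path` inside the activated gtdbtk_env returns a non-zero status and a short message for each of the three failure cases.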

Virsorter setup

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/virsorter2_env/db/group
conda activate /mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/virsorter2_env
virsorter setup -d /mnt/RAID5/tools/miniconda3/envs/mudoger_env/dependencies/conda/envs/virsorter2_env/db -j4

checkm

IOError: [Errno 2] No such file or directory: u'/mnt/RAID6/Databases/mudoger_databasescheckm/hmms/phylo.hmm'

add a trailing slash to DATABASE_LOCATION in bin/databases.sh
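
The broken path in the error (mudoger_databasescheckm) is what plain string concatenation produces when the root has no trailing slash. The sketch below reproduces that and shows one way to make the join independent of the slash. It assumes the scripts build paths by concatenation, which is only inferred from the error message; `join_db_path` is a hypothetical helper, not a MuDoGeR function.

```shell
# Hypothetical helper: join a database root and a subpath with exactly one slash.
join_db_path() {
    root="${1%/}"                 # drop one trailing slash, if present
    printf '%s/%s\n' "$root" "$2"
}

DATABASE_LOCATION=/mnt/RAID6/Databases/mudoger_databases   # no trailing slash

# Plain concatenation reproduces the broken path from the error message
echo "${DATABASE_LOCATION}checkm/hmms/phylo.hmm"

# A normalised join gives the same result with or without the trailing slash
join_db_path "$DATABASE_LOCATION" checkm/hmms/phylo.hmm
join_db_path "$DATABASE_LOCATION/" checkm/hmms/phylo.hmm
```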

running

module 1 preprocessing

mudoger --module preprocess --meta metadata_small.tsv -o small_test -t 20 -m 1000

metawrap naming convention...
leaving out the -m parameter -> stuck at mudoger preprocess
problems with gzipped read files?

module 2 - prokaryotes

mudoger --module prokaryotes --meta metadata_small.tsv -o small_test -t 20 

not enough ram for pplacer?

------------------------------------------------------------------------------------------------------------------------
-----            There is 10 RAM and 20 threads available, and each pplacer thread uses >40GB, so I will           -----
-----                                          use 0 threads for pplacer                                           -----
------------------------------------------------------------------------------------------------------------------------

checkm still runs but is very slow. Should I have specified -m in the command?
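
The numbers in that message are consistent with an integer division of the available RAM by 40 GB per pplacer thread. The sketch below only illustrates that arithmetic. It is an assumption about how the value is derived, not the pipeline's actual code, and `pplacer_threads` is a hypothetical name.

```shell
# Hypothetical helper: how many pplacer threads fit into the given RAM (in GB),
# assuming 40 GB per thread and capping at the number of available threads.
pplacer_threads() {
    ram_gb=$1
    threads=$2
    n=$(( ram_gb / 40 ))          # integer division rounds down
    if [ "$n" -gt "$threads" ]; then
        n=$threads
    fi
    echo "$n"
}

pplacer_threads 10 20      # 10 GB: not even one thread fits
pplacer_threads 1000 20    # 1000 GB: 25 would fit, capped at the 20 threads
```

Under this reading, a run started without -m would fall back to whatever RAM value the pipeline assumes (10 in the message above), which leaves no room for pplacer.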

problem with pplacer during GTDBtk step

==> Step 1 of 9: Starting pplacer.Uncaught exception: Sys_error("/mnt/RAID6/Databases/mudoger_databases/gtdbtk/release207/split/backbone/pplacer/gtdbtk_package_backbone.refpkg: No such file or directory")
Fatal error: exception Sys_error("/mnt/RAID6/Databases/mudoger_databases/gtdbtk/release207/split/backbone/pplacer/gtdbtk_package_backbone.refpkg: No such file or directory")

this directory is in release207/split/high/pplacer, and other files are missing as well. Database version r207 should fit gtdbtk version 2.1.1. gtdbtk test runs through successfully.

module 3 viruses

Here I tested the individual module with existing assemblies, so without running module1 and 2 prior.

vibrant final output file empty

when running the viruses module command separately:

bash -i /mnt/RAID5/tools/miniconda3/envs/mudoger_env/bin/mudoger-module-3.sh -1 test_1.fastq -2 test_2.fastq -a test_assembly.fa -o test_output -t 20

I had to adapt the script: /mnt/RAID5/tools/miniconda3/envs/mudoger_env/bin/mudoger-module-3-1_viral-investigation.sh from:

cat "$output_folder"/vibrant/VIBRANT_final_assembly/VIBRANT_phages_final_assembly/final_assembly.phages_combined.fna |
grep ">" | sed "s/_fragment_1//g;s/>//g"   > "$output_folder"/vibrant_filtered_data.txt

to this:

st=$(basename "$assembly")
assembly_name=${st%.f*}   # strip the extension (.fa, .fasta, .fna)
if [ -f "$output_folder"/vibrant/VIBRANT_${assembly_name}/VIBRANT_phages_${assembly_name}/${assembly_name}.phages_combined.fna ];
[......]
cat "$output_folder"/vibrant/VIBRANT_${assembly_name}/VIBRANT_phages_${assembly_name}/${assembly_name}.phages_combined.fna | grep ">" | sed "s/_fragment_1//g;s/>//g"   > "$output_folder"/vibrant_filtered_data.txt

alternatively and probably better, I should have renamed the input assembly file to final_assembly.fa :)
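
The renaming route can be done without touching the original file by linking it under the expected name. This is a minimal sketch, assuming the viruses module only needs the assembly to be called final_assembly.fa. The helper `link_as_final_assembly` and the work directory argument are hypothetical.

```shell
# Hypothetical helper: expose an assembly as <workdir>/final_assembly.fa
# through a symlink, leaving the original file untouched.
link_as_final_assembly() {
    assembly=$1
    workdir=$2
    mkdir -p "$workdir"
    # Resolve to an absolute path so the link stays valid from any directory
    abs="$(cd "$(dirname "$assembly")" && pwd)/$(basename "$assembly")"
    ln -sf "$abs" "$workdir/final_assembly.fa"
    echo "$workdir/final_assembly.fa"
}

# Example (paths are placeholders):
#   a=$(link_as_final_assembly test_assembly.fa linked_input)
#   ... then pass "$a" to the -a option of the module script
```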

misc

On large(r) assemblies, virfinder and virsorter are unfortunately really slow (I tested 5.5M contigs: virsorter took ~24h; virfinder had processed 2.5M after 4 days when I stopped it). Would it be good to include a filtering step before viral classification that discards short contigs (as Vibrant does) or those already assigned to bacteria? Or would this be covered by the previous modules 1 and/or 2?
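
Such a length filter could be applied to the assembly before it is handed to the viruses module. The following is a sketch using awk, not something MuDoGeR provides: `filter_contigs_by_length` is a hypothetical helper, and the cut-off is left to the caller.

```shell
# Hypothetical helper: print only FASTA records whose sequence has at least
# min_len bases. Multi-line sequences are joined before measuring.
filter_contigs_by_length() {
    min_len=$1
    fasta=$2
    awk -v min="$min_len" '
        function emit() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                          # accumulate sequence lines
        END  { emit() }                                # do not lose the last record
    ' "$fasta"
}

# Example (file names and the 1000 bp threshold are placeholders):
#   filter_contigs_by_length 1000 test_assembly.fa > test_assembly.min1000.fa
```

Note that the output writes each sequence on a single line, so the file is not byte-identical to the input even for contigs that pass.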

In one of the later stages of the viruses module I get this error repeatedly (probably for every contig):

cat: /mnt/mudoger_workspace/2022/TESTS/test-5/SRR3138838/viruses/taxonomy/vcontact-output/genome_by_genome_overview.csv: No such file or directory

This seems like an old path still included somewhere, or it is related to all steps that require output from module 2 failing.

JotaKas commented 2 years ago

Thank you very much for your effort, @mherold1.

We addressed some of your problems in version 1.0.1. We will systematically go through your issues and try to replicate the problems. A response will come as soon as possible.

mherold1 commented 2 years ago

Thanks, the only unsolved problem currently is with the gtdbtk step in module 2.

JotaKas commented 2 years ago

Great.

Version 1.0.1 added the new database version from GTDB-tk. If you delete your older gtdb-tk database and run the database-setup.sh script again, it should configure the new database.

You should have the gtdbtk/release207_v2/ in your system.

Let me know how it goes.

Thank you

LaizaFaria commented 2 months ago

Hi @JotaKas. I've been using MuDoGeR version 1.0.1, and it has been very practical so far. However, as mentioned above by @mherold1, I'm encountering issues with the GTDB database installation. The folder for this database appears to be empty, whereas the other databases were installed without any problems. I installed MuDoGeR using Miniconda. Could you please advise on how to resolve this issue?

JotaKas commented 2 months ago

Hey @LaizaFaria,

I guess the quickest solution for you is simply to follow the instructions from the GTDB developers. The only MuDoGeR requirement is to have "gtdbtk/release207_v2/" (and the files associated with the release you are downloading).

Therefore, for release 214 you can

cd /path/to/your/database/folder
mkdir gtdbtk
cd ./gtdbtk
wget https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
tar xvzf gtdbtk_r214_data.tar.gz

Then make sure the folder inside the gtdbtk folder is named something like release###/
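
A small check along these lines can confirm the layout after extraction, and would also catch an empty gtdbtk folder. It is only a sketch: `find_gtdbtk_release` is a hypothetical helper, and it looks for nothing more than a release* directory directly under gtdbtk/.

```shell
# Hypothetical helper: list release* folders directly inside <db_root>/gtdbtk
# and return non-zero if there are none.
find_gtdbtk_release() {
    db_root=$1
    found=1
    for d in "$db_root"/gtdbtk/release*/; do
        [ -d "$d" ] || continue     # the glob stays literal when nothing matches
        echo "$d"
        found=0
    done
    return "$found"
}

# Example (the path is a placeholder):
#   find_gtdbtk_release /path/to/your/database/folder || echo "no release folder found"
```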

LaizaFaria commented 2 months ago

Thank you for your response! I noticed a more recent version of GTDB-Tk available, version 220. Would there be any issues with using this newer version?