phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0

Question for the "--genome_filter_db_prefix (-g)" flag #69

Closed. iaposto closed this issue 3 years ago

iaposto commented 4 years ago

Hello,

I downloaded the prebuilt database of closed Enterobacteriaceae genomes to use with the --genome_filter_db_prefix (-g) flag when reconstructing E. coli plasmids. Following the example in README.md, I used "-g /2019-11-NCBI-Enterobacteriacea-Chromosomes/2019-11-NCBI-Enterobacteriacea-Chromosomes.fasta" in my script. The program's output says "Genome filter sequences provided", followed by "No close genome matches found". My questions are:

  1. How can there be no close genome matches when my inputs are E. coli genomes?
  2. This fasta file seems rather small compared to the size of the whole database. Do we have to use this database in another way?

Thanks in advance, Ilias

jrober84 commented 4 years ago

By default, the genome filter is deliberately restrictive: it only looks for genomes within a Mash distance of 0.002. You can raise this threshold to something like 0.05 if you want it to be more permissive. The fasta file included in the archive shouldn't be there, and the tool doesn't need it to run. The BLAST database is all that is needed; if you want the original fasta sequences, you can regenerate them from the BLAST indexes. I will update the archive to remove the erroneous fasta file.
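To illustrate why a 0.002 cutoff can return nothing even for genomes of the same species, here is a minimal sketch of threshold-based filtering. It is not MOB-suite's actual code: the function name `close_genomes`, the inclusive comparison, and the distance values are all hypothetical and chosen only for illustration.

```python
def close_genomes(distances, threshold=0.002):
    """Return IDs of genomes whose Mash distance is at or below the threshold.

    `distances` maps a genome ID to its Mash distance from the query assembly.
    The inclusive comparison (<=) is an assumption of this sketch.
    """
    return [gid for gid, dist in distances.items() if dist <= threshold]


# Hypothetical distances, not real measurements.
example = {"genome_A": 0.004, "genome_B": 0.03, "genome_C": 0.12}

strict = close_genomes(example)                       # default cutoff of 0.002
permissive = close_genomes(example, threshold=0.05)   # relaxed cutoff

print(strict)
print(permissive)
```

With these made-up values the strict cutoff keeps no genome at all, while the relaxed cutoff keeps the two nearest ones, which mirrors the "No close genome matches found" message in the question.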

iaposto commented 4 years ago

Thanks for your reply. If the fasta file is not necessary to run the tool, which file should I point the -g flag to so that the tool uses the prebuilt database? There are 7 .nsq files.

jrober84 commented 3 years ago

Sorry for the late reply. The -g flag just needs the path and prefix of the databases, so the command you listed above is all that is needed.
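For readers unsure what "path and prefix" means here: a multi-volume BLAST nucleotide database is stored as files such as `name.00.nsq`, `name.01.nsq`, and so on, and the prefix is the path with the volume number and extension removed. The sketch below derives that prefix from an `.nsq` file name. The helper `prefix_from_nsq` is hypothetical and not part of MOB-suite, and the volume file names in the example are an assumption based on the usual BLAST naming scheme, not a listing of the actual archive.

```python
import re

# Matches an optional two- or three-digit volume number followed by ".nsq",
# e.g. ".00.nsq" or just ".nsq" for a single-volume database.
_VOLUME_SUFFIX = re.compile(r"(\.\d{2,3})?\.nsq$")


def prefix_from_nsq(nsq_path):
    """Strip the volume number and .nsq extension to get the database prefix."""
    return _VOLUME_SUFFIX.sub("", nsq_path)


# Assumed volume names for the prebuilt database; the real archive may differ.
db_dir = "/2019-11-NCBI-Enterobacteriacea-Chromosomes"
volumes = [
    f"{db_dir}/2019-11-NCBI-Enterobacteriacea-Chromosomes.fasta.{i:02d}.nsq"
    for i in range(7)
]

# All volumes of one database collapse to a single prefix.
prefixes = sorted({prefix_from_nsq(v) for v in volumes})
print(prefixes)
```

Under that naming assumption, the seven `.nsq` files share one prefix that ends in `.fasta`, so the value originally passed to -g is the database prefix and does not refer to the stray fasta file.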