motu-tool / mOTUs

motus - a tool for marker gene-based OTU (mOTU) profiling
GNU General Public License v3.0

Own Database #47

Closed EllenOldenburg closed 4 years ago

EllenOldenburg commented 4 years ago

Is it possible to generate our own database if we want to use a very specific dataset with particular marker genes?

AlessioMilanese commented 4 years ago

Hi @EllenOldenburg,

Thanks for your interest in mOTUs!

To use your own genes, you need to change some files in the directory db_mOTU (created by setup.py).

Before going there, it needs to be clear what the difference is between genes (often referred to as MGs, for marker genes), marker gene clusters (MGCs), and mOTUs:

AlessioMilanese commented 4 years ago

Here is a description of the files in the database:

| File name | Description | Needed to run motus profile? |
| --- | --- | --- |
| README | Information about the database | no |
| db_mOTU_DB_CEN.fasta | Padded gene sequences used for motus snv_call. It contains only the centroid sequences (one sequence per MGC). | no |
| db_mOTU_DB_CEN.fasta.amb | Created by bwa index db_mOTU_DB_CEN.fasta | no |
| db_mOTU_DB_CEN.fasta.ann | Created by bwa index db_mOTU_DB_CEN.fasta | no |
| db_mOTU_DB_CEN.fasta.annotations | Custom file to run meta_snv | no |
| db_mOTU_DB_CEN.fasta.bwt | Created by bwa index db_mOTU_DB_CEN.fasta | no |
| db_mOTU_DB_CEN.fasta.pac | Created by bwa index db_mOTU_DB_CEN.fasta | no |
| db_mOTU_DB_CEN.fasta.sa | Created by bwa index db_mOTU_DB_CEN.fasta | no |
| db_mOTU_DB_NR.fasta | Padded gene sequences used for motus profile. NR stands for "non-redundant". It contains all the gene sequences used for mapping reads and then evaluating species abundances. | yes |
| db_mOTU_DB_NR.fasta.amb | Created by bwa index db_mOTU_DB_NR.fasta | yes |
| db_mOTU_DB_NR.fasta.ann | Created by bwa index db_mOTU_DB_NR.fasta | yes |
| db_mOTU_DB_NR.fasta.bwt | Created by bwa index db_mOTU_DB_NR.fasta | yes |
| db_mOTU_DB_NR.fasta.pac | Created by bwa index db_mOTU_DB_NR.fasta | yes |
| db_mOTU_DB_NR.fasta.sa | Created by bwa index db_mOTU_DB_NR.fasta | yes |
| db_mOTU_MAP_MGCs_to_mOTUs.tsv | File that maps the MGCs to the mOTUs | yes |
| db_mOTU_MAP_MGCs_to_mOTUs_in-line.tsv | File that maps the MGCs to the mOTUs (different format from db_mOTU_MAP_MGCs_to_mOTUs.tsv) | yes |
| db_mOTU_MAP_genes_to_MGCs.tsv | File that maps genes to MGCs | yes |
| db_mOTU_bam_header_CEN | Header of the SAM file that is produced by bwa mem on db_mOTU_DB_CEN.fasta | no |
| db_mOTU_bam_header_NR | Header of the SAM file that is produced by bwa mem on db_mOTU_DB_NR.fasta | yes |
| db_mOTU_genes_length_NR | File with info about the gene length | yes |
| db_mOTU_padding_coordinates_CEN.tsv | Padding coordinates for the genes in db_mOTU_DB_CEN.fasta | no |
| db_mOTU_padding_coordinates_NR.tsv | Padding coordinates for the genes in db_mOTU_DB_NR.fasta | yes |
| db_mOTU_taxonomy_CAMI.tsv | Taxonomy file used for CAMI output | only for -C |
| db_mOTU_taxonomy_meta-mOTUs.tsv | Taxonomy of the meta-mOTUs | yes |
| db_mOTU_taxonomy_ref-mOTUs.tsv | Taxonomy of the ref-mOTUs | yes |
| db_mOTU_taxonomy_ref-mOTUs_short_names.tsv | Taxonomy of the ref-mOTUs (short version of the species name) | yes |
| db_mOTU_test | Directory with test samples used by motus profile --test | no |
| db_mOTU_versions | Version of database and scripts | no |

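As a quick sanity check after editing the database, something like the following could list which of the files marked "yes" above are missing from a db_mOTU directory. This is only a sketch: the file names are copied from the table, while the function name `missing_profile_files` and the idea of checking by file presence alone are assumptions for illustration, not part of mOTUs itself.

```python
from pathlib import Path

# Files the table above marks as needed for `motus profile`.
# Names are copied from the table; nothing here checks file contents.
REQUIRED_FOR_PROFILE = [
    "db_mOTU_DB_NR.fasta",
    "db_mOTU_DB_NR.fasta.amb",
    "db_mOTU_DB_NR.fasta.ann",
    "db_mOTU_DB_NR.fasta.bwt",
    "db_mOTU_DB_NR.fasta.pac",
    "db_mOTU_DB_NR.fasta.sa",
    "db_mOTU_MAP_MGCs_to_mOTUs.tsv",
    "db_mOTU_MAP_MGCs_to_mOTUs_in-line.tsv",
    "db_mOTU_MAP_genes_to_MGCs.tsv",
    "db_mOTU_bam_header_NR",
    "db_mOTU_genes_length_NR",
    "db_mOTU_padding_coordinates_NR.tsv",
    "db_mOTU_taxonomy_meta-mOTUs.tsv",
    "db_mOTU_taxonomy_ref-mOTUs.tsv",
    "db_mOTU_taxonomy_ref-mOTUs_short_names.tsv",
]


def missing_profile_files(db_dir):
    """Return the required file names that are not present in db_dir.

    Hypothetical helper: it only tests that each file exists.
    """
    db_dir = Path(db_dir)
    return [name for name in REQUIRED_FOR_PROFILE if not (db_dir / name).is_file()]
```

Calling `missing_profile_files("db_mOTU")` returns an empty list when every file is in place; an empty result does not guarantee that the files are mutually consistent.
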
AlessioMilanese commented 4 years ago

Some notes:

AlessioMilanese commented 4 years ago

I will close the issue. Feel free to re-open it if you need more help.

shihuang047 commented 3 years ago

Hello Alessio, I'm now trying to construct the custom mOTU2 database with a reference genome database with 170k microbial genomes (including bacteria, archaea, fungi).

  1. Fetch markers (MGs) from I did the marker extraction using fetchMG.pl. For each genome, we got a set of fna/faa files (N=40). Now we can concatenate them into one fna/faa per genome. Right?
  2. We don't quite understand how to derive the mOTUs from the MGs. Do we have to follow the same procedure described in the paper? For example, would we calculate pairwise global nucleotide identities across all genomes for each of the 40 MGs using vsearch, calculate the genome-to-genome distances to generate mOTUs, and then assign taxonomic annotations to the representative genome of each mOTU?
  3. Thanks for giving us a concrete protocol to update the db_mOTU_DB folder. However, I still don't quite understand whether a customized db_mOTU_DB_NR.fasta can be generated using the protocol above. Can you provide any practical code for this purpose? Thank you so much!

Shi

AlessioMilanese commented 3 years ago

Hi Shi,

I think this would make things easier: https://github.com/AlessioMilanese/read_counter

> Fetch markers (MGs) from I did the marker extraction using fetchMG.pl. For each genome, we got a set of fna/faa files (N=40). Now we can concatenate them into one fna/faa per genome. Right?

You have to concatenate all fna files into a single file. Make sure the headers of the FASTA file have proper names.
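A minimal sketch of that concatenation step, in Python. The thread does not say what a "proper" header looks like, so this example assumes it means a unique identifier with no trailing description: each header is cut at the first whitespace, and the script stops if two sequences end up with the same name. The function name `concatenate_fna`, the `*.fna` pattern, and the directory layout are all hypothetical.

```python
from pathlib import Path


def concatenate_fna(input_dir, output_path, pattern="*.fna"):
    """Merge all FASTA files matching `pattern` under `input_dir` into one file.

    Assumption (not from the thread): a header is kept up to its first
    whitespace and must be unique across all input files.
    Returns the number of sequences written.
    """
    output_path = Path(output_path)
    # Collect inputs before creating the output, and never read the output itself.
    files = sorted(
        p for p in Path(input_dir).rglob(pattern)
        if p.resolve() != output_path.resolve()
    )
    seen = set()
    with open(output_path, "w") as out:
        for fna in files:
            with open(fna) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    if line.startswith(">"):
                        tokens = line[1:].split()
                        if not tokens:
                            raise ValueError(f"Empty header in {fna}")
                        seq_id = tokens[0]
                        if seq_id in seen:
                            raise ValueError(f"Duplicate header '{seq_id}' in {fna}")
                        seen.add(seq_id)
                        out.write(f">{seq_id}\n")
                    else:
                        out.write(line.rstrip("\n") + "\n")
    return len(seen)
```

For example, `concatenate_fna("fetchMG_output", "all_markers.fna")` would walk a (hypothetical) `fetchMG_output` directory recursively. Failing on duplicate names is a deliberate choice here: silently merging two genes under one header would be hard to notice later.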

Another note about:

> with 170k microbial genomes (including bacteria, archaea, fungi)

You might want to check the fungi, because the 40 MGs will probably not be called there (fetchMG is designed to find genes in Bacteria and Archaea).