Own Database

EllenOldenburg commented 4 years ago

Is it possible to generate our own database if we want to use a very specific dataset with particular marker genes?

AlessioMilanese commented 4 years ago

Hi @EllenOldenburg,

Thanks for your interest in mOTUs!

In order to use your own genes, you need to change some files in the directory db_mOTU (created by setup.py).

Before going there, it needs to be clear what the difference is between genes (often referred to as MGs, for marker genes), marker gene clusters (MGCs) and mOTUs:

an MG, or gene, refers to the sequence of only one gene (from one genome);
an MGC is a cluster of genes;
a mOTU is a cluster of MGCs. Note that there are at most 10 MGCs in one mOTU (one per COG "family").
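
The three levels above can be pictured as two child-to-parent mappings. The following is only a toy sketch: all identifiers (`genomeA.COG0012`, `COG0012.mgc_1`, `motu_1`) are made up for illustration, and the assumption that an MGC name starts with its COG is hypothetical, not the naming used in the real database files.

```python
from collections import defaultdict

# Hypothetical toy identifiers; the real database uses its own naming.
gene_to_mgc = {
    "genomeA.COG0012": "COG0012.mgc_1",
    "genomeB.COG0012": "COG0012.mgc_1",
    "genomeA.COG0016": "COG0016.mgc_1",
}
mgc_to_motu = {
    "COG0012.mgc_1": "motu_1",
    "COG0016.mgc_1": "motu_1",
}

MAX_MGCS_PER_MOTU = 10  # "at most 10 MGCs in one mOTU", as stated in the thread


def group_by_parent(mapping):
    """Invert a child -> parent mapping into parent -> sorted list of children."""
    groups = defaultdict(list)
    for child, parent in mapping.items():
        groups[parent].append(child)
    return {parent: sorted(children) for parent, children in groups.items()}


def find_violations(motu_members):
    """Report mOTUs with more than 10 MGCs or with two MGCs from the same COG.

    Assumes (hypothetically) that an MGC identifier looks like "<COG>.<suffix>".
    """
    problems = []
    for motu, mgcs in motu_members.items():
        cogs = [mgc.split(".")[0] for mgc in mgcs]
        if len(mgcs) > MAX_MGCS_PER_MOTU:
            problems.append(f"{motu}: more than {MAX_MGCS_PER_MOTU} MGCs")
        if len(set(cogs)) != len(cogs):
            problems.append(f"{motu}: more than one MGC for the same COG")
    return problems


mgc_members = group_by_parent(gene_to_mgc)    # MGC  -> genes
motu_members = group_by_parent(mgc_to_motu)   # mOTU -> MGCs
```

Here `mgc_members` groups genes into MGCs and `motu_members` groups MGCs into mOTUs; `find_violations(motu_members)` returns an empty list when every mOTU respects the "one MGC per COG, at most 10" rule.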

AlessioMilanese commented 4 years ago

Here is a description of the files in the database:

File name	Description	needed to run `motus profile`?
README	Information about the database	no
db_mOTU_DB_CEN.fasta	Padded gene sequences used for `motus snv_call`. It contains only the centroid sequences (so one sequence per MGC).	no
db_mOTU_DB_CEN.fasta.amb	Created by `bwa index db_mOTU_DB_CEN.fasta`	no
db_mOTU_DB_CEN.fasta.ann	Created by `bwa index db_mOTU_DB_CEN.fasta`	no
db_mOTU_DB_CEN.fasta.annotations	Custom file to run `meta_snv`	no
db_mOTU_DB_CEN.fasta.bwt	Created by `bwa index db_mOTU_DB_CEN.fasta`	no
db_mOTU_DB_CEN.fasta.pac	Created by `bwa index db_mOTU_DB_CEN.fasta`	no
db_mOTU_DB_CEN.fasta.sa	Created by `bwa index db_mOTU_DB_CEN.fasta`	no
db_mOTU_DB_NR.fasta	Padded gene sequences used for `motus profile`. `NR` stands for "non-redundant". It contains all the gene sequences used for mapping reads and then evaluating species abundances.	yes
db_mOTU_DB_NR.fasta.amb	Created by `bwa index db_mOTU_DB_NR.fasta`	yes
db_mOTU_DB_NR.fasta.ann	Created by `bwa index db_mOTU_DB_NR.fasta`	yes
db_mOTU_DB_NR.fasta.bwt	Created by `bwa index db_mOTU_DB_NR.fasta`	yes
db_mOTU_DB_NR.fasta.pac	Created by `bwa index db_mOTU_DB_NR.fasta`	yes
db_mOTU_DB_NR.fasta.sa	Created by `bwa index db_mOTU_DB_NR.fasta`	yes
db_mOTU_MAP_MGCs_to_mOTUs.tsv	File that maps the MGCs to the mOTUs	yes
db_mOTU_MAP_MGCs_to_mOTUs_in-line.tsv	File that maps the MGCs to the mOTUs (different format than `db_mOTU_MAP_MGCs_to_mOTUs.tsv`)	yes
db_mOTU_MAP_genes_to_MGCs.tsv	File that maps genes to MGCs	yes
db_mOTU_bam_header_CEN	Header of the SAM file that is produced by `bwa mem` on `db_mOTU_DB_CEN.fasta`	no
db_mOTU_bam_header_NR	Header of the SAM file that is produced by `bwa mem` on `db_mOTU_DB_NR.fasta`	yes
db_mOTU_genes_length_NR	File with info about the gene length	yes
db_mOTU_padding_coordinates_CEN.tsv	Padding coordinates for the genes in `db_mOTU_DB_CEN.fasta`	no
db_mOTU_padding_coordinates_NR.tsv	Padding coordinates for the genes in `db_mOTU_DB_NR.fasta`	yes
db_mOTU_taxonomy_CAMI.tsv	Taxonomy file used for CAMI output	only for `-C`
db_mOTU_taxonomy_meta-mOTUs.tsv	Taxonomy of the meta-mOTUs	yes
db_mOTU_taxonomy_ref-mOTUs.tsv	Taxonomy of the ref-mOTUs	yes
db_mOTU_taxonomy_ref-mOTUs_short_names.tsv	Taxonomy of the ref-mOTUs (specify short version of the species name)	yes
db_mOTU_test	Directory with test samples used by `motus profile --test`	no
db_mOTU_versions	Version of database and scripts	no
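
Several of the files in the table are described as created by `bwa index`. A minimal sketch of how one might regenerate them for a custom FASTA from Python is shown here. It assumes `bwa` is installed and on the PATH; the helper names (`bwa_index_command`, `expected_index_files`, `build_index`) are made up for this example and are not part of mOTUs.

```python
import shutil
import subprocess
from pathlib import Path

# Index files listed in the table as "Created by `bwa index ...`".
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def bwa_index_command(fasta):
    """Return the command line used to index a FASTA file with bwa."""
    return ["bwa", "index", str(fasta)]


def expected_index_files(fasta):
    """Return the index files that should sit next to the FASTA afterwards."""
    return [Path(str(fasta) + suffix) for suffix in BWA_INDEX_SUFFIXES]


def build_index(fasta):
    """Run `bwa index` and check that all expected index files were written."""
    if shutil.which("bwa") is None:
        raise RuntimeError("bwa was not found on the PATH")
    subprocess.run(bwa_index_command(fasta), check=True)
    missing = [p for p in expected_index_files(fasta) if not p.exists()]
    if missing:
        raise RuntimeError(f"missing index files: {missing}")
```

For example, `build_index("db_mOTU/db_mOTU_DB_NR.fasta")` would be run after replacing the NR FASTA, and the same call on `db_mOTU_DB_CEN.fasta` covers the centroid database.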

AlessioMilanese commented 4 years ago

Some notes:

when I say "no" in the third column of the previous comment, I mean that `motus profile` does not use this file, but it needs to be present for the computation (it can be empty);
if you extract genes from genomes, then you do not need meta-mOTUs, and hence the file db_mOTU_taxonomy_meta-mOTUs.tsv can be empty.
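
Following these two notes, a small sketch that creates empty placeholders for the entries marked "no" in the table could look like this. The function name `create_placeholders` and its `include_meta_taxonomy` flag are hypothetical, and whether every one of these files can really be empty for a given mOTUs version has not been tested here; the sketch never overwrites a file that already exists.

```python
from pathlib import Path

# Entries marked "no" in the table: they must exist but may be empty.
OPTIONAL_FILES = [
    "README",
    "db_mOTU_DB_CEN.fasta",
    "db_mOTU_DB_CEN.fasta.amb",
    "db_mOTU_DB_CEN.fasta.ann",
    "db_mOTU_DB_CEN.fasta.annotations",
    "db_mOTU_DB_CEN.fasta.bwt",
    "db_mOTU_DB_CEN.fasta.pac",
    "db_mOTU_DB_CEN.fasta.sa",
    "db_mOTU_bam_header_CEN",
    "db_mOTU_padding_coordinates_CEN.tsv",
    "db_mOTU_versions",
]
OPTIONAL_DIRS = ["db_mOTU_test"]  # listed in the table as a directory
META_TAXONOMY = "db_mOTU_taxonomy_meta-mOTUs.tsv"


def create_placeholders(db_dir, include_meta_taxonomy=False):
    """Create empty files/directories for missing optional entries.

    Existing entries are left untouched. Returns the names that were created.
    """
    db_dir = Path(db_dir)
    db_dir.mkdir(parents=True, exist_ok=True)
    names = list(OPTIONAL_FILES)
    if include_meta_taxonomy:
        # Only sensible when the database has no meta-mOTUs at all.
        names.append(META_TAXONOMY)
    created = []
    for name in names:
        path = db_dir / name
        if not path.exists():
            path.touch()
            created.append(name)
    for name in OPTIONAL_DIRS:
        path = db_dir / name
        if not path.exists():
            path.mkdir()
            created.append(name)
    return created
```

Calling `create_placeholders("db_mOTU", include_meta_taxonomy=True)` would fit the case where all genes come from genomes, so there are no meta-mOTUs.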

AlessioMilanese commented 4 years ago

I will close the issue. Feel free to re-open it if you need more help.

shihuang047 commented 3 years ago

Hello Alessio, I'm now trying to construct a custom mOTU2 database from a reference genome database with 170k microbial genomes (including bacteria, archaea, fungi).

Fetch markers (MGs): I did the marker extraction using fetchMG.pl. For each genome, we got a set of fna/faa files (N=40). Now we can concatenate them into one fna/faa per genome, right?
We don't quite understand how to derive the mOTUs from the MGs. Do we have to follow the same procedure described in the paper? For example, calculate pairwise global nucleotide identities across all genomes for each of the 40 MGs using vsearch, calculate the genome-to-genome distances to generate mOTUs, and then assign taxonomic annotations to the representative genome in each mOTU?
Thanks for giving us a concrete protocol to update the db_mOTU_DB folder. But I still don't quite understand whether a customized db_mOTU_DB_NR.fasta can be generated using the protocol above. Can you provide any practical code that can be used for this purpose? Thank you so much!!

Shi

AlessioMilanese commented 3 years ago

Hi Shi,

I think this would make things easier: https://github.com/AlessioMilanese/read_counter

Fetch markers (MGs): I did the marker extraction using fetchMG.pl. For each genome, we got a set of fna/faa files (N=40). Now we can concatenate them into one fna/faa per genome, right?

You have to concatenate all fna files into only one file. Make sure the headers of the fasta file have proper names.
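
As a rough illustration of that concatenation step, the sketch below merges all `*.fna` files from a list of per-genome directories into one FASTA and rewrites the headers so that each one is unique. The directory layout (one folder per genome, one `.fna` file per marker gene) and the header scheme `<genome>.<marker>.<n>` are assumptions made for this example; the thread does not specify which header format is required.

```python
from pathlib import Path


def iter_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_fna(genome_dirs, out_path):
    """Write all sequences into one FASTA with unique, rewritten headers.

    Hypothetical header scheme: <genome directory name>.<fna file stem>.<n>
    Returns the number of sequences written.
    """
    seen = set()
    with open(out_path, "w") as out:
        for genome_dir in map(Path, genome_dirs):
            for fna in sorted(genome_dir.glob("*.fna")):
                for n, (_old_header, seq) in enumerate(iter_fasta(fna), start=1):
                    new_header = f"{genome_dir.name}.{fna.stem}.{n}"
                    if new_header in seen:
                        raise ValueError(f"duplicate header: {new_header}")
                    seen.add(new_header)
                    out.write(f">{new_header}\n{seq}\n")
    return len(seen)
```

The duplicate check matters with many genomes: two input folders with the same name would otherwise produce identical headers in the combined file.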

Another note about:

with 170k microbial genomes (including bacteria, archaea, fungi)

You might want to check the fungi, because the 40 MGs will probably not be called there (fetchMG is designed to find genes in Bacteria and Archaea).

motu-tool / mOTUs

Own Database #47