EllenOldenburg closed 4 years ago
Hi @EllenOldenburg,
Thanks for your interest in mOTUs!
In order to use your own genes, you need to change some files in the directory `db_mOTU` (created by `setup.py`).
Before going there, it needs to be clear what the difference is between genes (often referred to as MGs, for marker genes), marker gene clusters (MGCs) and mOTUs.
Here is a description of the files in the database:
File name | Description | Needed to run `motus profile`? |
---|---|---|
README | Information about the database | no |
db_mOTU_DB_CEN.fasta | Padded gene sequences used for `motus snv_call`. It contains only the centroid sequences (so one sequence per MGC). | no |
db_mOTU_DB_CEN.fasta.amb | Created by `bwa index db_mOTU_DB_CEN.fasta` | no |
db_mOTU_DB_CEN.fasta.ann | Created by `bwa index db_mOTU_DB_CEN.fasta` | no |
db_mOTU_DB_CEN.fasta.annotations | Custom file to run meta_snv | no |
db_mOTU_DB_CEN.fasta.bwt | Created by `bwa index db_mOTU_DB_CEN.fasta` | no |
db_mOTU_DB_CEN.fasta.pac | Created by `bwa index db_mOTU_DB_CEN.fasta` | no |
db_mOTU_DB_CEN.fasta.sa | Created by `bwa index db_mOTU_DB_CEN.fasta` | no |
db_mOTU_DB_NR.fasta | Padded gene sequences used for `motus profile`. NR stands for "non-redundant". It contains all the gene sequences used for mapping reads and then evaluating species abundances. | yes |
db_mOTU_DB_NR.fasta.amb | Created by `bwa index db_mOTU_DB_NR.fasta` | yes |
db_mOTU_DB_NR.fasta.ann | Created by `bwa index db_mOTU_DB_NR.fasta` | yes |
db_mOTU_DB_NR.fasta.bwt | Created by `bwa index db_mOTU_DB_NR.fasta` | yes |
db_mOTU_DB_NR.fasta.pac | Created by `bwa index db_mOTU_DB_NR.fasta` | yes |
db_mOTU_DB_NR.fasta.sa | Created by `bwa index db_mOTU_DB_NR.fasta` | yes |
db_mOTU_MAP_MGCs_to_mOTUs.tsv | File that maps the MGCs to the mOTUs | yes |
db_mOTU_MAP_MGCs_to_mOTUs_in-line.tsv | File that maps the MGCs to the mOTUs (different format from `db_mOTU_MAP_MGCs_to_mOTUs.tsv`) | yes |
db_mOTU_MAP_genes_to_MGCs.tsv | File that maps genes to MGCs | yes |
db_mOTU_bam_header_CEN | Header of the SAM file that is produced by `bwa mem` on `db_mOTU_DB_CEN.fasta` | no |
db_mOTU_bam_header_NR | Header of the SAM file that is produced by `bwa mem` on `db_mOTU_DB_NR.fasta` | yes |
db_mOTU_genes_length_NR | File with info about the gene length | yes |
db_mOTU_padding_coordinates_CEN.tsv | Padding coordinates for the genes in `db_mOTU_DB_CEN.fasta` | no |
db_mOTU_padding_coordinates_NR.tsv | Padding coordinates for the genes in `db_mOTU_DB_NR.fasta` | yes |
db_mOTU_taxonomy_CAMI.tsv | Taxonomy file used for CAMI output | only for `-C` |
db_mOTU_taxonomy_meta-mOTUs.tsv | Taxonomy of the meta-mOTUs | yes |
db_mOTU_taxonomy_ref-mOTUs.tsv | Taxonomy of the ref-mOTUs | yes |
db_mOTU_taxonomy_ref-mOTUs_short_names.tsv | Taxonomy of the ref-mOTUs (specifies a short version of the species name) | yes |
db_mOTU_test | Directory with test samples used by `motus profile --test` | no |
db_mOTU_versions | Version of database and scripts | no |
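After editing the database by hand, it can help to confirm that every file listed as needed for `motus profile` is still in place. The following is a minimal sketch, not part of mOTUs: the function name `check_db` and the `REQUIRED` list are hypothetical, and the file names are simply copied from the table above. It reports files that are missing and files that are empty; whether an empty file is a problem depends on the file (`db_mOTU_taxonomy_meta-mOTUs.tsv`, for instance, is allowed to be empty).

```python
from pathlib import Path

# File names copied from the table above (rows marked "yes").
# This list is an illustration, not something shipped with mOTUs.
REQUIRED = [
    "db_mOTU_DB_NR.fasta",
    "db_mOTU_DB_NR.fasta.amb",
    "db_mOTU_DB_NR.fasta.ann",
    "db_mOTU_DB_NR.fasta.bwt",
    "db_mOTU_DB_NR.fasta.pac",
    "db_mOTU_DB_NR.fasta.sa",
    "db_mOTU_MAP_MGCs_to_mOTUs.tsv",
    "db_mOTU_MAP_MGCs_to_mOTUs_in-line.tsv",
    "db_mOTU_MAP_genes_to_MGCs.tsv",
    "db_mOTU_bam_header_NR",
    "db_mOTU_genes_length_NR",
    "db_mOTU_padding_coordinates_NR.tsv",
    "db_mOTU_taxonomy_meta-mOTUs.tsv",
    "db_mOTU_taxonomy_ref-mOTUs.tsv",
    "db_mOTU_taxonomy_ref-mOTUs_short_names.tsv",
]


def check_db(db_dir):
    """Return (missing, empty): required file names absent from db_dir,
    and required file names that exist but have zero size."""
    db = Path(db_dir)
    missing = [name for name in REQUIRED if not (db / name).is_file()]
    empty = [
        name
        for name in REQUIRED
        if (db / name).is_file() and (db / name).stat().st_size == 0
    ]
    return missing, empty


if __name__ == "__main__":
    missing, empty = check_db("db_mOTU")
    print("missing:", missing)
    print("empty:", empty)
```

The check only looks at presence and size; it does not validate the content or format of any file.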
Some notes:
- for files marked "no" in the table, `motus profile` does not use the file, but it still needs to be present for the computation (it can be empty);
- `db_mOTU_taxonomy_meta-mOTUs.tsv` can be empty.

I will close the issue. Feel free to re-open it if you need more help.
Hello Alessio, I'm now trying to construct the custom mOTU2 database with a reference genome database with 170k microbial genomes (including bacteria, archaea, fungi).
Shi
Hi Shi,
I think this would make things easier: https://github.com/AlessioMilanese/read_counter
> Fetch markers (MGs) from
> I did the marker extraction using fetchMG.pl. For each genome, we got a set of fna/faa files (N=40). Now we can concatenate them into one fna/faa per genome. Right?
You have to concatenate all fna files into a single file. Make sure the headers of the FASTA file have proper names.
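As a rough illustration of that concatenation step, here is a small Python sketch. The function name `concatenate_fasta` is hypothetical, and what counts as a "proper" header is not spelled out in this thread; the sketch only assumes that sequence IDs should be unique, and takes the ID to be the first whitespace-delimited token of each header line. It writes all input files into one output file and returns any IDs that occur more than once.

```python
def concatenate_fasta(fna_paths, out_path):
    """Concatenate FASTA files into out_path.

    Returns the list of sequence IDs seen more than once. The ID is taken
    as the first whitespace-delimited token after '>' (an assumption).
    """
    seen = set()
    duplicates = []
    with open(out_path, "w") as out:
        for path in fna_paths:
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        seq_id = parts[0] if parts else ""
                        if seq_id in seen:
                            duplicates.append(seq_id)
                        seen.add(seq_id)
                    out.write(line)
                    last = line
            # Make sure the next file's first header starts on its own line.
            if not last.endswith("\n"):
                out.write("\n")
    return duplicates
```

A call might look like `concatenate_fasta(sorted(glob.glob("markers/*/*.fna")), "all_genes.fna")`, where the directory layout and file names are placeholders; an empty return value means no ID was repeated.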
Another note about:
> with 170k microbial genomes (including bacteria, archaea, fungi)

You might want to check the fungi, because the 40 MGs would probably not be called there (since fetchMG is designed to find genes in Bacteria and Archaea).
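One way to run that check is to count how many marker files were actually produced for each genome. The sketch below is only an illustration under assumptions: it supposes one directory per genome, each holding one `.fna` file per marker gene, which may not match how the fetchMG output is organized in practice. The function name `genomes_with_few_markers` and the constant `EXPECTED_MARKERS` are hypothetical.

```python
from pathlib import Path

EXPECTED_MARKERS = 40  # number of MGs mentioned in this thread


def genomes_with_few_markers(root, min_markers=EXPECTED_MARKERS):
    """Return {genome_dir_name: n_called} for genomes with fewer than
    min_markers non-empty .fna files.

    Assumed layout (hypothetical): <root>/<genome>/<marker>.fna
    """
    report = {}
    genome_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for genome_dir in genome_dirs:
        called = sum(
            1 for f in genome_dir.glob("*.fna") if f.stat().st_size > 0
        )
        if called < min_markers:
            report[genome_dir.name] = called
    return report
```

Genomes that show up in the report with few or no markers, which is likely for the fungal ones, are candidates to leave out of the database.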
Is it possible to generate our own database if we want to use a very specific dataset with particular marker genes?