How to create custom DataBase of de novo assembly?

maruiqi0710 commented 1 year ago

I assembled a genome de novo from FASTQ files and want to remove organelle genomes (mitochondria, chloroplasts, etc.) and plasmid genomes. How should I set up a custom database of organelle and plasmid genomes? The organelle and plasmid genomes were also assembled de novo.

timkahlke commented 1 year ago

There are 2 possibilities:

Create a custom mapping DB (sequenceID -> NCBI taxonomyID). If you know the NCBI taxonomies of your assembled organelles/plastids, you can create a tab-separated mapping file with the columns organelleID and NCBI_taxonomyID, and use the "create_db" command as described in the wiki here

Create a new taxonomy DB and a custom mapping DB. This is a more fiddly way, in case you don't have the NCBI taxonomies and don't want to look them up. For this, you first have to replace the NCBI taxonomy database with a new one, and then create a mapping file and database to map your sequence IDs to those in the new taxonomy database, which usually just means mapping them to themselves.

Create a tab-separated file with the columns sequenceID and taxonomy string for all your assembled sequences. Then locate your BASTA folder (default is .basta in your home) and, in the taxonomy directory, rename the directory "complete_taxa.db" to something like "complete_taxa.db.orig" (this is the directory containing the actual NCBI taxonomies). After that, create a new database using the "create_db" command (see wiki link above) with the mapping file (ID->taxonomy) that you just created and "complete_taxa" as the DB type. This will create a new directory "complete_taxa.db" in your $BASTA/taxonomy directory.
Now you need to make a second mapping file, again tab-separated, with just sequenceID sequenceID, i.e., both columns in the file are identical. Then create a second database (with the create_db command) with a type of your choice. This type will then be used as the mapping type parameter in your basta call.
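
The two mapping files described above could be generated with a short script. The following is only a minimal sketch, not part of BASTA: the function names, the input file name organelles.fasta, and the output file names are hypothetical, and it assumes that the sequence ID is the first whitespace-delimited token of each FASTA header and that the files should carry no header row.

```python
def read_fasta_ids(fasta_path):
    """Return sequence IDs: the text after '>' up to the first whitespace."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.append(fields[0])
    return ids


def write_mapping_files(seq_ids, taxonomy, taxonomy_tsv, identity_tsv):
    """Write two tab-separated files without a header row.

    taxonomy_tsv maps each sequence ID to the given taxonomy string;
    identity_tsv maps each sequence ID to itself.
    """
    with open(taxonomy_tsv, "w") as tax_out, open(identity_tsv, "w") as id_out:
        for seq_id in seq_ids:
            tax_out.write(f"{seq_id}\t{taxonomy}\n")
            id_out.write(f"{seq_id}\t{seq_id}\n")


if __name__ == "__main__":
    # Hypothetical input: a FASTA file holding the assembled organelle sequences.
    ids = read_fasta_ids("organelles.fasta")
    # "mtDNA" is just an example taxonomy string; use whatever labels you need.
    write_mapping_files(ids, "mtDNA", "seqid_Taxonomy.tsv", "seqid_seqid.tsv")
```

Both files are written in the same loop so that they always contain exactly the same set of IDs. The resulting files would then be passed to the create_db command as described above.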

maruiqi0710 commented 1 year ago

I did the following

I have prepared two tab-separated files:

seqid_Taxonomy.tsv

sequenceID  Taxonomy_string
seqid_1 mtDNA
seqid_2 mtDNA
seqid_3 mtDNA
seqid_4 mtDNA

seqid_seqid.tsv

sequenceID  sequenceID
seqid_1 seqid_1
seqid_2 seqid_2
seqid_3 seqid_3
seqid_4 seqid_4

I ran the following commands:

basta create_db -d "$db_dir" seqid_Taxonomy.tsv complete_taxa 0 1
basta create_db -d "$db_dir" seqid_seqid.tsv "$db_type" 0 1

I then ran the following command: basta sequence -d "$db_dir" "$blast_result" "$out_file" "$db_type"

I get the result:

Why are some of the results "mtDNA" and some "Unknown"?

timkahlke / BASTA

How to create custom DataBase of de novo assembly? #43