timkahlke / BASTA

Basic Sequence Taxonomy Annotator
GNU General Public License v3.0

How to create custom DataBase of de novo assembly? #43

Open maruiqi0710 opened 1 year ago

maruiqi0710 commented 1 year ago

I assembled a genome de novo from FASTQ files and want to remove organelle genomes (mitochondria, chloroplasts, etc.) and plasmid genomes. How should I set up a custom database of organelle and plasmid genomes? The organelle and plasmid genomes were also assembled de novo.

timkahlke commented 1 year ago

There are 2 possibilities:

1. Create a custom mapping DB (sequenceID -> NCBI taxonomyID). If you know the NCBI taxonomy IDs of your assembled organelles/plastids, you can create a tab-separated mapping file with the columns organelleID and NCBI_taxonomyID and use the "create_db" command as described in the wiki.

2. Create a new taxonomy DB and a custom mapping DB. This is a more fiddly way, for the case where you don't have the NCBI taxonomies and don't want to look them up. You first have to replace the NCBI taxonomy database with a new one, and then create a mapping file and database that map your sequence IDs to those in the new taxonomy database, which usually just means mapping them to themselves.

  1. Create a tab-separated file with the columns sequenceID and taxonomy string for all your assembled sequences. Then locate your BASTA folder (default is .basta in your home directory), and in its taxonomy directory rename the directory "complete_taxa.db" to something like "complete_taxa.db.orig" (this is the directory containing the actual NCBI taxonomies). After that, create a new database using the "create_db" command (see the wiki link above) with the mapping file (ID -> taxonomy) that you just created and the DB type "complete_taxa". This will create a new directory "complete_taxa.db" in your $BASTA/taxonomy directory.

  2. Now make a second mapping file, again tab-separated, with the columns sequenceID and sequenceID, i.e., both columns in the file are identical. Then create a second database (with the create_db command) with a type of your choice. This type is then used as the mapping type parameter in your basta call.
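The two mapping files from steps 1 and 2 could be generated from the assembled sequences roughly as follows. This is only a sketch under assumptions: the input file organelles.fasta, its two example records, and the taxonomy label "mtDNA" are hypothetical placeholders, and the exact taxonomy string format that BASTA expects is not specified in this thread. The files are written without a header row.

```shell
#!/bin/sh
# Hypothetical input: a tiny FASTA file standing in for the de novo
# assembled organelle/plasmid sequences (replace with your own file).
cat > organelles.fasta <<'EOF'
>seqid_1 example contig
ACGTACGT
>seqid_2 example contig
TTGACCAA
EOF

# Placeholder taxonomy string (assumption); use whatever label or
# lineage you want reported for these sequences.
label="mtDNA"

# Step 1 file: sequenceID <TAB> taxonomy string.
# The sequence ID is the first word of each FASTA header without ">".
awk -v label="$label" '/^>/ { id = substr($1, 2); print id "\t" label }' \
    organelles.fasta > seqid_Taxonomy.tsv

# Step 2 file: sequenceID <TAB> sequenceID (both columns identical).
awk '/^>/ { id = substr($1, 2); print id "\t" id }' \
    organelles.fasta > seqid_seqid.tsv
```

The IDs in both files have to match the subject IDs that appear in your BLAST output, so it is worth checking how the BLAST database was built from the same FASTA file.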

maruiqi0710 commented 1 year ago

I did the following:

  1. I have prepared two tab-separated files:

seqid_Taxonomy.tsv

sequenceID  Taxonomy_string
seqid_1 mtDNA
seqid_2 mtDNA
seqid_3 mtDNA
seqid_4 mtDNA

seqid_seqid.tsv

sequenceID  sequenceID
seqid_1 seqid_1
seqid_2 seqid_2
seqid_3 seqid_3
seqid_4 seqid_4

  2. I ran the following commands:

    basta create_db -d "$db_dir" seqid_Taxonomy.tsv complete_taxa 0 1
    basta create_db -d "$db_dir" seqid_seqid.tsv "$db_type" 0 1

  3. I ran the following command:

    basta sequence -d "$db_dir" "$blast_result" "$out_file" "$db_type"

I get the result: [image]

Why are some of the results "mtDNA" and some "Unknown"?