Open maruiqi0710 opened 1 year ago
There are 2 possibilities:
Create a custom mapping DB sequenceID -> NCBI taxonomyID
If you know the NCBI taxonomies of your assembled organelles/plastids, you can create a tab-separated mapping file
organelleID NCBI_taxonomyID
and use the "create_db" command as described in the wiki here
Create a new taxonomy DB and a custom mapping DB
This is a more fiddly way, for the case where you don't have the NCBI taxonomies and don't want to look them up. For this you first have to replace the NCBI taxonomy database with a new one, and then create a mapping file and database to map your sequence IDs to those in the new taxonomy database, which usually just means mapping them to themselves.
Create a file
sequenceID Taxonomy string
for all your assembled sequences. Then locate your BASTA folder (default is .basta in your home), and in the taxonomy directory rename the directory "complete_taxa.db" to something like "complete_taxa.db.orig" (this is the directory containing the actual NCBI taxonomies). After that, create a new database using the "create_db" command (see wiki link above) with the mapping file (ID->taxonomy) that you just created and DB type is "complete_taxa". This will create a new directory "complete_taxa.db" in your $BASTA/taxonomy directory.
Now you need to make a mapping file, which is again a tab-separated file, just with
sequenceID sequenceID
i.e., both columns in the file are identical. Now create a second database (with the create_db command) with a type of your choice. This type will then be used as the mapping type parameter in your basta call.
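The two files described above can be generated from a plain list of sequence IDs. This is only a minimal sketch: the input file `ids.txt` (one sequence ID per line) and the single placeholder taxonomy string applied to every sequence are assumptions for illustration, not something BASTA provides or requires.

```shell
# Hypothetical input: ids.txt with one assembled sequence ID per line.
printf 'seqid_1\nseqid_2\nseqid_3\nseqid_4\n' > ids.txt

# Placeholder taxonomy string (assumption); use whatever string fits your data.
taxonomy="mtDNA"

# File 1: sequenceID <TAB> taxonomy string (skips empty lines).
awk -v tax="$taxonomy" 'BEGIN { OFS = "\t" } NF { print $1, tax }' ids.txt > seqid_Taxonomy.tsv

# File 2: sequenceID <TAB> sequenceID, i.e. each ID mapped to itself.
awk 'BEGIN { OFS = "\t" } NF { print $1, $1 }' ids.txt > seqid_seqid.tsv

# Sanity check: print any line that does not have exactly two tab-separated fields.
awk -F '\t' 'NF != 2 { print FILENAME ": line " FNR ": " $0 }' seqid_Taxonomy.tsv seqid_seqid.tsv
```

If the last command prints nothing, both files are strictly two-column and tab-separated, which is the layout the steps above ask for.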
I created the following two files:
seqid_Taxonomy.tsv
sequenceID Taxonomy_string
seqid_1 mtDNA
seqid_2 mtDNA
seqid_3 mtDNA
seqid_4 mtDNA
seqid_seqid.tsv
sequenceID sequenceID
seqid_1 seqid_1
seqid_2 seqid_2
seqid_3 seqid_3
seqid_4 seqid_4
I ran the following commands to create the two databases:
basta create_db -d "$db_dir" seqid_Taxonomy.tsv complete_taxa 0 1
basta create_db -d "$db_dir" seqid_seqid.tsv "$db_type" 0 1
I then ran the following command to classify the sequences:
basta sequence -d "$db_dir" "$blast_result" "$out_file" "$db_type"
I get the result:
Why are some of the results "mtDNA" and some "Unknown"?
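One thing that can be checked here, as a guess rather than a confirmed cause, is whether every subject ID in the BLAST output has an entry in the mapping file. The sketch below assumes a tab-separated BLAST table with the subject ID in column 2; the demo files and their contents are hypothetical and only stand in for the real "$blast_result" and seqid_seqid.tsv.

```shell
# Hypothetical demo files; replace them with your own BLAST output and mapping file.
workdir=$(mktemp -d)
printf 'query_1\tseqid_1\t99.0\nquery_2\tseqid_5\t98.5\n' > "$workdir/blast.tsv"
printf 'seqid_1\tseqid_1\nseqid_2\tseqid_2\n' > "$workdir/map.tsv"

# Unique subject IDs (column 2 of the BLAST table, by assumption).
cut -f 2 "$workdir/blast.tsv" | sort -u > "$workdir/hits.txt"

# Unique IDs from column 1 of the mapping file.
cut -f 1 "$workdir/map.tsv" | sort -u > "$workdir/mapped.txt"

# IDs that occur among the BLAST hits but are absent from the mapping file.
comm -23 "$workdir/hits.txt" "$workdir/mapped.txt" > "$workdir/missing.txt"
cat "$workdir/missing.txt"
```

Any ID listed in `missing.txt` has no mapping entry. Whether that is what produces "Unknown" in BASTA is not something this check can confirm; it only narrows down where the IDs diverge.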
I assembled a genome de novo from FASTQ files and want to remove the organelle genomes (mitochondria, chloroplasts, etc.) and plasmid genomes. How should I set up a custom database of organelle and plasmid genomes? The organelle and plasmid genomes were also assembled de novo.