microgenomics / pasteTaxID

This script takes your fasta files, searches for common IDs (ti, gi, gb, emb), gets the ti (or the gi if the ti is missing), and finally writes the IDs into the same fasta file
GNU General Public License v2.0

pasteTaxID


Welcome to the final solution to a lot of headaches :D This script takes a multifasta file (or many individual fasta files), searches for common IDs (acc, ti, gi, emb, gb), and appends the corresponding ti to each fasta entry at high speed: about 10000 entries in just 4 minutes, depending on your internet connection and on whether you have an NCBI API key.

pasteTaxID can handle large multifasta files (we have tried more than 150000 entries without any problems) without overloading your system, which saves a lot of time!

Requirements

Usage

There are two ways of running the script. If you have a directory with individual fasta files, then use the following strategy:

Simple way

bash pasteTaxID.bash --workdir [directory_fastas]

However, if you have a multifasta file, the appropriate command line is:

bash pasteTaxID.bash --multifasta [multifasta_file]

where --workdir is the directory that contains your fasta files and --multifasta is the multifasta file (the .fna, .fn and .fasta extensions all work).
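If you wrap the script in your own pipeline, the choice between the two flags can be made automatically. The following is only a minimal sketch: pick_input_flag is a hypothetical helper name (it is not part of pasteTaxID), and it assumes that a directory always means --workdir and a regular file always means --multifasta.

```shell
# Hypothetical helper, not part of pasteTaxID: print the flag that
# matches the kind of input path given as the first argument.
pick_input_flag() {
    if [ -d "$1" ]; then
        # A directory of individual fasta files
        echo "--workdir"
    elif [ -f "$1" ]; then
        # A single multifasta file
        echo "--multifasta"
    else
        echo "error: $1 is neither a directory nor a file" >&2
        return 1
    fi
}
```

With this in place, a call such as bash pasteTaxID.bash "$(pick_input_flag "$input")" "$input" would run the script in the matching mode, where $input holds your path.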

In the example folder there are some fasta files that don't contain a gi or a ti, just a gb. Test the script by running:

bash pasteTaxID.bash --workdir example

Wait a few seconds and check the fasta files again. You should now see taxonomy IDs and GIs appended to the fasta entries.
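One quick way to check the result is to count how many headers carry a taxonomy ID. The sketch below assumes that the ID shows up in the header as a pipe-delimited tag of the form ti|<digits> (for example >gi|1|ti|9606|gb|A.1|); that format is an assumption, so adjust the pattern to whatever your output actually looks like. count_ti_headers is a hypothetical helper name.

```shell
# Hypothetical helper: report how many fasta headers in file $1 contain
# a "ti|<digits>" tag. The tag format is an assumption, not taken from
# the pasteTaxID documentation.
count_ti_headers() {
    # Total number of header lines (lines starting with ">")
    total=$(grep -c '^>' "$1")
    # Header lines where "ti|<digits>" follows ">" directly or another "|"
    tagged=$(grep -Ec '^>(.*\|)?ti\|[0-9]+' "$1")
    echo "$tagged of $total headers carry a ti tag"
}
```

Run it on each file in the example folder, for instance count_ti_headers example/myfile.fasta (the file name here is a placeholder).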

Complete way

Additionally, there is a complete way that takes more parameters. If your default python is not 2.7, add --pythonBin followed by the full path to your python 2.7:

bash pasteTaxID.bash --multifasta myfasta.fasta --pythonBin /home/Peter/programs/python2.7/bin/python

Finally, you can set the number of parallel processes to speed up fetching (maximum jobs: the number of cores). Don't forget to create your NCBI API key (see the Requirements section):

bash pasteTaxID.bash --multifasta myfasta.fasta --parallelJobs 8 --apikey 1a2b3c4d56788xyz

This fetches 8 IDs at the same time (the maximum you can set is the number of cores on your machine).
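If you are not sure how many cores your machine has, the job count can be derived instead of hard-coded. This is a sketch under assumptions: detect_cores and clamp_jobs are hypothetical helper names, and the core count is read with getconf _NPROCESSORS_ONLN (falling back to nproc, then to 1), which may not be available on every system.

```shell
# Hypothetical helper: print the number of online CPU cores.
# Tries getconf first, then nproc, and falls back to 1.
detect_cores() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=$(nproc 2>/dev/null) || n=1
    # Guard against empty or non-numeric output
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

# Hypothetical helper: print the requested job count ($1), reduced to
# the number of cores when the request is larger.
clamp_jobs() {
    cores=$(detect_cores)
    if [ "$1" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$1"
    fi
}
```

For example, bash pasteTaxID.bash --multifasta myfasta.fasta --parallelJobs "$(clamp_jobs 8)" --apikey 1a2b3c4d56788xyz asks for 8 jobs but never more than the cores detected.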

Notes

External useful tools

Check out these tools to extract some useful information from your data: