iTaxoTools / TaxI2

Calculation and analysis of pairwise sequence distances
GNU General Public License v3.0

Add new program mode: NCBI Blast #22

Open mvences opened 3 years ago

mvences commented 3 years ago

This additional function is lower priority and should be dealt with after issues #20 and #21.

One additional functionality TaxI3 should offer is to compare a set of sequences (from the input file) online to the NCBI-Genbank reference data set (which comprises many millions of sequences) using the server's BLAST algorithm, and to retrieve the best matches as well as their identification. In principle this process is rather easy, but there are several handicaps:

However, BLAST searches against this online database have many advantages and offer many important options, such as retrieving for a query sequence all geographic localities where this species may occur, and so on. So we should not totally omit it from TaxI3 as many users will expect such an option.

Maybe to start, this can be implemented in a very simple way without many options: take each sequence, submit it to the NCBI BLAST search, retrieve only one (the first) hit that the database returns, and print a simple output file with the basic information returned from the database.

For this mode, I also suggest building in a "blocker" that first counts the number of sequences in the input file and only takes the first 100 sequences for submission to NCBI, issuing an error message "This process is very time consuming, and for now only allows comparing 100 sequences at once; the first 100 sequences from the input file are being used".
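The "blocker" could look roughly like the sketch below. It assumes the input is a FASTA file, which is only one of the possible input formats; the names `read_fasta`, `limit_sequences` and `MAX_SEQUENCES` are hypothetical and not part of the existing code.

```python
import sys

MAX_SEQUENCES = 100  # assumed cap, taken from the suggestion above


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def limit_sequences(records, limit=MAX_SEQUENCES, warn=sys.stderr):
    """Count the records and keep only the first `limit`, with a message."""
    records = list(records)
    if len(records) > limit:
        print(
            "This process is very time consuming, and for now only allows "
            f"comparing {limit} sequences at once; the first {limit} "
            "sequences from the input file are being used",
            file=warn,
        )
        return records[:limit]
    return records
```

Usage would be something like `limit_sequences(read_fasta(open(path)))`, with the result passed on to the BLAST step.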

Once the implementation is successful, we can think about ways to improve the output.

Probably the easiest way to implement this is using Biopython, see this link:

https://biopython.org/docs/dev/api/Bio.Blast.NCBIWWW.html
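A minimal sketch of the simple mode with Biopython, assuming `NCBIWWW.qblast` and `NCBIXML.read` from the linked module are used. The program `blastn`, the database `nt`, the helper names and the output columns are all assumptions for illustration, not a decided design. The request needs network access and can take a long time per sequence.

```python
import csv


def blast_first_hit(sequence, program="blastn", database="nt"):
    """Submit one sequence to NCBI BLAST; return its first hit or None."""
    # Imported here so the output helper below works without Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    # hitlist_size=1 asks the server for a single hit only.
    handle = NCBIWWW.qblast(program, database, sequence, hitlist_size=1)
    record = NCBIXML.read(handle)
    handle.close()
    if not record.alignments:
        return None
    alignment = record.alignments[0]
    hsp = alignment.hsps[0]
    return {
        "accession": alignment.accession,
        "title": alignment.title,
        "evalue": hsp.expect,
        "identity_pct": round(100.0 * hsp.identities / hsp.align_length, 2),
    }


def write_hits(rows, out):
    """Write (query_name, hit_or_None) pairs as a tab-separated table."""
    columns = ["accession", "title", "evalue", "identity_pct"]
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["query"] + columns)
    for name, hit in rows:
        if hit is None:
            # Keep the query in the output even when nothing matched.
            writer.writerow([name] + [""] * len(columns))
        else:
            writer.writerow([name] + [hit[c] for c in columns])
```

The two steps are kept separate so that the output format can be improved later without touching the part that talks to the server.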

necrosovereign commented 2 years ago

As far as I understand, the NCBI BLAST web API is deprecated. Apparently, they expect developers to create their own copies of the databases using cloud providers (e.g. Google, Amazon) and direct the requests to those copies.

mvences commented 2 years ago

Really? Oh, this is very understandable; probably people kept sending massive BLAST requests to their servers, which made them very slow. OK, let me look into this ... we keep this issue open for now, but we will probably drop it eventually.