qiyunlab / HGTector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.
BSD 3-Clause "New" or "Revised" License

Getting one hit per protein (itself) #23

Open chenziwu5 opened 5 years ago

chenziwu5 commented 5 years ago

Hi qiyun, I am running my own genome against a local BLAST nr database, but I get only one hit per protein: the protein itself, at 100 percent identity. I tested the sample files and got the same result. I also tried switching to the stdb database, which contains one representative per species from all available non-redundant RefSeq prokaryotic proteomes. Unfortunately, the result was the same. Here is my config:

selfTax=REE1:651137
searchTool=BLAST
blastp=blastp
blastdbcmd=blastdbcmd
protdb=/share/home/yuanqingc/db/HGTector/stdb
taxdump=/share/home/yuanqingc/db/HGTector/taxdump
prot2taxid=/share/home/yuanqingc/db/HGTector/prot2taxid.txt
threads=38
queries=38
identity=30
coverage=50
maxHits=100

I also tried providing pre-computed search results, with the same outcome. Any idea where the problem might be? Waiting on the line : ) Thanks

qiyunzhu commented 5 years ago

Hello @chenziwu5, sorry for the late reply! I have seen similar cases before, and the causes vary, so let me guess. Did you use databaser.py to build the database? One possibility is that newer versions of ncbi-blast+ have changed how they build databases. You could manually run a test blastp search and look at the output file (the tabular format, i.e. -outfmt 6 in BLAST+ or -m 8 in legacy BLAST). What do the subject IDs look like? Something like NP_123456.1, or ref|NP_123456.1? If the latter, HGTector won't pick them up.
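
To check a result file quickly, something like the following rough sketch could count how many subject IDs carry the ref| prefix. It assumes a tab-separated BLAST table with the subject ID in the second column; the function name and the sample rows are made up for illustration and are not part of HGTector.

```python
from collections import Counter


def classify_subject_ids(lines):
    """Count subject IDs (column 2) that are bare vs. prefixed with 'ref|'.

    Assumption: tab-separated hit table, one hit per line, subject ID in
    the second column. Blank lines and '#' comment lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a hit row
        kind = "prefixed" if fields[1].startswith("ref|") else "bare"
        counts[kind] += 1
    return counts


# Hypothetical rows, only to show the two ID styles mentioned above.
sample = [
    "# comment line",
    "query1\tref|NP_123456.1\t100.0",
    "query1\tNP_654321.1\t45.2",
]
counts = classify_subject_ids(sample)
print(dict(counts))
```

If the "prefixed" count is non-zero, the ID format is a likely suspect. To run it on a real file, pass an open file handle instead of the sample list.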

The remedy is to remove the ref| prefix from the pre-computed BLASTp results. Alternatively, use DIAMOND instead of BLASTp. DIAMOND is much faster and similarly accurate, so I recommend it, and it doesn't have this issue. Hope this helps!
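
For the first remedy, a minimal sketch of stripping the prefix from a pre-computed table might look like this. It assumes a tab-separated table with the subject ID in the second column, and it also drops a trailing | in case the IDs look like ref|NP_123456.1|. The function names are hypothetical, and whether the cleaned table is then accepted by HGTector has not been verified here.

```python
def strip_ref_prefix(subject_id):
    """Turn 'ref|NP_123456.1' or 'ref|NP_123456.1|' into 'NP_123456.1'.

    IDs without the prefix are returned unchanged.
    """
    if subject_id.startswith("ref|"):
        subject_id = subject_id[len("ref|"):].rstrip("|")
    return subject_id


def clean_blast_table(lines):
    """Return the table lines with the subject ID column (column 2) cleaned.

    Assumption: tab-separated hit table. Comment and blank lines are passed
    through untouched (minus the trailing newline).
    """
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            cleaned.append(line)
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            fields[1] = strip_ref_prefix(fields[1])
        cleaned.append("\t".join(fields))
    return cleaned


# Hypothetical rows for illustration only.
rows = ["query1\tref|NP_123456.1|\t100.0", "query1\tNP_654321.1\t45.2"]
for row in clean_blast_table(rows):
    print(row)
```

Writing the cleaned lines to a new file, rather than overwriting the original search results, keeps the raw output available if the ID format turns out not to be the cause.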

Yliub commented 3 years ago

@chenziwu5 Hi, when I ran HGTector 0.2.2 to predict HGT events in my 30 genomes, I ran into the same problem you mentioned above. Could you please tell me whether you solved it, and what the cause was? Thanks very much.