qiyunlab / HGTector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.
BSD 3-Clause "New" or "Revised" License

HGT sources and GTDB #36

Closed: Gu2uMo closed this issue 3 years ago

Gu2uMo commented 4 years ago

Hi, I just noticed that HGTector has reached version 2, and I installed and ran it successfully. But I have a question: can it give the source of the HGT? I mean, the genes may come from a certain order, family, or other taxonomic level.

What's more, can we just use the result of 'hgtector search' and convert the protID to taxID to see which species those genes come from?

Thanks for your help and for this update. It is much easier to install and use than before.

qiyunzhu commented 4 years ago

Hello @Gu2uMo Whoa I didn't expect that the new version got some attention so quickly! And I really appreciate that you tested it and told me it worked!

Finding the source of HGT is certainly possible. Technically, yes, you can use the search result, which already has TaxIDs (last column). The best hit from the "distal" group of each candidate gene is a quick indication of its likely source. I can add this function to the program.

qiyunzhu commented 4 years ago

@Gu2uMo Okay, I added this function. The donor taxonomy is now appended to scores.tsv. It is not just the best hit in the "distal" group, but the lowest common ancestor (LCA) of the top several "distal" hits.
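For readers who want to see what "LCA of the top several distal hits" means in practice, here is a minimal sketch. It is not HGTector's actual code. It assumes the taxonomy is available as a child-to-parent dictionary (as could be built from nodes.dmp) and that the root points to itself. The function names and the toy TaxIDs are hypothetical.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root, inclusive (leaf first)."""
    path = [taxid]
    # Stop at the root (which points to itself) or at an unknown TaxID.
    while parent.get(taxid, taxid) != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(taxids, parent):
    """Lowest common ancestor of several TaxIDs in a parent-pointer tree."""
    taxids = list(taxids)
    if not taxids:
        return None
    # Start from the first lineage (ordered leaf -> root) and keep only
    # the ancestors that also occur in every other lineage.
    common = lineage(taxids[0], parent)
    for tid in taxids[1:]:
        others = set(lineage(tid, parent))
        common = [t for t in common if t in others]
    # The first surviving entry is the deepest shared ancestor.
    return common[0] if common else None


# Toy taxonomy with hypothetical TaxIDs (assumption, for illustration only).
parent = {
    "1": "1",     # root points to itself
    "10": "1",    # a phylum
    "20": "10",   # family A
    "21": "10",   # family B
    "200": "20",  # species in family A
    "201": "20",  # another species in family A
    "210": "21",  # species in family B
}

# TaxIDs of the top "distal" hits of one candidate gene (made-up values).
top_distal_hits = ["200", "201"]
donor = lca(top_distal_hits, parent)
print(donor)
```

If the top hits all fall in one family, the inferred donor is that family. If they are spread across families, the LCA moves up to a higher rank, which gives a more conservative donor assignment than taking the single best hit.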

Gu2uMo commented 4 years ago

Thanks~ Just as I was struggling with version 1, I found it had been updated. I have one more question:

My focus, Pseudomonas, is a very confusing genus, and its classification in the nr database is not very accurate, judging by either ANI or phylogenetic trees of conserved genes. Would this influence HGTector's accuracy?

In more detail: I use another tool, gtdbtk, to do the classification.

https://gtdb.ecogenomic.org/ Parks, D. H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology.

Its result seems reliable, I mean, according to the tree and ANI values. But then comes a problem: I can't use the taxonomy information in the nr database to infer HGT, because it would conflict with the GTDB classification.

So I want to know which files in your pre-built database matter. I know of db.faa, taxdump, and prot2taxid. Is there anything else?

qiyunzhu commented 4 years ago

Hi @Gu2uMo I understand your situation. I am also aware that Pseudomonas is highly confusing. If you use HGTector with the native NCBI taxonomy, you may need to specify a grouping scenario that is consistent with what is known about the phylogeny. For example, --close-tax 123,456,789 will instruct the program to consider TaxIDs 123, 456 and 789 as the same (monophyletic) group.

Using GTDB seems a plausible plan. You may need to convert GTDB into a format that resembles NCBI taxdump. Both taxdump and prot2taxid need to be modified somehow. Note that TaxID does not have to be numeric. Although I have not tested it, I guess it should work if you supply the original GTDB taxon names.
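As a rough illustration of that idea, the sketch below turns GTDB lineage strings into lines laid out like NCBI's nodes.dmp and names.dmp, using the GTDB taxon names themselves as non-numeric TaxIDs. It is untested against HGTector and rests on several assumptions: the function name is hypothetical, only the first three nodes.dmp fields are written (real NCBI files carry more), the rank labels are a guess at what is acceptable, and the example lineages are illustrative. The prot2taxid mapping would still need to be rebuilt separately.

```python
# Map GTDB rank prefixes to rank names (assumption: these labels are acceptable).
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def gtdb_to_taxdump(lineages):
    """Build nodes.dmp-like and names.dmp-like lines from GTDB lineage strings.

    Each lineage looks like "d__Bacteria;p__...;s__Genus species".
    The GTDB taxon name (prefix included) doubles as the TaxID.
    """
    parent_of = {"root": "root"}   # root points to itself, as in NCBI
    rank_of = {"root": "no rank"}
    for lin in lineages:
        parent = "root"
        for taxon in lin.split(";"):
            taxon = taxon.strip()
            if len(taxon) <= 3:    # skip unfilled ranks such as "s__"
                continue
            parent_of.setdefault(taxon, parent)
            rank_of.setdefault(taxon, RANKS.get(taxon[0], "no rank"))
            parent = taxon
    # NCBI dump files separate fields with "\t|\t" and end lines with "\t|".
    nodes = [f"{t}\t|\t{parent_of[t]}\t|\t{rank_of[t]}\t|" for t in parent_of]
    names = [f"{t}\t|\t{t}\t|\t\t|\tscientific name\t|" for t in parent_of]
    return nodes, names


# Illustrative lineage strings in GTDB style (not taken from a real release file).
example = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;"
    "f__Pseudomonadaceae;g__Pseudomonas;s__Pseudomonas aeruginosa",
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;"
    "f__Pseudomonadaceae;g__Pseudomonas;s__",
]
nodes, names = gtdb_to_taxdump(example)
print("\n".join(nodes))
```

Keeping the rank prefix (for example g__Pseudomonas) in the TaxID avoids collisions when the same name appears at two ranks.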

qiyunzhu commented 4 years ago

Hi @Gu2uMo and anyone who plans to use GTDB: I have figured out a solution to convert GTDB taxonomy into NCBI style, which can directly feed into HGTector. These files are generated based on GTDB release 89. I will update the repository once tested and working.

names.dmp.gz nodes.dmp.gz taxmap.txt.gz

Gu2uMo commented 4 years ago

Really surprised to get help after such a long time. Thanks a lot and I will try it again.

qiyunzhu commented 4 years ago

Hi @Gu2uMo Thanks for your interest! I just 1) had spare time and 2) realized the popularity of GTDB as other people also requested. So I am sitting down to get that stuff in.

qiyunzhu commented 4 years ago

New release 2.0b2 is now online, with instructions for using GTDB added.