qiyunlab / HGTector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.
BSD 3-Clause "New" or "Revised" License

Problem in downloading database #126

Open Subhajeet1997 opened 1 year ago

Subhajeet1997 commented 1 year ago

I used the command "hgtector database -o db_dir --default" to download the database. The protein files downloaded successfully, but while handling the genome files it shows the following error:

    Using local file GCF_963082495.1_Q8283_protein.faa.gz.
    Using local file GCF_963378075.1_MU0083_Flye_MinION_protein.faa.gz.
    Using local file GCF_963378095.1_MU0053_Flye_MinION.2_protein.faa.gz.
    Using local file GCF_963378105.1_MU0102_Flye_MinION_protein.faa.gz.
    Using local file GCF_963394915.1_CCUG_26878_T_protein.faa.gz.
    Done.
    Extracting downloaded genomic data...Killed

What is the reason behind it?

qiyunzhu commented 1 year ago

Hi @Subhajeet1997 Thanks for reporting. I have not seen this problem before. It seems to be a problem outside HGTector's Python code. Perhaps the gzip library isn't correctly installed on your computer. To debug, you may grab one of the downloaded .gz files (say, filename.gz) and attempt to open it with the following Python code:

    import gzip

    # open the gzipped file, decode its decompressed content as text,
    # and print the first line
    f = gzip.open('filename.gz', 'rb')
    print(f.read().decode().splitlines()[0])
    f.close()

If you get the same error, then my guess is correct.

Subhajeet1997 commented 1 year ago

Yes, I have tried to open a gzipped file using your script. It shows the following error:

    Traceback (most recent call last):
      File "/home/sutripa/test_1/python.py", line 3, in <module>
        print(f.read().decode().splitlines()[0])
              ^^^^^^^^^^^^^^^^^
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 2066: invalid start byte

Subhajeet1997 commented 1 year ago

But gzip is properly installed on my system. When I try to unzip the same file with gzip -d "filename", it unzips easily.

qiyunzhu commented 1 year ago

I see. The gzip program and Python may use different libraries, so perhaps the Python side is not right. It could also be that the gzipped file you tested is not a text file, which would cause the decoding error. Can you please try a text file? Alternatively, you can modify the line of code from print(f.read().decode().splitlines()[0]) to _ = f.read(). This will tell you whether it is a gzip library issue or a file content issue.
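A minimal sketch of that byte-only test (using the same placeholder file name as above):

    import gzip

    # read the raw decompressed bytes without decoding them as text; if this
    # also fails, the problem lies in the gzip library rather than in the
    # file's text content
    with gzip.open('filename.gz', 'rb') as f:
        _ = f.read()
    print('decompression succeeded')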

Subhajeet1997 commented 1 year ago

I have run the following script to read the gzipped text file test.txt.gz, and it runs successfully:

    import gzip
    f = gzip.open('test.txt.gz', 'rb')
    print(f.read().decode().splitlines()[0])
    f.close()

I then tried "hgtector database -o hgtector_database --default --threads 50" again, but I still get the same error:

    Using local file GCF_963394915.1_CCUG_26878_T_protein.faa.gz.
    Done.
    Extracting downloaded genomic data...Killed

If I can't download the database this way, I will use the recent pre-built database, but can you give me the proper link from which I can download it using wget? I can't properly understand the links provided on the GitHub page. Please help me to run the tool; it is very essential for my analysis.

Subhajeet1997 commented 1 year ago

Hello, I can't download the database with the default method, so I have downloaded the pre-built database named "hgtdb_20230102" and unzipped it. It contains the files db.faa, genome.map.gz, genomes.tsv, lineages.txt, taxdump and taxon.map.gz. I then tried the manual database compilation using the following commands:

    echo $'accession.version\ttaxid' | cat - <(zcat taxon.map.gz) > prot.accession2taxid.FULL
    diamond makedb --threads 50 --in db.faa --taxonmap prot.accession2taxid.FULL --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db db

It shows the following error:

    Error: Invalid taxonomy mapping file format.

Please help.
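As a minimal sanity check (a sketch, assuming prot.accession2taxid.FULL sits in the current directory), one can print the first two lines of the generated mapping file to confirm it starts with the tab-separated header DIAMOND expects:

    # print the header line and the first data row of the taxonomy mapping file
    with open('prot.accession2taxid.FULL') as f:
        header = f.readline().rstrip('\n')
        first = f.readline().rstrip('\n')

    print(repr(header))  # expected: 'accession.version\ttaxid'
    print(repr(first))   # an accession, a tab, and a numeric taxid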

qiyunzhu commented 12 months ago

Hello @Subhajeet1997 Thanks for the follow-up. I just tried to compile the "hgtdb_20230102" database using DIAMOND v2.1.8 (the latest version), and it worked. I also tried it on the demo database "ref107", and it worked too. Therefore, I am afraid I cannot reproduce the error you encountered. Which DIAMOND version did you use? If it's too old (like 0.7.x), there could be a problem. Otherwise, you can check the integrity of the downloaded database file; there is an MD5 checksum attached in the repository for this purpose.
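A minimal sketch of such an integrity check (the archive name and expected hash below are placeholders; substitute the file you actually downloaded and the MD5 value published in the repository):

    import hashlib

    expected = 'md5-value-from-the-repository'  # placeholder for the published checksum

    md5 = hashlib.md5()
    # hash the downloaded archive in 1 MB chunks so memory use stays low
    with open('hgtdb_20230102.tar.xz', 'rb') as f:  # placeholder archive name
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)

    print('OK' if md5.hexdigest() == expected else 'MISMATCH')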

qiyunzhu commented 12 months ago

Also, I just built a small custom database using the hgtector database command and didn't get the Killed error. I did some searching and found that this error might be related to a memory leak. I don't know how to handle this...
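If memory is indeed the culprit, one rough way to probe it (a sketch, reusing one of the file names from the log above) is to decompress a single downloaded file in a streaming fashion, which keeps memory use small, and see whether that succeeds on the same machine:

    import gzip
    import shutil

    # decompress one downloaded file in streaming mode, holding only a small
    # buffer in memory at a time; if this works while the extraction step is
    # still killed, running out of memory is a plausible explanation
    src = 'GCF_963394915.1_CCUG_26878_T_protein.faa.gz'
    with gzip.open(src, 'rb') as fin, open(src[:-3], 'wb') as fout:
        shutil.copyfileobj(fin, fout, length=1 << 20)
    print('done')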

Subhajeet1997 commented 12 months ago

Yes, you are right, my DIAMOND is an older version (v0.9.25.126). I will update DIAMOND and try to compile the database again. For now, I have compiled the database using makeblastdb; it compiled successfully and I have run one search using BLAST. It is obviously slow compared to DIAMOND, taking 2-2.5 days to run, so I am waiting for the output. I hope I will get some results.

Subhajeet1997 commented 11 months ago

Hey, the BLAST run finished successfully and I got results. But I have another query: what are the default parameters for --maxhits, --evalue, --identity and --coverage? Since I ran in default mode, is that acceptable?

qiyunzhu commented 11 months ago

Hi @Subhajeet1997 The default parameters are stored in config.yml:

  # search cutoffs
  maxseqs: 500        # maximum number of sequences to return
  evalue: 1.0e-5      # maximum E-value cutoff (note: keep decimal point)
  identity: 0         # minimum percent identity cutoff
  coverage: 0         # minimum percent query coverage cutoff

  # hits filtering
  maxhits: 0          # maximum number of hits to preserve (0 for unlimited)
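A quick way to see which values are in effect on a given installation (a sketch, assuming PyYAML is installed and config.yml is in the working directory) is to load and print the parsed file:

    # load HGTector's config.yml and print the parsed settings
    import yaml

    with open('config.yml') as f:
        print(yaml.safe_load(f))
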
kirtivel commented 11 months ago

Hello Prof. Zhu (@qiyunzhu), I have a question about creating a custom database from specific taxa. I want to identify HGT genes in a Metabacillus strain, and I presumed that I cannot download the entire repository of bacterial faa files. Hence, after installing HGTector 2.0, I ran the following command to exclude all taxa except Bacillota:

    hgtector database -c bacteria -o db1 -t 1117,766,57723,201174,200783,67819,67818,976,1936987,3018035,67814,29547,1930617,204428,1090,200795,200938,2138240,200930,1297,68297,74152,65842,32066,142182,1134404,256845,544448,2818505,1293497,40117,203682,1224,1853220,203691,508458,200940,3027942,200918,74201 -e

The taxonomy ID for Bacillota is 1239, which is what I want to keep. But even this is taking an awfully long time (approx. 13 h). The download is proceeding without any error, but it's too slow. The following are my system and Wi-Fi details:

  1. Lenovo Ideapad, 16GB memory
  2. 12th Gen i5 - 1235U x 12 Processor
  3. 1 TB Disk capacity
  4. OS = Ubuntu 22.04.3 LTS
  5. Wifi speed = 32Mb/s

Do I need more disk space for this download? Or is there anything wrong with the command? If you think my disk space is not enough, could you suggest another way to do this? Thank you.