Open bheimbu opened 1 year ago
Hi @hbckleikamp,
I've adjusted your python script and it now runs. ChatGPT helped me to debug your code. Here is the adjusted GTDB_protein_download.py script, with progress status now.
Cheers @bheimbu
Ah sorry bheimbu for the late response, I will update your suggestions. I think the issue could be related to the hard-coded urls: meta_urls=["https://data.gtdb.ecogenomic.org/releases/latest/ar122_taxonomy.tsv.gz", "https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz"] GTDB updates the names of these files regularly with new versions, so these names are not automatically updated. You could change the names manually from this website: https://data.gtdb.ecogenomic.org/releases/latest/
But I will also add something that detects the right files automatically. Kind regards and happy to help with further questions, Hugo.
Hi,
it is actually related to the correct import of urllib in python3. So you simply have to change this line of your code import urllib, gzip, zipfile, shutil, tarfile
to import urllib.request, gzip, zipfile, shutil, tarfile
and it works.
Cheers @bheimbu
Thanks, I added your corrections. Also there is a small update in: GTDB_protein_LCA.py GTDB_protein_download.py To work for current GTDB versions as well.
Thanks,
I'll try.
Hi,
you have to include from tqdm import tqdm
and import requests
to GTDB_protein_download.py to make it work. Otherwise the progress status is not shown properly and the code gives "retry retry...".
Thank you, I added it.
Hi,
just to let you know, the script is still somehow wrong. It restarts the download again and again; never decompressing the files. I will check this again tmrrw.
Cheers Bastian
Ok sorry about that Bastian, I have an updated code for downloading and extraction in another repository https://github.com/hbckleikamp/CHEW/blob/main/Setup/4_Download_unclustered_database.py I will look to adjust the update the code with this function and your progress bar but I can only look at that on Thursday, because I am moving to another country.
No worries,
I've downloaded and extracted the files by hand, now GTDB_protein_rename.py
is running.
Cheers Bastian
ps: Move save (I'm not sure if you actually say that in English ;) )
Hi,
can you tell me how to run GTDB_protein_LCA.py
? I'm not sure what kind of input is required. And what the command line should look like?
Cheers Bastian
Hey Bastian, These are some very simple sets of scripts, there was no command line integration made.
Did you run a DIAMOND alignment? There are 2 steps in between the python scripts that are executed with DIAMOND. "" -4. Construction of DIAMOND database from output of 3. (diamond --makedb, see: https://github.com/bbuchfink/diamond/wiki) -5. Annotation of proteins with DIAMOND database constructed in 4. (diamond --blastp, see: https://github.com/bbuchfink/diamond/wiki) "" Completing these two steps will give you an output .tsv file. The path to this file can be copy pasted into the script GTDB_protein_LCA.py
See parameters below:
input_filepath = "diamond_output.tsv" #placeholder filepath for input file
I hope this is clear. Kind regards, Hugo.
Ok, I see. I'll let you know whether it worked or not. Do I have to use a specific command for diamond blastp
or simply diamond blastp -d reference -q queries.fasta -o matches.tsv
?
Cheers
Yes the latter would be fine, it is coded to operate with default parameters. You could use any other parameters here if you want, there are many options in DIAMOND to suit your needs.
The only requirments are: -The output needs to contain at least the following columns: "sseqid","qseqid","bitscore"
-And the diamond_output_columns are manually supplied in the following variable: diamond_output_columns=["qseqid", "sseqid","pident", "length", "mismatch", "gapopen" , "qstart","qend","sstart","send", "evalue","bitscore"] #default diamond output collumns
So if you change the output format, you should also change this variable in the script. Kind regards, Hugo.
I've build my GTDB database like this:
diamond makedb --threads 40 --in ../../GTDB2DIAMOND/GTDB/GTDB_merged/GTDB_merged.fasta -d gtdb --taxonmap gtdb/prot.accession2taxid.FULL.gz --taxonnodes gtdb/nodes.dmp --taxonnames gtdb/names.dmp
.
Ok I have not run DIAMOND myself using the taxonmap annotation, I am not sure how this would interact with gtdb, since this is built around ncbi sequences. For this GTDB annotation specifically I would recommend leaving that section out, since I don't think the GTDB headers will link to the taxonmap, but you can always try.
So: diamond makedb --threads 40 --in ../../GTDB2DIAMOND/GTDB/GTDB_merged/GTDB_merged.fasta -d gtdb
You may be right. Because when I use my built database and following line of code: diamond blastx --db databases/gtdb.dmnd --query {input.faa} --outfmt 102 --out blastx/{wildcards.sample}_blastx.txt --max-hsps 0 --evalue 1e-24 --threads {params.threads}
This command throws an error if the database is built without taxonnames and taxonnodes. But gives me only zeros as output (no sequences are classified).
Or maybe this is rather related to the fact that I use blastx instead of blastp, I dont know? I'm a total newbie to functional annotation, actually. I'm more or less following this approach here, despite the fact that the scripts have several shortcomings.
Ah you use blastx ok, I have no experience with that. Let's see what results you get and if they are compatible with the output columns I mentioned above.
Ok. I let you know asap.
Ok, I've copied the blast output to the GTDB2DIAMOND directory and changed the output name to diamond_output.tsv
. Now the script throws an error out about a missing folder calledGTDB-metadata
?! See here:
$ python GTDB_protein_LCA.py
/home//microbiome/functional_annotation/GTDB2DIAMOND
Traceback (most recent call last):
File "/home/microbiome/functional_annotation/GTDB2DIAMOND/GTDB_protein_LCA.py", line 26, in <module>
metadata_filepaths=[str(Path(basedir,"GTDB-metadata",i)) for i in os.listdir(str(Path(basedir,"GTDB-metadata"))) if i.endswith("_taxonomy.tsv")]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/microbiome/functional_annotation/GTDB2DIAMOND/GTDB-metadata'
Cheers Bastian
Yes this should normally have been downloaded with the download script as well, but it is not built since you downloaded the sequences manually.
These are the urls you can use to download the metadata. https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv Then you can put them in the appropriate folder: '/home/microbiome/functional_annotation/GTDB2DIAMOND/GTDB-metadata'
I see. I will give it a try.
Ok, it works, but the results have no resembles to diamond blastp --outfmt 102 ...
, which looks like this
scan3 0 0
scan7 0 0
scan9 0 0
scan13 0 0
I don't know how to go on... I'm trying to build the gtdb database with another program, hoping that this will solve my problems with --outfmt 102
.
As I mentioned above: """ The only requirments are: -The output needs to contain at least the following columns: "sseqid","qseqid","bitscore"
-And the diamond_output_columns are manually supplied in the following variable: diamond_output_columns=["qseqid", "sseqid","pident", "length", "mismatch", "gapopen" , "qstart","qend","sstart","send", "evalue","bitscore"] #default diamond output collumns
So if you change the output format, you should also change this variable in the script. """
Please study the documentation of DIAMOND: https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options
Output format 102 will give you an NCBI classification, which will not work because this is a GTDB database not an NCBI database. You can use the default DIAMOND output format (6) if you want to use this tool.
Ah ok. I see.
Hi @hbckleikamp,
I'm eager to use your scripts to build a diamond database, but when I use
python GTDB_protein_download.py
, I only getretry retry retry
. And nothing more happens...Can you help me here please,
Cheers Bastian