bheimbu commented 1 year ago

Hi @hbckleikamp,

I'm eager to use your scripts to build a diamond database, but when I use python GTDB_protein_download.py, I only get retry retry retry. And nothing more happens...

Can you help me here please,

Cheers Bastian

bheimbu commented 1 year ago

Hi @hbckleikamp,

I've adjusted your python script and it now runs. ChatGPT helped me to debug your code. Here is the adjusted GTDB_protein_download.py script, with progress status now.

Cheers @bheimbu

hbckleikamp commented 1 year ago

Ah sorry bheimbu for the late response, I will update your suggestions. I think the issue could be related to the hard-coded urls: meta_urls=["https://data.gtdb.ecogenomic.org/releases/latest/ar122_taxonomy.tsv.gz", "https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz"] GTDB updates the names of these files regularly with new versions, so these names are not automatically updated. You could change the names manually from this website: https://data.gtdb.ecogenomic.org/releases/latest/

But I will also add something that detects the right files automatically. Kind regards and happy to help with further questions, Hugo.

bheimbu commented 1 year ago

Hi,

it is actually related to the correct import of urllib in python3. So you simply have to change this line of your code import urllib, gzip, zipfile, shutil, tarfile to import urllib.request, gzip, zipfile, shutil, tarfile and it works.

Cheers @bheimbu

hbckleikamp commented 1 year ago

Thanks, I added your corrections. Also there is a small update in: GTDB_protein_LCA.py GTDB_protein_download.py To work for current GTDB versions as well.

bheimbu commented 1 year ago

Thanks,

I'll try.

bheimbu commented 1 year ago

Hi,

you have to include from tqdm import tqdm and import requests to GTDB_protein_download.py to make it work. Otherwise the progress status is not shown properly and the code gives "retry retry...".

hbckleikamp commented 1 year ago

Thank you, I added it.

bheimbu commented 1 year ago

Hi,

just to let you know, the script is still somehow wrong. It restarts the download again and again; never decompressing the files. I will check this again tmrrw.

Cheers Bastian

hbckleikamp commented 1 year ago

Ok sorry about that Bastian, I have an updated code for downloading and extraction in another repository https://github.com/hbckleikamp/CHEW/blob/main/Setup/4_Download_unclustered_database.py I will look to adjust the update the code with this function and your progress bar but I can only look at that on Thursday, because I am moving to another country.

bheimbu commented 1 year ago

No worries,

I've downloaded and extracted the files by hand, now GTDB_protein_rename.py is running.

Cheers Bastian

ps: Move save (I'm not sure if you actually say that in English ;) )

bheimbu commented 1 year ago

Hi,

can you tell me how to run GTDB_protein_LCA.py? I'm not sure what kind of input is required. And what the command line should look like?

Cheers Bastian

hbckleikamp commented 1 year ago

Hey Bastian, These are some very simple sets of scripts, there was no command line integration made.

Did you run a DIAMOND alignment? There are 2 steps in between the python scripts that are executed with DIAMOND. "" -4. Construction of DIAMOND database from output of 3. (diamond --makedb, see: https://github.com/bbuchfink/diamond/wiki) -5. Annotation of proteins with DIAMOND database constructed in 4. (diamond --blastp, see: https://github.com/bbuchfink/diamond/wiki) "" Completing these two steps will give you an output .tsv file. The path to this file can be copy pasted into the script GTDB_protein_LCA.py

See parameters below:

%% parameters

input_filepath = "diamond_output.tsv" #placeholder filepath for input file

I hope this is clear. Kind regards, Hugo.

bheimbu commented 1 year ago

Ok, I see. I'll let you know whether it worked or not. Do I have to use a specific command for diamond blastp or simply diamond blastp -d reference -q queries.fasta -o matches.tsv?

Cheers

hbckleikamp commented 1 year ago

Yes the latter would be fine, it is coded to operate with default parameters. You could use any other parameters here if you want, there are many options in DIAMOND to suit your needs.

The only requirments are: -The output needs to contain at least the following columns: "sseqid","qseqid","bitscore"

-And the diamond_output_columns are manually supplied in the following variable: diamond_output_columns=["qseqid", "sseqid","pident", "length", "mismatch", "gapopen" , "qstart","qend","sstart","send", "evalue","bitscore"] #default diamond output collumns

So if you change the output format, you should also change this variable in the script. Kind regards, Hugo.

bheimbu commented 1 year ago

I've build my GTDB database like this:

diamond makedb --threads 40 --in ../../GTDB2DIAMOND/GTDB/GTDB_merged/GTDB_merged.fasta -d gtdb --taxonmap gtdb/prot.accession2taxid.FULL.gz --taxonnodes gtdb/nodes.dmp --taxonnames gtdb/names.dmp.

hbckleikamp commented 1 year ago

Ok I have not run DIAMOND myself using the taxonmap annotation, I am not sure how this would interact with gtdb, since this is built around ncbi sequences. For this GTDB annotation specifically I would recommend leaving that section out, since I don't think the GTDB headers will link to the taxonmap, but you can always try.

So: diamond makedb --threads 40 --in ../../GTDB2DIAMOND/GTDB/GTDB_merged/GTDB_merged.fasta -d gtdb

bheimbu commented 1 year ago

You may be right. Because when I use my built database and following line of code: diamond blastx --db databases/gtdb.dmnd --query {input.faa} --outfmt 102 --out blastx/{wildcards.sample}_blastx.txt --max-hsps 0 --evalue 1e-24 --threads {params.threads}

This command throws an error if the database is built without taxonnames and taxonnodes. But gives me only zeros as output (no sequences are classified).

bheimbu commented 1 year ago

Or maybe this is rather related to the fact that I use blastx instead of blastp, I dont know? I'm a total newbie to functional annotation, actually. I'm more or less following this approach here, despite the fact that the scripts have several shortcomings.

hbckleikamp commented 1 year ago

Ah you use blastx ok, I have no experience with that. Let's see what results you get and if they are compatible with the output columns I mentioned above.

bheimbu commented 1 year ago

Ok. I let you know asap.

bheimbu commented 1 year ago

Ok, I've copied the blast output to the GTDB2DIAMOND directory and changed the output name to diamond_output.tsv. Now the script throws an error out about a missing folder calledGTDB-metadata?! See here:

$ python GTDB_protein_LCA.py 
/home//microbiome/functional_annotation/GTDB2DIAMOND
Traceback (most recent call last):
  File "/home/microbiome/functional_annotation/GTDB2DIAMOND/GTDB_protein_LCA.py", line 26, in <module>
    metadata_filepaths=[str(Path(basedir,"GTDB-metadata",i)) for i in os.listdir(str(Path(basedir,"GTDB-metadata"))) if i.endswith("_taxonomy.tsv")]
                                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/microbiome/functional_annotation/GTDB2DIAMOND/GTDB-metadata'

Cheers Bastian

hbckleikamp commented 1 year ago

Yes this should normally have been downloaded with the download script as well, but it is not built since you downloaded the sequences manually.

These are the urls you can use to download the metadata. https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv Then you can put them in the appropriate folder: '/home/microbiome/functional_annotation/GTDB2DIAMOND/GTDB-metadata'

bheimbu commented 1 year ago

I see. I will give it a try.

bheimbu commented 1 year ago

Ok, it works, but the results have no resembles to diamond blastp --outfmt 102 ..., which looks like this

scan3   0   0
scan7   0   0
scan9   0   0
scan13  0   0

I don't know how to go on... I'm trying to build the gtdb database with another program, hoping that this will solve my problems with --outfmt 102.

hbckleikamp commented 1 year ago

As I mentioned above: """ The only requirments are: -The output needs to contain at least the following columns: "sseqid","qseqid","bitscore"

-And the diamond_output_columns are manually supplied in the following variable: diamond_output_columns=["qseqid", "sseqid","pident", "length", "mismatch", "gapopen" , "qstart","qend","sstart","send", "evalue","bitscore"] #default diamond output collumns

So if you change the output format, you should also change this variable in the script. """

Please study the documentation of DIAMOND: https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options

Output format 102 will give you an NCBI classification, which will not work because this is a GTDB database not an NCBI database. You can use the default DIAMOND output format (6) if you want to use this tool.

bheimbu commented 1 year ago

Ah ok. I see.

hbckleikamp / GTDB2DIAMOND

retry #1

%% parameters