Open romseg opened 1 year ago
Dear Rom,
thanks for all the details you provided.
In order to annotate proteins with InterPro descriptions two additional input files are needed: An XML file of the InterPro database and tab separated file with the results of an InterProScan run. You can find example file stubs in the AHRD/test/resources folder: interpro_31.xml and interpro_result.raw The corresponding YAML-Keys are "interpro_database" and "interpro_result".
It looks likes you provided everything necessary for the annotation of your proteins with GO terms. If AHRD doesn't annotate GO terms at al, usually the core of problem is a mismatch of protein identifiers between the BLAST results and the GAF file. AHRD expects something like "sp|D3ZZX1|I5P1_RAT" and extracts the accession ("D3ZZX1" in this example) to find the annotations in the GAF file. Because you have this id-format in you results it looks like it should work. Can you post a short part of your BLAST results anyway? Just so I can be sure?
In general we can also recommend to use our new wrapper for AHRD: https://github.com/groupschoof/AHRD_Snakemake You'll have to install some additional software but this makes it basically foolproof to run AHRD. It automatically downloads the necessary data, performs the sequence similarity searches and then builds and runs AHRD. If you have a large query file it might need a lot of memory (see the README for details) but considering you used "-Xmx30g" when you started AHRD this should not be a problem for you.
Let me know how you proceeded!
Cheers! Florian
Hi Florian,
Thanks so much for offering solutions. Below is a part of my BLAST results:
anno1.g20065.t1 sp|Q9SE42|RPE1_ORYSJ 67.647 34 11 0 40 73 8 41 2.40e-09 54.3
anno1.g20065.t1 sp|O14105|RPE_SCHPO 50.000 34 17 0 40 73 5 38 6.04e-06 45.1
anno1.g20065.t1 sp|P46969|RPE_YEAST 48.485 33 17 0 41 73 6 38 8.85e-06 44.7
anno1.g20065.t1 sp|Q6FL81|RPE_CANGA 45.455 33 18 0 41 73 6 38 3.28e-05 43.1
anno1.g20065.t1 sp|Q8VEE0|RPE_MOUSE 47.059 34 18 0 40 73 6 39 2.46e-04 40.8
anno1.g20065.t1 sp|Q5R5Y2|RPE_PONAB 47.059 34 18 0 40 73 6 39 2.72e-04 40.4
anno1.g20065.t1 sp|Q96AT9|RPE_HUMAN 47.059 34 18 0 40 73 6 39 2.89e-04 40.4
anno1.g20065.t1 sp|Q755M2|RPE_ASHGO 42.424 33 19 0 41 73 6 38 5.03e-04 39.7
anno1.g20065.t1 sp|Q2QD12|RPEL1_HUMAN 44.118 34 19 0 40 73 6 39 0.001 38.9
anno1.g20065.t1 sp|P51012|RPE_RHOCA 41.176 34 20 0 40 73 8 41 0.001 38.9
anno1.g20065.t1 sp|Q43843|RPE_SOLTU 36.364 33 21 0 41 73 59 91 0.004 37.4
anno1.g20065.t1 sp|P44756|RPE_HAEIN 39.394 33 20 0 41 73 6 38 0.004 37.4
anno1.g20065.t1 sp|P74061|RPE_SYNY3 36.364 33 21 0 41 73 7 39 0.005 37.0
Please I will be looking forward to pinpoint the problem of missing GO terms and fixing it using the regular version of AHRD. On my computing system I have restrictions on using Conda for installation and disk space limitations for singularity, which makes a bit difficult to use AHRD_Snakemate. Thanks again.
Best regards, Romulo
Dear Romulo,
the problem is caused by a small change Uniprot made to their GAF file format. I will soon update AHRDs master branch accordingly. In the meantime you can include the following line in your yml file instead:
reference_go_regex: "^UniProtKB\\t(?<shortAccession>[^\\t]+)\\t[^\\t]+\\t(?!NOT\\|)[^\\t]*\\t(?<goTerm>GO:\\d{7})"
I tested it and I am pretty confident that it will work, but let me know if you had success anyway.
Cheers! Florian
Dear Florian,
You were right. I tried it again adding the reference_go_regex and it worked. This time I got the GO terms in my results file. Thanks a lot for your help.
One additional question, I want to add the TAIR10 Arabidopsis thaliana database to my analysis, but I am not sure where to get it from, in the proper format for AHRD. I would appreciate it having a link to the source. Thanks again.
Best regards, Romulo
Dear Romulo,
I'm glad to hear that it worked.
At first glance it makes sense to add TAIR10 if you want to annotate plant proteins. It's probably the most comprehensively annotated plant proteome. But: At some point the database went proprietary and the last public version is very old by now (https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FProteins%2FTAIR10_protein_lists). And it is our understanding that all useful information from it, has long been incorporated into the UniProtKB. Additionally, it is likely that the annotations in the UniprotKB have been updated with discoveries made in the decade since the public TAIR10 was abandoned.
Best wishes, Florian
Dear Florian,
Yes, I am annotating plant proteins, that's why my interest on TAIR, and it's good to know that TAIR is also incorporated in the larger UniprotKB. However when I used UniprotKB (uniprot_sprot.fasta) as database the annotation results include matches to other species like human and rat alongside A. thaliana (please see first ten lines below). So I am wondering if having matches to no-plant species is Okay?
# AHRD-Version 3.11
Protein-Accession Blast-Hit-Accession AHRD-Quality-Code Human-Readable-Description Interpro-ID (Description) Gene-Ontology-Term
anno2.g24355.t1 sp|D3ZZX1|I5P1_RAT --* Inositol polyphosphate-5-phosphatase A GO:0003824, GO:0004445, GO:0005886, GO:0016020, GO:0016787, GO:0016791, GO:0030425, GO:0042731, GO:0042995, GO:0046856, GO:0048015, GO:0048016, GO:0052658, GO:0052659, GO:0052745, GO:1901215
anno1.g1823.t1 sp|Q9LTA2|AHL17_ARATH *** AT-hook motif nuclear-localized protein 17 GO:0003677, GO:0003680, GO:0003700, GO:0005634, GO:0006355
anno1.g27546.t1 sp|Q5R5Q2|SRRM1_PONAB --* Serine/arginine repetitive matrix protein 1 GO:0003677, GO:0003723, GO:0005634, GO:0005681, GO:0006397, GO:0008380
anno2.g13215.t1 sp|Q9H6B1|Z385D_HUMAN --* Zinc finger protein 385D GO:0003676, GO:0005634, GO:0008270, GO:0046872, GO:1990837
anno2.g1283.t1 sp|Q9AVJ9|MXMT1_COFAR *** Monomethylxanthine methyltransferase 1 GO:0005737, GO:0008168, GO:0009820, GO:0016740, GO:0032259, GO:0046872
anno1.g17744.t1 sp|Q0AB61|MIAA_ALKEH --* tRNA dimethylallyltransferase GO:0000166, GO:0005524, GO:0008033, GO:0016740, GO:0016765, GO:0052381
anno1.g13382.t1 sp|Q31DI1|PURL_PROM9 --* Phosphoribosylformylglycinamidine synthase subunit PurL GO:0000166, GO:0000287, GO:0004642, GO:0005524, GO:0005737, GO:0006164, GO:0006189, GO:0016874, GO:0046872
anno1.g15563.t1 sp|Q941R6|MLP31_ARATH *** MLP-like protein 31 GO:0006952
anno1.g9793.t1 sp|P86276|GDL1_CARPA *** GDSL esterase/lipase GO:0005576, GO:0006629, GO:0016042, GO:0016787, GO:0016788
anno1.g30716.t1 sp|Q945Q1|CYT1_ARATH --* Cysteine proteinase inhibitor 1 GO:0000325, GO:0004869, GO:0005783, GO:0005829, GO:0005886, GO:0006952, GO:0009536, GO:0009611, GO:0010466, GO:0030414, GO:0034605, GO:0042631, GO:0070417
As a test, I also launched AHRD using the free old TAIR10 database (TAIR10_pep_20101214.fa) that you mentioned earlier. The annotation results (please see first ten lines below) obviously contain matches only to A. thaliana. I am wondering which annotations are better to use, the results obtained with the UniprotKB or TAIR10? I will appreciate it to have your opinion on this. I got 210 unknown proteins when using UniProtKB and 390 unknown proteins when using TAIR10. When using TAIR10 the GO-Ontology-Term field was empty.
# AHRD-Version 3.11
Protein-Accession Blast-Hit-Accession AHRD-Quality-Code Human-Readable-Description Interpro-ID (Description) Gene-Ontology-Term
anno2.g24355.t1 AT5G57770.1 *** Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region
anno1.g1823.t1 AT1G63470.1 *-* AT hook motif DNA-binding family protein
anno1.g27546.t1 Unknown protein
anno2.g13215.t1 AT1G24256.1 --* FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: cellular_component unknown; BEST Arabidopsis thaliana protein match is: Putative endonuclease or glyc s in 7 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 93; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink).
anno2.g1283.t1 AT4G36470.1 *** S-adenosyl-L-methionine-dependent methyltransferases superfamily protein
anno1.g17744.t1 AT4G33840.1 --* Glycosyl hydrolase family 10 protein
anno1.g13382.t1 AT5G36228.1 --* nucleic acid binding;zinc ion binding
anno1.g15563.t1 AT5G28010.1 *** Polyketide cyclase/dehydrase and lipid transport superfamily protein
anno1.g9793.t1 AT1G71250.1 *** GDSL-like Lipase/Acylhydrolase superfamily protein
anno1.g30716.t1 AT5G05110.1 --* Cystatin/monellin family protein
Thank you, Romulo
Dear Romulo,
in general, we don't consider matches to species from other kingdoms to be a problem. AHRD basically takes the most similar protein from the BLAST results that has a meaningful description that also is unlikely to have been transferred from another protein in the past. So maybe there have been plant proteins in the BLAST results for eg. "anno2.g24355.t1" but they were electronically annotated themselves or their sequence is less similar to your query than the rat protein. You must also consider that many genes involved in the basic metabolism are conserved between animals and plants. The animal versions of these genes can be better annotated than any plant version. Also, proteins often have multiple functions (so called "Moonlighting proteins"). Human readable descriptions are often unsuited to reflect this properly. In contrast sets of GO terms are able to reflect this by design. So for example a plant protein might be annotated with a description involving breast cancer. This looks very wrong on the surface. But the query protein might simply be involved in cell circle regulation and the GO annotation may make this much more apparent and more reasonable for a plant protein. So if something like this "breast cancer gene" is the most similar protein with a "non garbage" description it is actually more often a better source for an annotation than not. In principal AHRD is species agnostic but you can provide it with multiple databases and use the "weight" parameter to introduce a bias towards a specific one. So if you use swissprot and tair simultaneously you can make AHRD prefer tair over swissprot if other factors are similar. I think the question which annotations are better is easy to answer: The descriptions from TAIR are just not as good. For example for your query "anno2.g24355.t1" you get "Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region". It mentions another protein and domain but not what it does. "Inositol polyphosphate-5-phosphatase A" from "sp|D3ZZX1|I5P1_RAT" actually gives you information about its function. Take the annotations for "anno2.g13215.t1" as another example "Zinc finger protein 385D" vs. "FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: cellular_component unknown; BEST Arabidopsis thaliana protein match is: Putative endonuclease or glyc s in 7 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 93; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink)." I know which one I would prefer. The fact AHRD can annotate 180 proteins less when only using TAIR is another argument in favor of swissprot.
Ultimately these are all moot points if you can't provide GO annotations for the target protein to AHRD. You can download annotations for TAIR from their FTP server: "ftp://ftp.arabidopsis.org/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.txt.gz". But again: These are stuck in time at some point around a decade ago. AHRD expects the reference GO annotations to be in the GAF format provided by Uniprot. So you would have to change the regular expression provided to the "reference_go_regex" yml-key to match the file provided by TAIR.
Best wishes, Florian
Hi Florian,
Thanks for taking the time to give such comprehensive explanation. It is very clear that the results obtained using the Uniprot database are of better quality and more informative. There is no reason to use TAIR for my downstream analysis, but it served to help me understand the differences in the results. I have no more questions for now. Again thanks for your time and for creating AHRD that is being so useful on my work.
Best regards, Romulo
Hi Romulo,
glad to hear that I was able to help you.
One more thing: The best way to use AHRD ist to use not only SwissProt but also TrEMBL. Admittedly annotations in TrEMBL are not as good as in Swissprot but TrEMBL can be used as a sort of fallback if SwissProt can't offer any matches at all. You just have to make sure TrEMBLs "weight" parameter is set lower than SwissProts so AHRD prioritizes them correctly. We measured the differences in terms of annotation quality and coverage and they're not huge but significant. The goa_uniprot_all.gaf already contains the GO annotations for the TrEMBL proteins (that's the reason it's so obscenely large). So you wouldn't have to add anything in this regard. But of course you would have to BLAST your query proteins in TrEMBL and provide AHRD with the fasta file of the database itself. There are ways to mitigate the impact of this in terms of storage space and computation time: Instead of TrEMBL you can use Uniref100, Uniref90 (our recommendation) or even Uniref50. They are significantly smaller because a lot of redundancy is omitted. And we also can recommend using DIAMOND (https://github.com/bbuchfink/diamond) instead of BLAST. The quality of its search results is basically indistinguishable from BLAST but it's orders of magnitude quicker. It also generates results in the exact same format as BLAST so it can be used as a drop-in replacement.
Cheers, Florian
Hi Florian,
Thanks a lot for this recommendation. Indeed, I downloaded the TrEMBL database to my system but decided not to use it because of its large size, but I would like to use Uniref90. What would be the right weight parameters to use? so that I prioritize SwissProt over TrEMBL. So far I have been using the default in the yml config file below:
swissprot:
weight: 653
description_score_bit_score_weight: 2.717061
trembl:
weight: 904
description_score_bit_score_weight: 2.590211
Thanks again.
Romulo
Hi!
The higher the weight the greater the priority when choosing a candidate protein from the search results. So I would switch the weights around:
swissprot:
weight: 904
description_score_bit_score_weight: 2.717061
trembl:
weight: 653
description_score_bit_score_weight: 2.590211
Cheers!
Great! Thanks.
Best wishes, Romulo
Hi Florian,
Sorry but what would be the fasta_header_regex for the uniref90.fasta database? I realized that this database has different headers and when I launched AHRD with Uniref90 only as a test the following report error was returned at the end of the run:
Started AHRD...
...initialised proteins in 0sec, currently occupying 62 MB
WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'UniRef90_A0A0K9QXQ1' - continuing with the original accession. This might lead to unrecognized reference GO annotations!
WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'UniRef90_A0A6N2MDR4' - continuing with the original accession. This might lead to unrecognized reference GO annotations!
WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'UniRef90_A0A5K1B7Z6' - continuing with the original accession. This might lead to unrecognized reference GO annotations!
>UniRef90_K0JAJ8 Glycoprotein (Fragment) n=1 Tax=HIV-1 M:G_PT227 TaxID=1070697 RepID=K0JAJ8_9HIV1
does not match provided regular expression
^>(?<accession>\S+)\s+(?<description>.+?)\s+(((OS|os)=.+)|((GN|gn)=.+))?$
. The header and the following entry, including possibly respective matching BLAST Hits, are ignored and discarded.
To fix this, please use - Blast database specific - parameter fasta_header_regex to provide a regular expression that matches ALL FASTA headers in Blast database 'trembl'.
The uniref90.fasta database header look like this:
>UniRef90_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa tibetana TaxID=1572043 RepID=A0A5A9P0L4_9TELE
MEEITQIKKRLSQTVRLEGKEDLLSKKDSITNLKTEEHVSVKKMVISEPKPEKKEDIQLK
The yml file I used to launch AHRD:
proteins_fasta: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
gene_ontology_result: /home/rsegovia/scratch/rseg/databases/uniprot/goa_uniprot_all.gaf
prefer_reference_with_go_annos: true
output: ./ahrd_output.csv
reference_go_regex: "^UniProtKB\\t(?<shortAccession>[^\\t]+)\\t[^\\t]+\\t(?!NOT\\|)[^\\t]*\\t(?<goTerm>GO:\\d{7})"
blast_dbs:
trembl:
weight: 904
description_score_bit_score_weight: 2.590211
file: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa_vs_uniref90.tsv
database: /home/rsegovia/scratch/rseg/databases/uniprot/uniref90.fasta
blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_descline.txt
filter: /home/rsegovia/software/AHRD/test/resources/filter_descline_trembl.txt
token_blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_token.txt
I used Diamond for alignment and the file looks like this:
anno1.g20065.t1 UniRef90_A0A0K9QXQ1 79.4 34 7 0 40 73 6 39 1.14e-10 60.5
anno1.g20065.t1 UniRef90_A0A6N2MDR4 79.4 34 7 0 40 73 6 39 1.54e-10 60.5
anno1.g20065.t1 UniRef90_A0A5K1B7Z6 76.5 34 8 0 40 73 5 38 3.71e-10 58.9
anno1.g20065.t1 UniRef90_A0A834Z5V1 66.7 39 13 0 40 78 4 42 3.77e-10 59.7
Thank you.
Romulo
Hi!
Yes, Uniref header lines have a different format. I forgot about that, sorry.
Try using short_accession_regex:"^[^_]+_(?<shortAccession>[^_]+)$"
as a sub yml key under your trembl:
key.
Cheers!
Hi Florian,
Still it didn't work when using Uniref90 database. I have the below error report. It looks a bit different than before in that the WARNING: Regular Expression" portion is not there anymore:
Started AHRD...
...initialised proteins in 1sec, currently occupying 62 MB
WARNING: FASTA header line
>UniRef90_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa tibetana TaxID=1572043 RepID=A0A5A9P0L4_9TELE
does not match provided regular expression
^>(?<accession>\S+)\s+(?<description>.+?)\s+(((OS|os)=.+)|((GN|gn)=.+))?$
. The header and the following entry, including possibly respective matching BLAST Hits, are ignored and discarded.
To fix this, please use - Blast database specific - parameter fasta_header_regex to provide a regular expression that matches ALL FASTA headers in Blast database 'trembl'.
WARNING: FASTA header line
>UniRef90_A0A410P257 Glycogen synthase n=2 Tax=Candidatus Velamenicoccus archaeovorus TaxID=1930593 RepID=A0A410P257_9BACT
does not match provided regular expression
^>(?<accession>\S+)\s+(?<description>.+?)\s+(((OS|os)=.+)|((GN|gn)=.+))?$
. The header and the following entry, including possibly respective matching BLAST Hits, are ignored and discarded.
To fix this, please use - Blast database specific - parameter fasta_header_regex to provide a regular expression that matches ALL FASTA headers in Blast database 'trembl'.
The yml file I used looks like this:
proteins_fasta: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
gene_ontology_result: /home/rsegovia/scratch/rseg/databases/uniprot/goa_uniprot_all.gaf
prefer_reference_with_go_annos: true
output: ./ahrd_output.csv
reference_go_regex: "^UniProtKB\\t(?<shortAccession>[^\\t]+)\\t[^\\t]+\\t(?!NOT\\|)[^\\t]*\\t(?<goTerm>GO:\\d{7})"
blast_dbs:
trembl:
weight: 904
description_score_bit_score_weight: 2.590211
file: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa_vs_uniref90.tsv
database: /home/rsegovia/scratch/rseg/databases/uniprot/uniref90.fasta
blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_descline.txt
filter: /home/rsegovia/software/AHRD/test/resources/filter_descline_trembl.txt
token_blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_token.txt
short_accession_regex: "^[^_]+_(?<shortAccession>[^_]+)$"
The resulting csv file:
# AHRD-Version 3.11
Protein-Accession Blast-Hit-Accession AHRD-Quality-Code Human-Readable-Description Interpro-ID (Description) Gene-Ontology-Term
anno2.g24355.t1 Unknown protein
anno1.g1823.t1 Unknown protein
anno1.g27546.t1 Unknown protein
The uniref90.fasta database header looks like in my previous post. I would appreciate it so much your help. Thank you.
Romulo
Dear Romulo,
sorry for the enormous delay in my response. I was on summer vacation.
The problem is basically laid out by AHRD in its error messages: The regular expression doesn't match the fasta header lines in uniref.
Add fasta_header_regex: ^>(?<accession>\S+)\s+(?<description>.+?)\s+n=\d.+$
again as a sub yml-key under your trembl:
key.
Hope this helps and is still useful to you, Florian
Dear Florian,
Thanks for providing the regex for the Uniref90 database. For now I am using the SwissProt for my downstream analysis, but I will test and switch to Uniref90 as soon as I have the chance, and it's great to have already the necessary information for the script to work. Thanks again for your kind assistance.
Best regards, Romulo
Dear author,
I got my results csv file with empty Gene-Ontology-Term and Interpro-ID (Description) fields. I am wondering how I can add those fields to the results, especially the GO term. I only used the Uniprot/Swissprot as database.
The ahrd_output.csv result file looks like this:
I launched my job with my the following:
My ahrd_example_input_go_prediction.yml:
I downloaded the Uniprot/Swissprot database from here:
The headers of uniprot_sprot.fasta file look like this:
I downloaded the goa_uniprot_all.gaf.gz file containing GO annotations for all UniprotKB proteins from here:
The goa_uniprot_all.gaf file look like this:
The report file:
I would appreciate your help. Thank you.
Rom