groupschoof / AHRD

High throughput protein function annotation with Human Readable Description (HRDs) and Gene Ontology (GO) Terms.
https://www.cropbio.uni-bonn.de/
Other
64 stars 21 forks source link

Empty Gene-Ontology-Term and Interpro-ID (Description) fields in results #22

Open romseg opened 1 year ago

romseg commented 1 year ago

Dear author,

I got my results csv file with empty Gene-Ontology-Term and Interpro-ID (Description) fields. I am wondering how I can add those fields to the results, especially the GO term. I only used the Uniprot/Swissprot as database.

The ahrd_output.csv result file looks like this:

# AHRD-Version 3.11

Protein-Accession       Blast-Hit-Accession     AHRD-Quality-Code       Human-Readable-Description      Interpro-ID (Description)       Gene-Ontology-Term
anno2.g24355.t1 sp|D3ZZX1|I5P1_RAT      --*     Inositol polyphosphate-5-phosphatase A          
anno1.g1823.t1  sp|Q9LTA2|AHL17_ARATH   ***     AT-hook motif nuclear-localized protein 17              
anno1.g27546.t1 sp|Q5R5Q2|SRRM1_PONAB   --*     Serine/arginine repetitive matrix protein 1             
anno2.g13215.t1 sp|Q9H6B1|Z385D_HUMAN   --*     Zinc finger protein 385D                
anno2.g1283.t1  sp|Q9AVJ9|MXMT1_COFAR   ***     Monomethylxanthine methyltransferase 1          
anno1.g17744.t1 sp|Q0AB61|MIAA_ALKEH    --*     tRNA dimethylallyltransferase           
anno1.g13382.t1 sp|Q31DI1|PURL_PROM9    --*     Phosphoribosylformylglycinamidine synthase subunit PurL         
anno1.g15563.t1 sp|Q941R6|MLP31_ARATH   ***     MLP-like protein 31             
anno1.g9793.t1  sp|P86276|GDL1_CARPA    ***     GDSL esterase/lipase            

I launched my job with my the following:

java -Xmx30g -jar /home/software/AHRD/dist/ahrd.jar ./ahrd_example_input_go_prediction.yml

My ahrd_example_input_go_prediction.yml:

proteins_fasta: ./uthap1_braker.aa.fa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
gene_ontology_result: /home/scratch/databases/uniprot/goa_uniprot_all.gaf
prefer_reference_with_go_annos: true
output: ./ahrd_output.csv
blast_dbs:
  swissprot:
    weight: 653
    description_score_bit_score_weight: 2.717061
    file: ./uthap1_braker.aa.fa_vs_uniprot_sprot.txt
    database: /home/scratch/databases/uniprot/uniprot_sprot.fasta
    blacklist: /home/software/AHRD/test/resources/blacklist_descline.txt
    filter: /home/software/AHRD/test/resources/filter_descline_sprot.txt
    token_blacklist: /home/software/AHRD/test/resources/blacklist_token.txt

#  trembl:
#    weight: 904
#    description_score_bit_score_weight: 2.590211
#    file: ./test/resources/trembl_blast8_tabular.txt
#    database: ./test/resources/trembl_blast_db.fasta
#    blacklist: ./test/resources/blacklist_descline.txt
#    filter: ./test/resources/filter_descline_trembl.txt
#    token_blacklist: ./test/resources/blacklist_token.txt

#  tair:
#    weight: 854
#    description_score_bit_score_weight: 2.917405
#    file: ./test/resources/tair_blast8_tabular.txt
#    database: ./test/resources/tair_blast_db.fasta
#    fasta_header_regex: "^>(?<accession>[aA][tT][0-9mMcC][gG]\\d+(\\.\\d+)?)\\s+\\|[^\\|]+\\|\\s+(?<description>[^\\|]+)(\\s*\\|.*)?$"
#    short_accession_regex: "^(?<shortAccession>.+)$"
#    blacklist: ./test/resources/blacklist_descline.txt
#    filter: ./test/resources/filter_descline_tair.txt
#    token_blacklist: ./test/resources/blacklist_token.txt

I downloaded the Uniprot/Swissprot database from here:

wget https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz

The headers of uniprot_sprot.fasta file look like this:

>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1

I downloaded the goa_uniprot_all.gaf.gz file containing GO annotations for all UniprotKB proteins from here:

wget http://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz

The goa_uniprot_all.gaf file look like this:

!gaf-version: 2.2
!
!This file contains all GO annotations and gene product information for proteins in the UniProt KnowledgeBase (UniProtKB),
!ComplexPortal protein complexes, and RNAcentral identifiers.
!
!date-generated: 2023-05-18 13:57 
!generated-by: UniProt
!go-version: http://purl.obolibrary.org/obo/go/releases/2023-05-12/extensions/go-plus.owl
!
UniProtKB   A0A0L0PG05  obg enables GO:0005525  GO_REF:0000002  IEA InterPro:IPR006073|InterPro:IPR006074|InterPro:IPR014100|InterPro:IPR031167|InterPro:IPR045086  F   GTPase Obg  obg|ACM23_04330 protein taxon:210   20230507    InterPro        
UniProtKB   A0A0L0PG05  obg enables GO:0000287  GO_REF:0000002  IEA InterPro:IPR014100  F   GTPase Obg  obg|ACM23_04330 proteintaxon:210    20230507    InterPro        
UniProtKB   A0A0L0PG05  obg enables GO:0003924  GO_REF:0000002  IEA InterPro:IPR014100|InterPro:IPR045086   F   GTPase Obg  obg|ACM23_04330 protein taxon:210   20230507    InterPro        
UniProtKB   A0A0L0PG05  obg enables GO:0046872  GO_REF:0000043  IEA UniProtKB-KW:KW-0479    F   GTPase Obg  obg|ACM23_04330 proteintaxon:210    20230507    UniProt     
UniProtKB   A0A0L0PG05  obg located_in  GO:0005737  GO_REF:0000043  IEA UniProtKB-KW:KW-0963    C   GTPase Obg  obg|ACM23_04330protein  taxon:210   20230507    UniProt     
UniProtKB   A0A0L0PG05  obg enables GO:0000166  GO_REF:0000043  IEA UniProtKB-KW:KW-0547    F   GTPase Obg  obg|ACM23_04330 proteintaxon:210    20230507    UniProt     
UniProtKB   A0A0L0PG05  obg enables GO:0005525  GO_REF:0000043  IEA UniProtKB-KW:KW-0342    F   GTPase Obg  obg|ACM23_04330 proteintaxon:210    20230507    UniProt     
UniProtKB   A0A0L0PG05  obg enables GO:0016787  GO_REF:0000043  IEA UniProtKB-KW:KW-0378    F   GTPase Obg  obg|ACM23_04330 proteintaxon:210    20230507    UniProt     
UniProtKB   A0A0L0PG05  obg located_in  GO:0005737  GO_REF:0000044  IEA UniProtKB-SubCell:SL-0086   C   GTPase Obg  obg|ACM23_04330 protein taxon:210   20230507    UniProt     
UniProtKB   A0A0L0PG05  obg involved_in GO:0042254  GO_REF:0000104  IEA UniRule:UR001139430 P   GTPase Obg  obg|ACM23_04330protein  taxon:210   20230507    UniProt     
UniProtKB   A0A0L0PG05  obg involved_in GO:0042254  GO_REF:0000104  IEA UniRule:UR000078165 P   GTPase Obg  obg|ACM23_04330protein  taxon:210   20230507    UniProt     

The report file:

Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Usage:
java -Xmx2g -jar ahrd.jar input.yml

Started AHRD...

...initialised proteins in 1sec, currently occupying 63 MB
...parsed blast results in 430sec, currently occupying 5693 MB
...parsed reference Gene Ontology Annotations (GOA) in 836sec, currently occupying 7796 MB
...assigned highestest scoring human readable descriptions in 5sec, currently occupying 8288 MB
Writing output to './ahrd_output.csv'.
Wrote output in 0sec, currently occupying 8312 MB

DONE
I am finished!

I would appreciate your help. Thank you.

Rom

FlorianBoecker commented 1 year ago

Dear Rom,

thanks for all the details you provided.

In order to annotate proteins with InterPro descriptions two additional input files are needed: An XML file of the InterPro database and tab separated file with the results of an InterProScan run. You can find example file stubs in the AHRD/test/resources folder: interpro_31.xml and interpro_result.raw The corresponding YAML-Keys are "interpro_database" and "interpro_result".

It looks likes you provided everything necessary for the annotation of your proteins with GO terms. If AHRD doesn't annotate GO terms at al, usually the core of problem is a mismatch of protein identifiers between the BLAST results and the GAF file. AHRD expects something like "sp|D3ZZX1|I5P1_RAT" and extracts the accession ("D3ZZX1" in this example) to find the annotations in the GAF file. Because you have this id-format in you results it looks like it should work. Can you post a short part of your BLAST results anyway? Just so I can be sure?

In general we can also recommend to use our new wrapper for AHRD: https://github.com/groupschoof/AHRD_Snakemake You'll have to install some additional software but this makes it basically foolproof to run AHRD. It automatically downloads the necessary data, performs the sequence similarity searches and then builds and runs AHRD. If you have a large query file it might need a lot of memory (see the README for details) but considering you used "-Xmx30g" when you started AHRD this should not be a problem for you.

Let me know how you proceeded!

Cheers! Florian

romseg commented 1 year ago

Hi Florian,

Thanks so much for offering solutions. Below is a part of my BLAST results:

anno1.g20065.t1 sp|Q9SE42|RPE1_ORYSJ    67.647  34  11  0   40  73  8   41  2.40e-09    54.3
anno1.g20065.t1 sp|O14105|RPE_SCHPO 50.000  34  17  0   40  73  5   38  6.04e-06    45.1
anno1.g20065.t1 sp|P46969|RPE_YEAST 48.485  33  17  0   41  73  6   38  8.85e-06    44.7
anno1.g20065.t1 sp|Q6FL81|RPE_CANGA 45.455  33  18  0   41  73  6   38  3.28e-05    43.1
anno1.g20065.t1 sp|Q8VEE0|RPE_MOUSE 47.059  34  18  0   40  73  6   39  2.46e-04    40.8
anno1.g20065.t1 sp|Q5R5Y2|RPE_PONAB 47.059  34  18  0   40  73  6   39  2.72e-04    40.4
anno1.g20065.t1 sp|Q96AT9|RPE_HUMAN 47.059  34  18  0   40  73  6   39  2.89e-04    40.4
anno1.g20065.t1 sp|Q755M2|RPE_ASHGO 42.424  33  19  0   41  73  6   38  5.03e-04    39.7
anno1.g20065.t1 sp|Q2QD12|RPEL1_HUMAN   44.118  34  19  0   40  73  6   39  0.001   38.9
anno1.g20065.t1 sp|P51012|RPE_RHOCA 41.176  34  20  0   40  73  8   41  0.001   38.9
anno1.g20065.t1 sp|Q43843|RPE_SOLTU 36.364  33  21  0   41  73  59  91  0.004   37.4
anno1.g20065.t1 sp|P44756|RPE_HAEIN 39.394  33  20  0   41  73  6   38  0.004   37.4
anno1.g20065.t1 sp|P74061|RPE_SYNY3 36.364  33  21  0   41  73  7   39  0.005   37.0

Please I will be looking forward to pinpoint the problem of missing GO terms and fixing it using the regular version of AHRD. On my computing system I have restrictions on using Conda for installation and disk space limitations for singularity, which makes a bit difficult to use AHRD_Snakemate. Thanks again.

Best regards, Romulo

FlorianBoecker commented 1 year ago

Dear Romulo,

the problem is caused by a small change Uniprot made to their GAF file format. I will soon update AHRDs master branch accordingly. In the meantime you can include the following line in your yml file instead:

reference_go_regex: "^UniProtKB\\t(?<shortAccession>[^\\t]+)\\t[^\\t]+\\t(?!NOT\\|)[^\\t]*\\t(?<goTerm>GO:\\d{7})"

I tested it and I am pretty confident that it will work, but let me know if you had success anyway.

Cheers! Florian

romseg commented 1 year ago

Dear Florian,

You were right. I tried it again adding the reference_go_regex and it worked. This time I got the GO terms in my results file. Thanks a lot for your help.

One additional question, I want to add the TAIR10 Arabidopsis thaliana database to my analysis, but I am not sure where to get it from, in the proper format for AHRD. I would appreciate it having a link to the source. Thanks again.

Best regards, Romulo

FlorianBoecker commented 1 year ago

Dear Romulo,

I'm glad to hear that it worked.

At first glance it makes sense to add TAIR10 if you want to annotate plant proteins. It's probably the most comprehensively annotated plant proteome. But: At some point the database went proprietary and the last public version is very old by now (https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FProteins%2FTAIR10_protein_lists). And it is our understanding that all useful information from it, has long been incorporated into the UniProtKB. Additionally, it is likely that the annotations in the UniprotKB have been updated with discoveries made in the decade since the public TAIR10 was abandoned.

Best wishes, Florian

romseg commented 1 year ago

Dear Florian,

Yes, I am annotating plant proteins, that's why my interest on TAIR, and it's good to know that TAIR is also incorporated in the larger UniprotKB. However when I used UniprotKB (uniprot_sprot.fasta) as database the annotation results include matches to other species like human and rat alongside A. thaliana (please see first ten lines below). So I am wondering if having matches to no-plant species is Okay?

# AHRD-Version 3.11

Protein-Accession       Blast-Hit-Accession     AHRD-Quality-Code       Human-Readable-Description      Interpro-ID (Description)       Gene-Ontology-Term
anno2.g24355.t1 sp|D3ZZX1|I5P1_RAT      --*     Inositol polyphosphate-5-phosphatase A          GO:0003824, GO:0004445, GO:0005886, GO:0016020, GO:0016787, GO:0016791, GO:0030425, GO:0042731, GO:0042995, GO:0046856, GO:0048015, GO:0048016, GO:0052658, GO:0052659, GO:0052745, GO:1901215
anno1.g1823.t1  sp|Q9LTA2|AHL17_ARATH   ***     AT-hook motif nuclear-localized protein 17              GO:0003677, GO:0003680, GO:0003700, GO:0005634, GO:0006355
anno1.g27546.t1 sp|Q5R5Q2|SRRM1_PONAB   --*     Serine/arginine repetitive matrix protein 1             GO:0003677, GO:0003723, GO:0005634, GO:0005681, GO:0006397, GO:0008380
anno2.g13215.t1 sp|Q9H6B1|Z385D_HUMAN   --*     Zinc finger protein 385D                GO:0003676, GO:0005634, GO:0008270, GO:0046872, GO:1990837
anno2.g1283.t1  sp|Q9AVJ9|MXMT1_COFAR   ***     Monomethylxanthine methyltransferase 1          GO:0005737, GO:0008168, GO:0009820, GO:0016740, GO:0032259, GO:0046872
anno1.g17744.t1 sp|Q0AB61|MIAA_ALKEH    --*     tRNA dimethylallyltransferase           GO:0000166, GO:0005524, GO:0008033, GO:0016740, GO:0016765, GO:0052381
anno1.g13382.t1 sp|Q31DI1|PURL_PROM9    --*     Phosphoribosylformylglycinamidine synthase subunit PurL         GO:0000166, GO:0000287, GO:0004642, GO:0005524, GO:0005737, GO:0006164, GO:0006189, GO:0016874, GO:0046872
anno1.g15563.t1 sp|Q941R6|MLP31_ARATH   ***     MLP-like protein 31             GO:0006952
anno1.g9793.t1  sp|P86276|GDL1_CARPA    ***     GDSL esterase/lipase            GO:0005576, GO:0006629, GO:0016042, GO:0016787, GO:0016788
anno1.g30716.t1 sp|Q945Q1|CYT1_ARATH    --*     Cysteine proteinase inhibitor 1         GO:0000325, GO:0004869, GO:0005783, GO:0005829, GO:0005886, GO:0006952, GO:0009536, GO:0009611, GO:0010466, GO:0030414, GO:0034605, GO:0042631, GO:0070417

As a test, I also launched AHRD using the free old TAIR10 database (TAIR10_pep_20101214.fa) that you mentioned earlier. The annotation results (please see first ten lines below) obviously contain matches only to A. thaliana. I am wondering which annotations are better to use, the results obtained with the UniprotKB or TAIR10? I will appreciate it to have your opinion on this. I got 210 unknown proteins when using UniProtKB and 390 unknown proteins when using TAIR10. When using TAIR10 the GO-Ontology-Term field was empty.

# AHRD-Version 3.11

Protein-Accession       Blast-Hit-Accession     AHRD-Quality-Code       Human-Readable-Description      Interpro-ID (Description)       Gene-Ontology-Term
anno2.g24355.t1 AT5G57770.1     ***     Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region           
anno1.g1823.t1  AT1G63470.1     *-*     AT hook motif DNA-binding family protein                
anno1.g27546.t1                 Unknown protein         
anno2.g13215.t1 AT1G24256.1     --*     FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: cellular_component unknown; BEST Arabidopsis thaliana protein match is: Putative endonuclease or glyc s in 7 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 93; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink).             
anno2.g1283.t1  AT4G36470.1     ***     S-adenosyl-L-methionine-dependent methyltransferases superfamily protein                
anno1.g17744.t1 AT4G33840.1     --*     Glycosyl hydrolase family 10 protein            
anno1.g13382.t1 AT5G36228.1     --*     nucleic acid binding;zinc ion binding           
anno1.g15563.t1 AT5G28010.1     ***     Polyketide cyclase/dehydrase and lipid transport superfamily protein            
anno1.g9793.t1  AT1G71250.1     ***     GDSL-like Lipase/Acylhydrolase superfamily protein              
anno1.g30716.t1 AT5G05110.1     --*     Cystatin/monellin family protein                

Thank you, Romulo

FlorianBoecker commented 1 year ago

Dear Romulo,

in general, we don't consider matches to species from other kingdoms to be a problem. AHRD basically takes the most similar protein from the BLAST results that has a meaningful description that also is unlikely to have been transferred from another protein in the past. So maybe there have been plant proteins in the BLAST results for eg. "anno2.g24355.t1" but they were electronically annotated themselves or their sequence is less similar to your query than the rat protein. You must also consider that many genes involved in the basic metabolism are conserved between animals and plants. The animal versions of these genes can be better annotated than any plant version. Also, proteins often have multiple functions (so called "Moonlighting proteins"). Human readable descriptions are often unsuited to reflect this properly. In contrast sets of GO terms are able to reflect this by design. So for example a plant protein might be annotated with a description involving breast cancer. This looks very wrong on the surface. But the query protein might simply be involved in cell circle regulation and the GO annotation may make this much more apparent and more reasonable for a plant protein. So if something like this "breast cancer gene" is the most similar protein with a "non garbage" description it is actually more often a better source for an annotation than not. In principal AHRD is species agnostic but you can provide it with multiple databases and use the "weight" parameter to introduce a bias towards a specific one. So if you use swissprot and tair simultaneously you can make AHRD prefer tair over swissprot if other factors are similar. I think the question which annotations are better is easy to answer: The descriptions from TAIR are just not as good. For example for your query "anno2.g24355.t1" you get "Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region". It mentions another protein and domain but not what it does. "Inositol polyphosphate-5-phosphatase A" from "sp|D3ZZX1|I5P1_RAT" actually gives you information about its function. Take the annotations for "anno2.g13215.t1" as another example "Zinc finger protein 385D" vs. "FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: cellular_component unknown; BEST Arabidopsis thaliana protein match is: Putative endonuclease or glyc s in 7 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 93; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink)." I know which one I would prefer. The fact AHRD can annotate 180 proteins less when only using TAIR is another argument in favor of swissprot.

Ultimately these are all moot points if you can't provide GO annotations for the target protein to AHRD. You can download annotations for TAIR from their FTP server: "ftp://ftp.arabidopsis.org/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.txt.gz". But again: These are stuck in time at some point around a decade ago. AHRD expects the reference GO annotations to be in the GAF format provided by Uniprot. So you would have to change the regular expression provided to the "reference_go_regex" yml-key to match the file provided by TAIR.

Best wishes, Florian

romseg commented 1 year ago

Hi Florian,

Thanks for taking the time to give such comprehensive explanation. It is very clear that the results obtained using the Uniprot database are of better quality and more informative. There is no reason to use TAIR for my downstream analysis, but it served to help me understand the differences in the results. I have no more questions for now. Again thanks for your time and for creating AHRD that is being so useful on my work.

Best regards, Romulo

FlorianBoecker commented 1 year ago

Hi Romulo,

glad to hear that I was able to help you.

One more thing: The best way to use AHRD ist to use not only SwissProt but also TrEMBL. Admittedly annotations in TrEMBL are not as good as in Swissprot but TrEMBL can be used as a sort of fallback if SwissProt can't offer any matches at all. You just have to make sure TrEMBLs "weight" parameter is set lower than SwissProts so AHRD prioritizes them correctly. We measured the differences in terms of annotation quality and coverage and they're not huge but significant. The goa_uniprot_all.gaf already contains the GO annotations for the TrEMBL proteins (that's the reason it's so obscenely large). So you wouldn't have to add anything in this regard. But of course you would have to BLAST your query proteins in TrEMBL and provide AHRD with the fasta file of the database itself. There are ways to mitigate the impact of this in terms of storage space and computation time: Instead of TrEMBL you can use Uniref100, Uniref90 (our recommendation) or even Uniref50. They are significantly smaller because a lot of redundancy is omitted. And we also can recommend using DIAMOND (https://github.com/bbuchfink/diamond) instead of BLAST. The quality of its search results is basically indistinguishable from BLAST but it's orders of magnitude quicker. It also generates results in the exact same format as BLAST so it can be used as a drop-in replacement.

Cheers, Florian

romseg commented 1 year ago

Hi Florian,

Thanks a lot for this recommendation. Indeed, I downloaded the TrEMBL database to my system but decided not to use it because of its large size, but I would like to use Uniref90. What would be the right weight parameters to use? so that I prioritize SwissProt over TrEMBL. So far I have been using the default in the yml config file below:

swissprot:
weight: 653
description_score_bit_score_weight: 2.717061
trembl:
weight: 904
description_score_bit_score_weight: 2.590211

Thanks again.

Romulo

FlorianBoecker commented 1 year ago

Hi!

The higher the weight the greater the priority when choosing a candidate protein from the search results. So I would switch the weights around:

swissprot:
  weight: 904
  description_score_bit_score_weight: 2.717061
trembl:
  weight: 653
  description_score_bit_score_weight: 2.590211

Cheers!

romseg commented 1 year ago

Great! Thanks.

Best wishes, Romulo

romseg commented 1 year ago

Hi Florian,

Sorry but what would be the fasta_header_regex for the uniref90.fasta database? I realized that this database has different headers and when I launched AHRD with Uniref90 only as a test the following report error was returned at the end of the run:

Started AHRD...

...initialised proteins in 0sec, currently occupying 62 MB
WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'UniRef90_A0A0K9QXQ1' - continuing with the original accession. This might lead to unrecognized reference GO annotations!
WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'UniRef90_A0A6N2MDR4' - continuing with the original accession. This might lead to unrecognized reference GO annotations!
WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'UniRef90_A0A5K1B7Z6' - continuing with the original accession. This might lead to unrecognized reference GO annotations!

>UniRef90_K0JAJ8 Glycoprotein (Fragment) n=1 Tax=HIV-1 M:G_PT227 TaxID=1070697 RepID=K0JAJ8_9HIV1
does not match provided regular expression
^>(?<accession>\S+)\s+(?<description>.+?)\s+(((OS|os)=.+)|((GN|gn)=.+))?$
. The header and the following entry, including possibly respective matching BLAST Hits, are ignored and discarded.
To fix this, please use - Blast database specific - parameter fasta_header_regex to provide a regular expression that matches ALL FASTA headers in Blast database 'trembl'.

The uniref90.fasta database header look like this:

>UniRef90_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa tibetana TaxID=1572043 RepID=A0A5A9P0L4_9TELE
MEEITQIKKRLSQTVRLEGKEDLLSKKDSITNLKTEEHVSVKKMVISEPKPEKKEDIQLK

The yml file I used to launch AHRD:

proteins_fasta: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
gene_ontology_result: /home/rsegovia/scratch/rseg/databases/uniprot/goa_uniprot_all.gaf
prefer_reference_with_go_annos: true
output: ./ahrd_output.csv
reference_go_regex: "^UniProtKB\\t(?<shortAccession>[^\\t]+)\\t[^\\t]+\\t(?!NOT\\|)[^\\t]*\\t(?<goTerm>GO:\\d{7})"
blast_dbs:
  trembl:
    weight: 904
    description_score_bit_score_weight: 2.590211
    file: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa_vs_uniref90.tsv
    database: /home/rsegovia/scratch/rseg/databases/uniprot/uniref90.fasta
    blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_descline.txt
    filter: /home/rsegovia/software/AHRD/test/resources/filter_descline_trembl.txt
    token_blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_token.txt

I used Diamond for alignment and the file looks like this:

anno1.g20065.t1 UniRef90_A0A0K9QXQ1 79.4    34  7   0   40  73  6   39  1.14e-10    60.5
anno1.g20065.t1 UniRef90_A0A6N2MDR4 79.4    34  7   0   40  73  6   39  1.54e-10    60.5
anno1.g20065.t1 UniRef90_A0A5K1B7Z6 76.5    34  8   0   40  73  5   38  3.71e-10    58.9
anno1.g20065.t1 UniRef90_A0A834Z5V1 66.7    39  13  0   40  78  4   42  3.77e-10    59.7

Thank you.

Romulo

FlorianBoecker commented 1 year ago

Hi!

Yes, Uniref header lines have a different format. I forgot about that, sorry.

Try using short_accession_regex:"^[^_]+_(?<shortAccession>[^_]+)$" as a sub yml key under your trembl: key.

Cheers!

romseg commented 1 year ago

Hi Florian,

Still it didn't work when using Uniref90 database. I have the below error report. It looks a bit different than before in that the WARNING: Regular Expression" portion is not there anymore:

Started AHRD...

...initialised proteins in 1sec, currently occupying 62 MB
WARNING: FASTA header line
>UniRef90_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa tibetana TaxID=1572043 RepID=A0A5A9P0L4_9TELE
does not match provided regular expression
^>(?<accession>\S+)\s+(?<description>.+?)\s+(((OS|os)=.+)|((GN|gn)=.+))?$
. The header and the following entry, including possibly respective matching BLAST Hits, are ignored and discarded.
To fix this, please use - Blast database specific - parameter fasta_header_regex to provide a regular expression that matches ALL FASTA headers in Blast database 'trembl'.
WARNING: FASTA header line
>UniRef90_A0A410P257 Glycogen synthase n=2 Tax=Candidatus Velamenicoccus archaeovorus TaxID=1930593 RepID=A0A410P257_9BACT
does not match provided regular expression
^>(?<accession>\S+)\s+(?<description>.+?)\s+(((OS|os)=.+)|((GN|gn)=.+))?$
. The header and the following entry, including possibly respective matching BLAST Hits, are ignored and discarded.
To fix this, please use - Blast database specific - parameter fasta_header_regex to provide a regular expression that matches ALL FASTA headers in Blast database 'trembl'.

The yml file I used looks like this:

proteins_fasta: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
gene_ontology_result: /home/rsegovia/scratch/rseg/databases/uniprot/goa_uniprot_all.gaf
prefer_reference_with_go_annos: true
output: ./ahrd_output.csv
reference_go_regex: "^UniProtKB\\t(?<shortAccession>[^\\t]+)\\t[^\\t]+\\t(?!NOT\\|)[^\\t]*\\t(?<goTerm>GO:\\d{7})"
blast_dbs:
  trembl:
    weight: 904
    description_score_bit_score_weight: 2.590211
    file: ./uthap1_braker1+2_tsebra2_fixed_flagOrfSizeCompl3.aa.fa_vs_uniref90.tsv
    database: /home/rsegovia/scratch/rseg/databases/uniprot/uniref90.fasta
    blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_descline.txt
    filter: /home/rsegovia/software/AHRD/test/resources/filter_descline_trembl.txt
    token_blacklist: /home/rsegovia/software/AHRD/test/resources/blacklist_token.txt
    short_accession_regex: "^[^_]+_(?<shortAccession>[^_]+)$"

The resulting csv file:

# AHRD-Version 3.11

Protein-Accession   Blast-Hit-Accession AHRD-Quality-Code   Human-Readable-Description  Interpro-ID (Description)   Gene-Ontology-Term
anno2.g24355.t1         Unknown protein     
anno1.g1823.t1          Unknown protein     
anno1.g27546.t1         Unknown protein     

The uniref90.fasta database header looks like in my previous post. I would appreciate it so much your help. Thank you.

Romulo

FlorianBoecker commented 1 year ago

Dear Romulo,

sorry for the enormous delay in my response. I was on summer vacation.

The problem is basically laid out by AHRD in its error messages: The regular expression doesn't match the fasta header lines in uniref. Add fasta_header_regex: ^>(?<accession>\S+)\s+(?<description>.+?)\s+n=\d.+$ again as a sub yml-key under your trembl: key.

Hope this helps and is still useful to you, Florian

romseg commented 1 year ago

Dear Florian,

Thanks for providing the regex for the Uniref90 database. For now I am using the SwissProt for my downstream analysis, but I will test and switch to Uniref90 as soon as I have the chance, and it's great to have already the necessary information for the script to work. Thanks again for your kind assistance.

Best regards, Romulo