groupschoof / AHRD

High throughput protein function annotation with Human Readable Description (HRDs) and Gene Ontology (GO) Terms.
https://www.cropbio.uni-bonn.de/

Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match #15

Closed pezhmansafdari closed 6 years ago

pezhmansafdari commented 6 years ago

Hi, I am trying to run AHRD in GO mode and I keep receiving the following error:

WARNING: Regular Expression '^[^|]+\|(?<shortAccession>[^|]+)' does NOT match - using pattern.find(...) - Blast Hit Accession 'Q8M985' - continuing with the original accession. This might lead to unrecognized reference GO annotations!

I have downloaded the GO DB from UniProt and run blastp version 6. Could you please let me know what the problem is? BR, Pezhman

pezhmansafdari commented 6 years ago

Hi, I have found the source of the problem. The second column of my BLAST file contains bare accessions such as A0A1S3YUK8 instead of full identifiers such as tr|A0A177VA33|A0A177VA33_9BASI. Is there any way that I can correct for this? BR, Pezhman
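The mismatch can be reproduced outside AHRD with the pattern from the issue title. The following is only a sketch in Python, not AHRD's own code: AHRD is a Java tool, and Python spells the named group as `(?P<shortAccession>...)` rather than `(?<shortAccession>...)`. The helper name `short_accession` is made up for illustration.

```python
import re

# Same pattern as in the warning, with Python's named-group syntax (?P<name>...).
SHORT_ACCESSION = re.compile(r"^[^|]+\|(?P<shortAccession>[^|]+)")


def short_accession(hit_id):
    """Return the text between the first two pipes, or None if the pattern does not match."""
    match = SHORT_ACCESSION.search(hit_id)
    return match.group("shortAccession") if match else None


# Full UniProt-style identifier: the pattern matches and yields the accession.
print(short_accession("tr|A0A177VA33|A0A177VA33_9BASI"))  # A0A177VA33

# Bare accession: there is no pipe to match, so the pattern fails.
print(short_accession("A0A1S3YUK8"))  # None
```

The pattern requires at least one `|` after a leading field, so a hit ID that is already a bare accession can never match it.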

lucventurini commented 6 years ago

Hi @pezhmansafdari, did you by any chance modify the downloaded UniProt file, e.g. by parsing it with a script? That could explain the error, and if that is the case, you could just rerun the BLAST against the unmodified file.

Otherwise, a conceptually simple (but somewhat laborious) solution would be to correct the BLAST file by substituting the values in the second column with the corresponding full identifiers from the original UniProt file. Unfortunately, you would have to write such a script yourself; I doubt there is an off-the-shelf utility for it.
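A minimal sketch of such a script, under several assumptions: the BLAST output is tab-separated with the subject ID in the second column, the UniProt FASTA headers look like `>tr|A0A177VA33|A0A177VA33_9BASI ...`, and the bare accession in the BLAST file equals the middle field of that header. The function names and the sample rows are invented for illustration and are not part of AHRD.

```python
def accession_map(fasta_lines):
    """Map bare accession -> full 'db|accession|entry_name' ID, read from FASTA headers."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # empty header
        full_id = tokens[0]
        parts = full_id.split("|")
        if len(parts) >= 3:
            mapping[parts[1]] = full_id
    return mapping


def restore_subject_ids(blast_lines, mapping):
    """Yield tabular BLAST lines with column 2 replaced by the full ID when it is known."""
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            # Unknown accessions are left untouched rather than dropped.
            fields[1] = mapping.get(fields[1], fields[1])
        yield "\t".join(fields)


# Made-up sample data; with real files, pass open file handles instead of lists.
fasta = [">tr|A0A177VA33|A0A177VA33_9BASI Example description\n", "MKVLA\n"]
blast = ["query1\tA0A177VA33\t98.5\n", "query2\tUNKNOWN\t40.0\n"]

for fixed in restore_subject_ids(blast, accession_map(fasta)):
    print(fixed)
```

Building the dictionary once and streaming the BLAST file line by line keeps memory use proportional to the number of FASTA headers rather than to the size of the BLAST output.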

pezhmansafdari commented 6 years ago

Hi, I realized this is because I had used -parse_seqids when creating the BLAST DB. With that option, only the bare IDs end up in the second column. In addition, when I try to create a DB for the whole UniProt DB without the -parse_seqids option, it throws a duplicated seqids error.