Every result is "Unknown protein"

margaretc-ho commented 1 year ago

Hi there, I am trying to run AHRD on my blastp data but all I get is results like this (file is 23,000 transcripts long)

Protein-Accession Blast-Hit-Accession AHRD-Quality-Code Human-Readable-Description Interpro-ID (Description) Gene-Ontology-Term TCONS_00001348.p22 Unknown protein TCONS_00001348.p23 Unknown protein TCONS_00001348.p20 Unknown protein TCONS_00001348.p21 Unknown protein TCONS_00001348.p26 Unknown protein TCONS_00001348.p27 Unknown protein TCONS_00001348.p24 Unknown protein TCONS_00001348.p25 Unknown protein TCONS_00035809.p1 Unknown protein TCONS_00054329.p1 Unknown protein TCONS_00054329.p2 Unknown protein TCONS_00006009.p1 Unknown protein TCONS_00006009.p2 Unknown protein TCONS_00031688.p1 Unknown protein TCONS_00031688.p2 Unknown protein TCONS_00033869.p1 Unknown protein TCONS_00001348.p11 Unknown protein

This is how my .yml instruction file looks like:

proteins_fasta: ./agam/longest_orfs.pep [this was generated with transdecoder] token_score_bit_score_weight: 0.468 token_score_database_score_weight: 0.2098 token_score_overlap_score_weight: 0.3221 output: ./agam/ahrd_agam_novel_output.csv blast_dbs: swissprot: weight: 653 description_score_bit_score_weight: 2.717061 file: ./agam/blastp.outfmt6 database: ./agam/uniprot_sprot.fasta blacklist: ./test/resources/blacklist_descline.txt filter: ./test/resources/filter_descline_sprot.txt token_blacklist: ./test/resources/blacklist_token.txt

My data was originally generated using the trinotate annotation pipeline but I am trying to get nice human-readable gene descriptions. There are lots of blastp hits from blastp.outfmt6 file that are also in the uniprot_sprot.fasta file so I don't know why they are not showing up in the final output. Perhaps there is some issue with the weights but I cannot figure it out from reading the documentation.

The command I am running is: java -Xmx2g -jar dist/ahrd.jar agam/ahrd_example_input_agam_novel.yml -outfmt 6

Any help would be much appreciated and I can provide all the files via dropbox link if needed. Thanks

FlorianBoecker commented 1 year ago

Hi Margaret,

if literally ALL results are "Unknown protein", the cause is often a mismatch of the protein ID formats in the BLAST results (file: ./agam/blastp.outfmt6) and the database (database: ./agam/uniprot_sprot.fasta). For example you might have "sp|Q6GZX4|001R_FRG3G" in the database but just Q6GZX4 in your BLAST results. If you post the first ~50 lines of your BLAST results and your database file I could have a look at the ID formats.

The weights shouldn't be an issue. They are important to choose a BLAST result for annotation (especially relevant if you supply results from multiple databases). But if any result makes it through the blacklisting and filtering AHRD can't annotate "Unknown protein". No matter the weights one search result must be used to annotate the query.

You used "-outfmt 6" in the command line to call java. The format BLAST produces if you call it with "-outfmt 6" is the default for AHRD to read in. If you have BLAST results in a different format you can "tell" AHRD in the yml file with the "seq_sim_searchtable*" options (see https://github.com/groupschoof/AHRD#331-parameters-controlling-the-parsing-of-tabular-sequence-similarity-search-result-tables-legacy-blast-blast-and-blat). Java will probably not recognize "-outfmt" and give you an error or warning.

I hope this clears up some of the confusion. Let me know if you need further help.

Best, Florian

margaretc-ho commented 1 year ago

Hi Florian,

Your instinct is right about the headers

blastp.outfmt6 file (which does look like blast output format 6) appears as:

and db uniprot_sprot.fasta appears as:

However, I am still confused about how I can make AHDR recognize the BLAST output file properly, since it does look like blast output format 6. Should I modify the database file?

Thank you for your patience and help

FlorianBoecker commented 1 year ago

Hi Margaret,

AHRD recognizes the BLAST output file just fine. Outfmt 6 is in fact the expected file format. So there is no need to change anything about that. It's just that the protein IDs in the second column have been shortened to only contain the UniProtKB/Swiss-Prot entry name (https://www.uniprot.org/help/entry_name). In contrast your database file contains dabase name, accession (https://www.uniprot.org/help/accession_numbers) and the entry name. This mismatch might have it's origin from using the "parse_seqids" flag in the makeblastdb command or the "parse_deflines" flag in the blastp command.

You could fix this by changing the database file but it's easier and faster to customize the regular expression AHRD uses to read in the fasta headers. Because this might be different for different databases (if you use more than one) you have to specify it under the database:

blast_dbs:
  swissprot:
    fasta_header_regex: "^>[^|]+\\|[^|]+\\|(?<accession>\\S+)\\s+(?<description>.+?)\\s+(((OS|os)=.+)|((GN|gn)=.+))?$"

So your .yml should look like this: yml

Let me know if it works!

Best, Florian

P.S. Don't forget to use spaces instead of tabs in the .yml

margaretc-ho commented 1 year ago

Yay! It works, thank you so much! :)

groupschoof / AHRD

Every result is "Unknown protein" #21