jorvis / Attributor

Generate gene annotation from a wide variety of evidence sources
Apache License 2.0

Problematic annotation assignments #3

Open kifeonu opened 7 years ago

kifeonu commented 7 years ago

1: All annotations that match rapsearch2uniref100trusted_full_partial are being annotated as 'hypothetical protein domain protein', and the gene symbol is set to 'None'. Where rapsearch2uniref100trusted_full_partial is defined as:

2: class:trusted may not be pulling actual 'trusted' matches. Example:

Set product name to 'hypothetical protein domain protein' from rapsearch2uniref100trusted_full_partial hit to UniRef100_UPI00037D6DCF hypothetical protein n=1 Tax=Brevibacillus laterosporus RepID=UPI00037D6DCF

A 'hypothetical protein' hit shouldn't be 'trusted'.

3: Set default GO annotation to GO:0008150,GO:0003674,GO:0005575

4: It doesn’t seem like gene symbols are added

5: Lowercase the beginning of all names (except abbreviations)

6: No matches using 'rapsearch2uniref100trusted_full_full' and 'rapsearch2uniref100trusted_full_partial'. Could it be that percent_identity_cutoff: 40% is limiting all hits?

7: Post-assignment name processing

…family protein family protein => …family protein
…family transporter protein family protein => …transporter family protein
…family family protein => …family protein
…domain family protein => …domain protein
…domain domain protein => …domain protein
…Protein family protein => …family protein
…protein domain protein => …domain protein
Domain of Unknown Function… => “conserved hypothetical protein”

Possibly incorporate the rules in /usr/local/projects/ergatis/package-latest/bin/curate_common_names.pl
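The rewrite rules in item 7 could be sketched roughly as below. This is only an illustration under stated assumptions, not Attributor's or biocode's actual code: the names `SUFFIX_RULES` and `clean_product_name` are hypothetical, the rules are assumed to apply only at the end of a product name, and the first matching rule wins.

```python
import re

# Hypothetical rule table built from item 7. Each pattern is anchored at the
# end of the name; more specific suffixes are listed first so they win.
SUFFIX_RULES = [
    (r"\bfamily transporter protein family protein$", "transporter family protein"),
    (r"\bfamily protein family protein$", "family protein"),
    (r"\bfamily family protein$", "family protein"),
    (r"\bdomain family protein$", "domain protein"),
    (r"\bdomain domain protein$", "domain protein"),
    (r"\b[Pp]rotein family protein$", "family protein"),
    (r"\bprotein domain protein$", "domain protein"),
]


def clean_product_name(name: str) -> str:
    """Apply the item 7 clean-up rules to a single product name."""
    name = name.strip()
    # "Domain of Unknown Function..." names are replaced wholesale.
    if re.match(r"domain of unknown function", name, flags=re.IGNORECASE):
        return "conserved hypothetical protein"
    # Otherwise apply the first suffix rule that matches, if any.
    for pattern, replacement in SUFFIX_RULES:
        new_name, count = re.subn(pattern, replacement, name)
        if count:
            return new_name
    return name


if __name__ == "__main__":
    for example in (
        "filamentous hemagglutinin family N-terminal domain domain protein",
        "Methyltransferase FkbM domain family protein",
    ):
        print(example, "=>", clean_product_name(example))
```

Keeping the rules in an ordered table makes it easy to add further patterns from curate_common_names.pl later; the ordering matters because several suffixes overlap.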

nsuvarnaiari commented 7 years ago

I ran the functional annotation pipeline on protein sequences and then ran Attributor to incorporate the annotations into the headers of the pep FASTA file. I noticed issues 4, 5 and 7 (from Kemi's previous post) in my final output.

I had to edit the product names manually to correct issue 7.

E.g.:

filamentous hemagglutinin family N-terminal domain domain protein
Methyltransferase FkbM domain family protein
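For issue 5 (lowercasing the beginning of names except abbreviations), one possible heuristic is sketched below. It is an assumption rather than anything Attributor is known to do: the function name `lowercase_leading_word` is hypothetical, and a first word is treated as an abbreviation or symbol whenever it contains an uppercase letter or digit after its first character.

```python
def lowercase_leading_word(name: str) -> str:
    """Lowercase the first letter of a product name unless the first word
    looks like an abbreviation or gene symbol (e.g. ABC, FkbM, DNA-binding)."""
    name = name.strip()
    words = name.split(maxsplit=1)
    if not words:
        return name
    first_word = words[0]
    # Heuristic (assumption): any uppercase letter or digit after the first
    # character marks the word as an abbreviation or symbol, so leave it alone.
    if any(c.isupper() or c.isdigit() for c in first_word[1:]):
        return name
    return name[0].lower() + name[1:]


if __name__ == "__main__":
    for example in (
        "Methyltransferase FkbM domain protein",
        "ABC transporter family protein",
        "FkbM family protein",
    ):
        print(example, "=>", lowercase_leading_word(example))
```

A heuristic like this would also lowercase proper nouns at the start of a name (for example "Holliday junction ..."), so an exception list would probably be needed in practice.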

/local/projects/aengine/organisms/Herve_pep_reannotation/pep_reannotation.faa

jorvis commented 7 years ago

Looking at this while I have some free time on vacation. Can you send the path to Herve's full polypeptide fasta file?

@kabolude, thanks for the link to the curation script. Do you still have your source polypeptide fasta file I could test?

nsuvarnaiari commented 7 years ago

Hi Josh,

Here is the path to Herve’s polypeptide file: /local/projects/aengine/organisms/Herve_pep_reannotation/ORFs_inTables_toReBlast.pep

Just letting you know: these protein sequences are from GenBank, so there is an NCBI annotation for each polypeptide in the header. Herve wanted them reannotated.

Thanks, Suvvi

(Sent by email in reply to Joshua Orvis's comment of Wednesday, July 12, 2017, 6:03 PM.)


jorvis commented 7 years ago

As an update, most of the rules from curate_common_names.pl have now been integrated as a method in the biocode.annotation module.

https://github.com/jorvis/biocode/commit/c96bd818eb678735a0030705d2d6f8e5c67f3c87