gogepp / genoboo

A collaborative notebook for genes and genomes (fork of GeneNoteBook for GOGEPP usage)
http://genenotebook.github.io/
GNU Affero General Public License v3.0

Eggnog, interpro and diamond annotations not imported #45

Closed: loraine-gueguen closed this issue 11 months ago

loraine-gueguen commented 1 year ago

Importing functional annotations (eggnog, interpro, diamond) does not work with the following data.

In fixed.gff:

VRMN01000001.1  Genbank gene    42      3256    .       +       .       ID=gene-FVE85_5056;Name=FVE85_5056;Note=POR7520..scf295_1;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FVE85_5056
VRMN01000001.1  Genbank mRNA    42      3256    .       +       .       ID=rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056;Name=rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056;Parent=gene-FVE85_5056;gbkey=mRNA;locus_tag=FVE85_5056;orig_protein_id=gnl%7CWGS:VRMN%7CFVE85_5056;orig_transcript_id=gnl%7CWGS:VRMN%7Cmrna.FVE85_5056;product=hypothetical protein

In protein fasta file used to generate functional annotations:

>rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056
LLRHLAGSNHSAETLAAVPPDALRAVGLARRACGAMCLLPALSVRHQRSRCRDPARCAVAFDGQGGRVQS
AQQAGASGACDDTRSGSDTAADTHVSTARHAAAHPPSPLLDAPILTILQTVDWVLGLRVRTWKRVPAKFA
PDVALEFSSMLTELAEAASEAAQVRALGKLWVFPTLVLCLPMERQSTRARARYLATRLKMWRSDALEPLL
DSVPVVDGQHLRPIPPEAVEHRIVQHVRANHIGAAARLVESAGVHDVTDSVLARLRELHPEAGARSGSHD

In result.emapper.annotations.tsv:

TE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056    400682.PAC_15702057     1.76e-39        166.0   KOG1075@1|root,KOG1075@2759|Eukaryota   2759|Eukaryota  E       Ribonuclease H protein  -       -       -       -       -
       -       -       -       -       -       -       -       RVT_1,zf-RVT

In merged_iprscan.tsv:

rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056    9e2aacc582bf0f0f7700d1389fd1d141        1002    Pfam    PF00078 Reverse transcriptase (RNA-dependent DNA polymerase)    452     532     7.8E-9  T       28-09-2023      IPR000477       Reverse transcriptase domain

With the following regex:

      re_protein: '\$1'
      re_protein_capture: "^(.*)$"
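
For reference, a minimal Python sketch of what this pair of settings appears to do, assuming `re_protein_capture` is the capture pattern and `re_protein` is the replacement template (the `$1` group reference is written `\1` in Python). The function name is hypothetical and this is not genoboo's own code. With these values the ID passes through unchanged, so the regex itself should not be what breaks the match:

```python
import re

# Values copied from the configuration above; the constant names are illustrative.
RE_PROTEIN_CAPTURE = r"^(.*)$"
RE_PROTEIN = r"\1"  # Python spelling of the "$1" replacement template


def apply_protein_regex(annotation_id: str) -> str:
    """Rewrite an ID using the capture pattern and the replacement template."""
    return re.sub(RE_PROTEIN_CAPTURE, RE_PROTEIN, annotation_id, count=1)


raw_id = "rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056"
# The pattern captures the whole ID and the template returns it as-is.
print(apply_protein_regex(raw_id) == raw_id)
```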

Logs generated (https://gitlab.sb-roscoff.fr/abims/e-infra/gga_workflow/-/jobs/27725):

## LOG:   2023-10-05T14:57:01.255Z Server method addGenome succesfully inserted 73 elements
## LOG:   2023-10-05T14:57:01.851Z Established connection to ws://localhost:7000/websocket
## LOG:   2023-10-05T15:00:40.588Z Server method addAnnotation succesfully inserted 9898 elements
## LOG:   2023-10-05T15:00:41.015Z Established connection to ws://localhost:7000/websocket
## LOG:   2023-10-05T15:00:58.900Z Server method addInterproscan succesfully inserted undefined elements
## LOG:   2023-10-05T15:00:59.314Z Established connection to ws://localhost:7000/websocket
## LOG:   2023-10-05T15:01:12.244Z Server method addEggnog succesfully inserted 0 elements
## LOG:   2023-10-05T15:01:12.645Z Established connection to ws://localhost:7000/websocket
## LOG:   2023-10-05T15:04:25.760Z Server method addSimilarSequence succesfully inserted 0 elements
[...]
## WARNING: 2023-10-05T15:01:00.062Z
Warning ! rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056 eggnog annotation did
not find a matching protein domain in the genes database.
rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056 is not added to the eggnog database.
[...]

Genenotebook page: image

Protein ID in GNB: image

loraine-gueguen commented 1 year ago

Could the Mongo search be decoding the ID, so that it no longer matches the percent-encoded ID (`%7C`) used in the annotation files?

mboudet commented 11 months ago

Added ID decoding to the functional annotation import.
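
To illustrate the idea behind this fix, here is a small Python sketch (not the actual genoboo change). It assumes the genes database holds the decoded form of the ID, which is what the comments above suggest; `stored_ids` and `find_stored_id` are hypothetical names used only for this example. Percent-decoding the ID from the annotation file before the lookup turns `%7C` back into `|`, so the two forms can match:

```python
from urllib.parse import unquote

# Hypothetical stand-in for the protein IDs in the genes database,
# assumed here to be stored in decoded form.
stored_ids = {"rna-gnl|WGS:VRMN|mrna.FVE85_5056"}


def find_stored_id(annotation_id, known_ids):
    """Return the matching stored ID after percent-decoding, or None."""
    decoded = unquote(annotation_id)  # "%7C" becomes "|"
    return decoded if decoded in known_ids else None


encoded_id = "rna-gnl%7CWGS:VRMN%7Cmrna.FVE85_5056"
print(encoded_id in stored_ids)                # direct lookup with the encoded ID
print(find_stored_id(encoded_id, stored_ids))  # lookup after decoding
```

IDs that contain no percent-escapes pass through `unquote` unchanged, so decoding before the lookup should be harmless for files that already use plain IDs.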