pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Filter redundant annotations from the same publication #1189

Closed: kimrutherford closed this issue 3 months ago

kimrutherford commented 4 months ago

From @ValWood:

I noticed that quite a few are exact duplicates, which looks odd. Can we filter an annotation if it is exactly the same as, or less specific than, another annotation from the same publication?

exact duplicate:

SPAC1834.07  klp3  GO:0005881  PMID:10641037  IDA
SPAC1851.03  ckb1  GO:0090053  PMID:19136623  IMP
SPAC25G10.07c  cut7  GO:0008574  PMID:27834216  IDA
SPCC736.04c  gma12  GO:0000139  PMID:7522655  IDA
SPBC887.14c  pfh1  GO:0031297  PMID:27611590  IMP
SPCC24B10.07  gad8  GO:0005737  PMID:26912660  IDA
SPBC216.07c  tor2  GO:0005634  PMID:26912660  IDA
SPAC926.04c  hsp90  GO:0016887  PMID:23664927  IDA

have more specific annotation from the same paper:

SPCC736.08  cbf11  GO:0003700  PMID:23555033  IMP
SPCC4G3.11  nur1  GO:0034506  PMID:27451393  IDA

kimrutherford commented 4 months ago

What sort of priority is this issue? It might take an hour or two to do each part - dealing with extensions slows things down.

I think this filtering can be implemented by extending the existing GO filtering code.

ValWood commented 4 months ago

Not high. Actually, I thought we already did this with UniProt annotations.

kimrutherford commented 3 months ago

> Not high. Actually, I thought we already did this with UniProt annotations.

I thought so too but I couldn't remember the details. I've found the original issue now:

The summary is that we run a process that removes annotations assigned by UniProt where there is an identical PomBase annotation. We then filter IntAct annotations where there is a PomBase annotation.

The log file: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-09/logs/log.2024-08-08-21-04-19.go-filter-uniprot-duplicates
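For illustration only, the duplicate filtering described above could look something like the sketch below. This is not the pombase-chado code. The `Annotation` record, the `assigned_by` field name, and the choice of (gene, term, reference, evidence) as the identity key are all assumptions made for this example, and annotation extensions are ignored entirely.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    # Hypothetical record layout, not the real Chado schema.
    gene: str
    term: str
    reference: str
    evidence: str
    assigned_by: str


def filter_duplicates(annotations, preferred="PomBase",
                      filtered_sources=("UniProt", "IntAct")):
    """Split annotations into (kept, removed).

    An annotation is removed when it comes from one of filtered_sources
    and an annotation from the preferred source has the same gene, term,
    reference and evidence code (an assumed definition of "identical").
    """
    # The first four fields form the identity key; assigned_by is excluded.
    preferred_keys = {a[:4] for a in annotations if a.assigned_by == preferred}

    kept, removed = [], []
    for a in annotations:
        if a.assigned_by in filtered_sources and a[:4] in preferred_keys:
            removed.append(a)
        else:
            kept.append(a)
    return kept, removed


# Placeholder data: the identifiers below are invented for the example.
example = [
    Annotation("gene-A", "term-X", "PMID:0", "IDA", "PomBase"),
    Annotation("gene-A", "term-X", "PMID:0", "IDA", "UniProt"),
    Annotation("gene-B", "term-X", "PMID:0", "IDA", "UniProt"),
]
kept, removed = filter_duplicates(example)
```

Here only the UniProt copy of the `gene-A` annotation is removed. The `gene-B` annotation stays because no PomBase annotation shares its key. Filtering another source would, under these assumptions, just mean passing a different `filtered_sources` value.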

I had a look at your list of exact duplicate examples and most are assigned by CACAO. I can add a filter for them straight away as that process already exists. I just need to add it to the load script. Filtering for less specific annotations will need a code change.
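As a rough sketch of what filtering less specific annotations might involve (again, not the actual implementation): an annotation is dropped when the same gene has an annotation to a descendant term from the same publication. The `parents` map (term to its direct parent terms), the tuple layout, and the decision to also require the same evidence code are assumptions for this example. Extensions are not handled.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return all strict ancestors of term.

    parents is a hypothetical dict mapping a term to its direct parents.
    """
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_less_specific(annotations, parents):
    """Drop annotations that are less specific than another annotation
    for the same gene, reference and evidence code.

    Each annotation is a (gene, term, reference, evidence) tuple.
    """
    # Collect the terms used for each (gene, reference, evidence) group.
    terms_by_group = defaultdict(set)
    for gene, term, reference, evidence in annotations:
        terms_by_group[(gene, reference, evidence)].add(term)

    kept = []
    for annotation in annotations:
        gene, term, reference, evidence = annotation
        others = terms_by_group[(gene, reference, evidence)] - {term}
        # Less specific means: term is an ancestor of another term in the group.
        if any(term in ancestors(other, parents) for other in others):
            continue
        kept.append(annotation)
    return kept


# Placeholder ontology and annotations, invented for the example.
toy_parents = {"term-child": {"term-parent"}, "term-parent": {"term-root"}}
toy_annotations = [
    ("gene-A", "term-parent", "PMID:0", "IDA"),
    ("gene-A", "term-child", "PMID:0", "IDA"),
    ("gene-A", "term-root", "PMID:1", "IDA"),
    ("gene-B", "term-parent", "PMID:0", "IDA"),
]
kept_specific = filter_less_specific(toy_annotations, toy_parents)
```

In the toy data only the `gene-A` annotation to `term-parent` from `PMID:0` is dropped, because `term-child` is annotated for the same gene from the same paper. The `term-root` annotation survives because it comes from a different publication.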

kimrutherford commented 3 months ago

I just ran the CACAO filtering process on my desktop. It found 8 duplicates.

This one from your list doesn't have a duplicate anymore:

SPAC1851.03  ckb1  GO:0090053  PMID:19136623  IMP

It found one annotation that wasn't on your list:

meiotic spindle pole body  GO:0035974  SPCC1183.12.1  Inferred from Direct Assay  PMID:27630265

kimrutherford commented 3 months ago

I went ahead and added a filter for CACAO annotations that are duplicates of PomBase annotations. That has removed all the exact duplicates you listed.