Closed by kimrutherford 3 months ago
What sort of priority is this issue? It might take an hour or two to do each part - dealing with extensions slows things down.
I think this filtering can be implemented by extending the existing GO filtering code.
Not high. Actually, I thought we already did this with UniProt annotations.
I thought so too but I couldn't remember the details. I've found the original issue now:
The summary is that we run a process that removes annotations assigned by UniProt where there is an identical PomBase annotation. We then filter out IntAct annotations where there is a PomBase annotation.
The log file: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-09/logs/log.2024-08-08-21-04-19.go-filter-uniprot-duplicates
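For readers following along, here is a minimal sketch of that kind of duplicate filter. It is not the actual PomBase load code. The `Annotation` record, the `filter_duplicates` function, and the choice of (gene, term, reference, evidence) as the "identical" key are all assumptions made for illustration, and extensions are ignored. The example rows reuse identifiers from this thread, but the `assigned_by` values are illustrative only.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    """Hypothetical annotation record; the field names are assumptions."""
    gene: str
    term: str
    reference: str
    evidence: str
    assigned_by: str


def annotation_key(a: Annotation) -> Tuple[str, str, str, str]:
    # Assumed definition of "identical": same gene, term, reference and
    # evidence code. Annotation extensions are not considered here.
    return (a.gene, a.term, a.reference, a.evidence)


def filter_duplicates(
    annotations: Iterable[Annotation],
    preferred: str = "PomBase",
    filtered_sources: Iterable[str] = ("UniProt",),
) -> Tuple[List[Annotation], List[Annotation]]:
    """Split annotations into (kept, removed).

    An annotation is removed when it comes from one of filtered_sources
    and an annotation with the same key exists from the preferred source.
    """
    annotations = list(annotations)
    sources = set(filtered_sources)
    preferred_keys = {
        annotation_key(a) for a in annotations if a.assigned_by == preferred
    }
    kept: List[Annotation] = []
    removed: List[Annotation] = []
    for a in annotations:
        if a.assigned_by in sources and annotation_key(a) in preferred_keys:
            removed.append(a)
        else:
            kept.append(a)
    return kept, removed


# Toy input; the assigned_by values are illustrative only.
annotations = [
    Annotation("SPAC1834.07", "GO:0005881", "PMID:10641037", "IDA", "PomBase"),
    Annotation("SPAC1834.07", "GO:0005881", "PMID:10641037", "IDA", "CACAO"),
    # No PomBase counterpart in this toy list, so this one is kept.
    Annotation("SPBC216.07c", "GO:0005634", "PMID:26912660", "IDA", "CACAO"),
]
kept, removed = filter_duplicates(annotations, filtered_sources={"CACAO"})
```

Passing the source names as a parameter means the same function could cover the UniProt, IntAct and CACAO cases, although the real process may treat each source differently.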
I had a look at your list of exact duplicate examples and most are assigned by CACAO. I can add a filter for them straight away as that process already exists; I just need to add it to the load script. Filtering for less specific annotations will need a code change.
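One possible shape for the less-specific case, as a rough sketch only: drop a non-PomBase annotation when PomBase has an annotation for the same gene and reference whose term is a descendant of the non-PomBase term. The `Annotation` record, the `filter_less_specific` function and the `ancestors` mapping (term ID to the set of its ancestor term IDs, which would have to come from the ontology) are hypothetical. Ignoring the evidence code and extensions in the comparison is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """Hypothetical annotation record; the field names are assumptions."""
    gene: str
    term: str
    reference: str
    evidence: str
    assigned_by: str


def filter_less_specific(
    annotations: Iterable[Annotation],
    ancestors: Dict[str, Set[str]],
    preferred: str = "PomBase",
    filtered_sources: Iterable[str] = ("CACAO",),
) -> Tuple[List[Annotation], List[Annotation]]:
    """Split annotations into (kept, removed).

    An annotation from filtered_sources is removed when the preferred
    source has, for the same gene and reference, a term that lists the
    annotation's term among its ancestors (i.e. a more specific term).
    """
    annotations = list(annotations)
    sources = set(filtered_sources)

    # Index the preferred source's terms by (gene, reference).
    preferred_terms: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for a in annotations:
        if a.assigned_by == preferred:
            preferred_terms[(a.gene, a.reference)].add(a.term)

    kept: List[Annotation] = []
    removed: List[Annotation] = []
    for a in annotations:
        specific_terms = preferred_terms.get((a.gene, a.reference), set())
        is_less_specific = a.assigned_by in sources and any(
            a.term in ancestors.get(t, set()) for t in specific_terms
        )
        if is_less_specific:
            removed.append(a)
        else:
            kept.append(a)
    return kept, removed
```

This sketch does not handle exact duplicates (a term is not its own ancestor in the mapping), so it would run alongside the existing duplicate filter rather than replace it.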
I just ran the CACAO filtering process on my desktop. It found 8 duplicates.
This one from your list doesn't have a duplicate anymore:
SPAC1851.03 ckb1 GO:0090053 PMID:19136623 IMP
It found one annotation that wasn't on your list:
meiotic spindle pole body GO:0035974 SPCC1183.12.1 Inferred from Direct Assay PMID:27630265
I went ahead and added a filter for CACAO annotations that are duplicates of PomBase annotations. That has removed all the exact duplicates you listed.
From @ValWood:
I noticed that quite a few are exact duplicates, which looks odd. Can we filter an annotation if it is exactly the same as, or less specific than, another annotation from the same publication?
exact duplicate:
SPAC1834.07 klp3 GO:0005881 PMID:10641037 IDA
SPAC1851.03 ckb1 GO:0090053 PMID:19136623 IMP
SPAC25G10.07c cut7 GO:0008574 PMID:27834216 IDA
SPCC736.04c gma12 GO:0000139 PMID:7522655 IDA
SPBC887.14c pfh1 GO:0031297 PMID:27611590 IMP
SPCC24B10.07 gad8 GO:0005737 PMID:26912660 IDA
SPBC216.07c tor2 GO:0005634 PMID:26912660 IDA
SPAC926.04c hsp90 GO:0016887 PMID:23664927 IDA
have more specific annotation from the same paper:
SPCC736.08 cbf11 GO:0003700 PMID:23555033 IMP
SPCC4G3.11 nur1 GO:0034506 PMID:27451393 IDA