Closed ValWood closed 6 years ago
We also filter annotations that exactly duplicate an experimental annotation, i.e. the same annotation from the same paper. We should extend this to include less specific annotations from the same paper, if we don't already.
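For illustration, here is a minimal sketch of dropping less specific annotations from the same paper. It is not the loader's actual code. It assumes annotations are simple (gene, term, reference) tuples and that a term-to-ancestors mapping is already available; the `ancestors` dictionary and all identifiers in the usage example are hypothetical.

```python
def drop_less_specific(annotations, ancestors):
    """Drop an annotation when the same gene has, from the same reference,
    an annotation to a more specific (descendant) term.

    annotations: iterable of (gene, term, reference) tuples (assumed shape).
    ancestors: dict mapping a term to the set of its ancestor terms
               (hypothetical, precomputed ontology closure).
    """
    annotations = list(annotations)

    # Collect the terms annotated for each (gene, reference) pair.
    terms_by_key = {}
    for gene, term, ref in annotations:
        terms_by_key.setdefault((gene, ref), set()).add(term)

    kept = []
    for gene, term, ref in annotations:
        others = terms_by_key[(gene, ref)] - {term}
        # Redundant if this term is an ancestor of another annotated term.
        redundant = any(term in ancestors.get(other, set()) for other in others)
        if not redundant:
            kept.append((gene, term, ref))
    return kept


# Usage with made-up identifiers:
toy_ancestors = {"GO:child": {"GO:parent"}}
toy_annotations = [
    ("gene1", "GO:parent", "PMID:1"),
    ("gene1", "GO:child", "PMID:1"),
    ("gene1", "GO:parent", "PMID:2"),
]
filtered = drop_less_specific(toy_annotations, toy_ancestors)
```

In this toy case the parent-term annotation from PMID:1 is dropped, while the parent-term annotation from PMID:2 is kept because that paper has no more specific annotation.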
The source of the PAINT annotation should eventually be http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gene-associations/submission/paint/pre-submission/
There are 8414 annotations in this file for 2340 gene products (good coverage now). I think a lot will be redundant annotation for us though...
Any negative (NOT) annotations should be filtered. These will not be useful to our community as they are negative annotations that we would not curate/expect. For example, cdc10 (a transcription factor) has a NOT annotation to asparaginase.
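A minimal sketch of the NOT filter, assuming the input is tab-separated GAF text in which the qualifier is column 4 and multiple qualifiers are separated by "|". The function names and the example lines are hypothetical, not taken from the loader.

```python
def is_negated(gaf_line):
    """Return True if the qualifier column (column 4) contains NOT."""
    columns = gaf_line.rstrip("\n").split("\t")
    if len(columns) < 4:
        return False
    return "NOT" in columns[3].split("|")


def drop_negated(gaf_lines):
    """Yield header lines and all annotation lines that are not negated."""
    for line in gaf_lines:
        if line.startswith("!") or not is_negated(line):
            yield line


# Usage with made-up lines (truncated to the first five columns):
example_lines = [
    "!gaf-version: 2.1",
    "PomBase\tGENE1\tgene1\tNOT\tGO:0000001",
    "PomBase\tGENE2\tgene2\t\tGO:0000002",
    "PomBase\tGENE3\tgene3\tNOT|contributes_to\tGO:0000003",
]
kept_lines = list(drop_negated(example_lines))
```

Splitting the qualifier on "|" rather than doing a substring match avoids accidentally dropping a line whose qualifier merely contains the letters "NOT".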
This ticket https://github.com/pombase/pombase-chado/issues/536 is partly overlapping. I'll leave it open for now as it has some useful examples for checking.
Something else to think about for the filtering. I am transferring this from https://github.com/pombase/canto/issues/1155#issuecomment-254985147 because it is more relevant here. I was confusing extensions in GOA GAF with extensions from web-services.
According to Rachel there are 98 extensions from IntAct and 4 from UniProt (this was a while ago)
Oh remember to allow "GO_central"...found my list ...this might be automatic if you load the dedicated PAINT file?
Will solve things like this.
We annotated here to MF ATP:ADP antiporter activity ISO
This is an exclusively mitochondrial carrier activity, so I annotated to mitochondrial ADP transmembrane transport and mitochondrial ATP transmembrane transport by IC to the MF.
The GOC F-P links generate: ADP transport, ATP transport, anion transmembrane transport, nucleotide transmembrane transport.
These are all redundant with my annotations. It's not a big issue because they are hidden in the summary view, but we really don't need them at all.
The evidence code ISO will be dropped in favour of the IC annotations with the above implemented.
I'd like to raise the priority of this. It would be nicer if we retained manual annotations above IEA (to reduce the numbers)
Also I noticed this week that some annotations that we sometimes filter randomly have a bit of extra information (contributes_to qualifier for example)
SPAC13G6.04 contributes_to GO:0008565 TIM22 inner membrane protein import complex subunit Tim8 (predicted)
SPAC13C5.01c contributes_to GO:0004175 20S proteasome complex subunit alpha 3 Pre9
SPAC22F3.07c contributes_to GO:0046961 F0-ATPase subunit G (predicted)
SPAC22F3.07c contributes_to GO:0046933 F0-ATPase subunit G (predicted)
SPAC222.03c contributes_to GO:0008565 Tim9-Tim10 complex subunit Tim10 (predicted)
SPAC22A12.13 contributes_to GO:0017176 pig-P subunit (predicted)
SPAC6F12.07 contributes_to GO:0015450 mitochondrial TOM complex subunit Tom20 (predicted)
SPAC9.05 contributes_to GO:0003690 ATP-dependent 3' to 5' DNA helicase, FANCM ortholog Fml1
SPAC824.06 contributes_to GO:0015450 TIM23 translocase complex subunit Tim14 (predicted)
SPAC17H9.16 contributes_to GO:0015450 mitochondrial TOM complex subunit Tom22 (predicted)
SPACUNK4.07c contributes_to GO:0005388 P-type ATPase, calcium transporting Cta4
SPAC24C9.16c contributes_to GO:0004129 cytochrome c oxidase subunit VIII (predicted)
SPAC23D3.07 contributes_to GO:0004175 20S proteasome complex subunit beta 2 Pup1
(so sometimes the "contributes to" qualifier is not present)
(because we give ISS the same precedence as IEA/NAS/TAS currently)
This is a bit of a blast from the past so we can chat tomorrow about how big a task it is...
So the order for retaining should be IC ... IEA
Just to check: is the IC code the highest priority in the list? So if there are two annotations, one with IC and one with one of the others from the list (like IEA), the IC annotation should be kept and the IEA deleted?
we arbitrarily retain non-redundant annotations with IEA/NAS/TAS/IC evidence codes (I think).
The current list is:
I have a prototype implementation (assuming I'm understanding this correctly). I won't check it in until we have a chat about it.
The current code filters out IEP and RCA annotations but those evidence codes aren't in the list in your initial comment. Should they be?
That is correct. The only proviso is if there is additional information (extension or qualifier), then additional annotations should be kept.
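To make the rule concrete, here is a rough sketch of precedence-based filtering. It is one reading of this thread, not the prototype itself. Assumptions: annotations are grouped only by identical gene and term (ontology ancestors are ignored), experimental codes are never filtered, a plain non-experimental annotation is dropped when an experimental one exists for the same gene and term, and anything carrying a qualifier or extension is always kept. The `Annotation` type, the `EXPERIMENTAL` set, and the identifiers in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

# Codes treated as experimental in this sketch (never filtered).
EXPERIMENTAL = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "HDA", "HMP"})

# Highest precedence first; the order is passed in so it can be changed per run.
TRIAL_ORDER = ("IC", "ISO", "ISS", "ISM", "TAS", "NAS",
               "IEA", "IEP", "RCA", "IBA", "IBD")


class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str
    qualifier: str = ""
    extension: str = ""


def filter_by_precedence(annotations, order=TRIAL_ORDER):
    """Keep one plain annotation per (gene, term), chosen by evidence rank."""
    rank = {code: i for i, code in enumerate(order)}
    groups = defaultdict(list)
    for ann in annotations:
        groups[(ann.gene, ann.term)].append(ann)

    kept = []
    for group in groups.values():
        has_experimental = any(a.evidence in EXPERIMENTAL for a in group)
        plain = [a for a in group
                 if a.evidence in rank and not a.qualifier and not a.extension]
        best = min(plain, key=lambda a: rank[a.evidence]) if plain else None
        for ann in group:
            if ann.evidence not in rank or ann.qualifier or ann.extension:
                # Unranked codes (e.g. experimental) or extra information: keep.
                kept.append(ann)
            elif not has_experimental and ann is best:
                kept.append(ann)
    return kept


# Usage with made-up identifiers:
toy = [
    Annotation("gene1", "GO:0000001", "IEA"),
    Annotation("gene1", "GO:0000001", "IC"),
    Annotation("gene1", "GO:0000001", "IEA", qualifier="contributes_to"),
    Annotation("gene2", "GO:0000002", "IDA"),
    Annotation("gene2", "GO:0000002", "ISO"),
    Annotation("gene3", "GO:0000003", "IBA"),
    Annotation("gene3", "GO:0000003", "IEA"),
]
result = filter_by_precedence(toy)
paint_first = filter_by_precedence(toy, order=("IBA", "IBD") + TRIAL_ORDER[:-2])
```

Because the order is just a tuple, moving IBA and IBD to the top or bottom for a checking run only requires passing a different `order`.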
It would be great if we could manipulate the order sometimes (for instance, when we check the validity of the PAINT annotations, it would be good to put IBD and IBA at the bottom, so that most of the valid ones would be ignored, making the GAF output easier to check)
oh they could be right at the bottom. do IEP IEA RCA
(we have purged all RCA and I'm working on IEP)
However, first I would like a data run using the order ...
I think that's ready to go, with IEP, IEA, RCA added at the bottom:
inferred by curator
inferred from sequence orthology
inferred from sequence or structural similarity
inferred from sequence model
traceable author statement
non-traceable author statement
inferred from electronic annotation
inferred from expression pattern
inferred from reviewed computational analysis
temporarily:
inferred from biological aspect of ancestor
inferred from biological aspect of descendant
I'm going to run this on my desktop to see how it goes.
OK, will you have an example gaf output? It will be easy for me to check from this if all is OK...
Here's the GAF output: https://www.dropbox.com/s/sbwbk43zzlz7lty/pombase-build-2018-02-14-t10.gaf.gz?dl=0
It's bigger than the usual GAF file which seems wrong? I haven't had a look at it yet.
old:
cut -f7 pombase-build-2018-02-19.gaf |sort |uniq -c
 569 EXP
6938 HDA
1535 HMP
1762 IC
7047 IDA
3920 IEA
  29 IEP
 835 IGI
   1 IKR
4536 IMP
2501 IPI
 961 ISM
6605 ISO
1858 ISS
 749 NAS
2278 ND
 352 TAS
cat pombase-build-2018-02-19.gaf |wc
42476

new:
cut -f7 pombase-build-2018-02-14-t10.gaf |sort |uniq -c
 503 EXP
6938 HDA
1535 HMP
1577 IC
6879 IDA
3341 IEA
  30 IEP
 820 IGI
   1 IKR
5870 IMP
2379 IPI
1564 ISM
5339 ISO
1487 ISS
 694 NAS
2303 ND
 413 TAS
cat pombase-build-2018-02-14-t10.gaf |wc
41673
thats a massive improvement...
I expected experimental to stay the same: EXP has gone down to 503, so I will investigate that....
6938 HDA same 1535 HMP same
IC has gone down, but now we are filtering this against experimental and we weren't previously, so that is probably fine.
IEA has dropped A LOT (579), which is good...
IPI, IMP, IDA have dropped quite a bit, not sure why... (these should not be filtered at all ??)
ND has gone up, I didn't expect this to change.
Something is not quite right, but I'm not sure what.....
It's on my desktop so there's a chance I don't have quite the same input files as the normal nightly load.
I'll run it again on my desktop with the old filtering and send you the results.
This should be the result using the same data files but the old filtering: https://www.dropbox.com/s/9z51woplp2ujd0r/pombase-build-2018-02-14-t11.gaf.gz?dl=0
old filtering, old files
 503 EXP
6938 HDA
1535 HMP
1760 IC
6879 IDA
3935 IEA
  30 IEP
 820 IGI
   1 IKR
5870 IMP
2379 IPI
 964 ISM
6606 ISO
1857 ISS
 751 NAS
2303 ND
 362 TAS
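For comparing two builds without eyeballing the columns, a small sketch that does roughly what `cut -f7 | sort | uniq -c` does and then reports the per-code differences. It assumes tab-separated GAF lines with the evidence code in column 7, and unlike the shell pipeline it skips "!" header lines. The function names and example lines are hypothetical.

```python
from collections import Counter


def evidence_counts(gaf_lines):
    """Count annotations per evidence code (GAF column 7)."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # header/comment line
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 7:
            counts[columns[6]] += 1
    return counts


def count_changes(old, new):
    """Return {code: new - old} for every code whose count differs."""
    changes = {}
    for code in sorted(set(old) | set(new)):
        delta = new.get(code, 0) - old.get(code, 0)
        if delta != 0:
            changes[code] = delta
    return changes


# Usage with made-up lines:
old_counts = evidence_counts([
    "!gaf-version: 2.1\n",
    "PomBase\tG1\tg1\t\tGO:0000001\tPMID:1\tIEA\n",
    "PomBase\tG1\tg1\t\tGO:0000002\tPMID:1\tIEA\n",
    "PomBase\tG2\tg2\t\tGO:0000003\tPMID:1\tIC\n",
])
new_counts = evidence_counts([
    "PomBase\tG1\tg1\t\tGO:0000001\tPMID:1\tIEA\n",
    "PomBase\tG2\tg2\t\tGO:0000003\tPMID:1\tIC\n",
])
changes = count_changes(old_counts, new_counts)
```

For the gzipped builds linked in this thread, the lines could be read with `gzip.open(path, "rt")` from the standard library and passed straight to `evidence_counts`.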
Perfect! ND and all experimental are the same, IC up, IEA down.
I have been looking forward to this change.... we will be under 3000 IEA this year..... they will drop when we include PAINT for sure, but I'm not in a hurry to test as this will bring its own problems requiring a clean up.
One small tweak to the order: ISO ISS ISM
That's the order I've been using.
OK good. I was confusing the order with that of the output!
Should I merge the filtering change into the main load?
Yes please!
OK, it's merged. The results will be available on Thursday morning.
It all worked OK.
Old filtering: http://curation.pombase.org/dumps/builds/pombase-build-2018-02-20/pombase-build-2018-02-20.gaf.gz
New filtering: http://curation.pombase.org/dumps/builds/pombase-build-2018-02-21/pombase-build-2018-02-21.gaf.gz
I'm going to close this and open a new ticket(s) for issues arising.
Based on chat with Kim this morning.
At present we include all EXP codes.
Then, we import the GOA file, which includes mainly IEA data, but the next version will also include PAINT IBA/IBD, and any manual annotation from UniProt or IntAct. I'd like to do a bit more checking (and probably filtering).
At present, we arbitrarily retain non-redundant annotations with IEA/NAS/TAS/IC evidence codes (I think). This means that the number of evidence codes of different types fluctuates arbitrarily between releases.
Instead, we would like to configure so that there is a precedence for evidence codes, with the ones that we are attempting to eradicate from use given a lower precedence.
So the order for retaining should be
IC
ISO (I am putting ISO/ISS/ISM above IBA/IBD in the short term, this may change)
ISS
ISM
IBA
IBD
TAS
NAS
IEA
However, first I would like a data run using the order
IC
ISO (I am putting ISO/ISS/ISM above IBA/IBD in the short term, this may change)
ISS
ISM
TAS
NAS
IEA
IBA
IBD
which will provide the smallest possible PAINT set for spot checking.