intermine / pombemine


Annotation extensions not loaded. #5

Closed rachellyne closed 2 years ago

rachellyne commented 3 years ago

This is the last column in the annotation data file. It should be in the data loader, as we load it in FlyMine and HumanMine.

danielabutano commented 3 years ago

@rachellyne in FlyMine there are no genes with non-null annotation extensions. In PombeMine most genes have no annotation extensions, but some (~2500) do. Why do you think the extensions are not loaded?

rachellyne commented 3 years ago

I have no idea - a Monday brain fail. Do we have them all though? I get 6132 lines of annotation extensions when I grep the file. Let me investigate a bit further.
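
The grep count can be cross-checked with a short script. This is a minimal sketch, assuming a tab-separated annotation file in which lines starting with `!` are comments (as in GAF files) and the extension sits in the last column, as stated earlier in the thread. `count_rows_with_extension` and its `column` parameter are hypothetical names, not part of any InterMine loader.

```python
from typing import Iterable


def count_rows_with_extension(lines: Iterable[str], column: int = -1) -> int:
    """Count data rows whose extension column is non-empty.

    Assumptions (not confirmed by the thread): fields are tab-separated,
    comment lines start with '!', and `column` indexes the extension field.
    """
    count = 0
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        try:
            value = fields[column]
        except IndexError:
            # Row is too short to contain the requested column.
            continue
        if value.strip():
            count += 1
    return count
```

Comparing this count with the number of extensions present in the mine after a build would show how many rows are being dropped.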

rachellyne commented 3 years ago

We do not seem to be loading all annotation extensions. Do we have something in the code that limits what we load? e.g. query pombase:

While in the file for sre1 there are loads:

sre1 : has_input(PomBase:SPAC17G6.02c),part_of(GO:0071456) : GO:0001228
sre1 : occurs_in(SO:0001861),part_of(GO:0071456) : GO:0000978
sre1 : has_input(PomBase:SPAC222.11),part_of(GO:0071456) : GO:0001228
sre1 : coincident_with(PomBase:SPAC222.11),existence_overlaps(GO:0071456) : GO:0000785
sre1 : coincident_with(PomBase:SPAC17A2.05),existence_overlaps(GO:0071456) : GO:0000785
sre1 : part_of(GO:2000639) : GO:0005515
sre1 : part_of(GO:1900038) : GO:0005515
sre1 : has_input(PomBase:SPAC22E12.03c) : GO:0001228
sre1 : has_input(PomBase:SPAC823.11) : GO:0001228
sre1 : has_input(PomBase:SPAC637.13c) : GO:0001228
sre1 : has_input(PomBase:SPBC19C2.09) : GO:0001228
sre1 : has_input(PomBase:SPAC19B12.10) : GO:0001228
sre1 : has_input(PomBase:SPBC887.15c) : GO:0001228
sre1 : has_input(PomBase:SPBC839.16) : GO:0001228
sre1 : has_input(PomBase:SPBC106.12c) : GO:0001228
sre1 : has_input(PomBase:SPBC1773.05c) : GO:0001228
sre1 : has_input(PomBase:SPBP16F5.04) : GO:0001228
sre1 : has_input(PomBase:SPAC22H10.13) : GO:0001228
sre1 : has_input(PomBase:SPAC13C5.04) : GO:0001228
sre1 : has_input(PomBase:SPAC14C4.01c) : GO:0001228
sre1 : has_input(PomBase:SPAC1687.14c) : GO:0001228
sre1 : has_input(PomBase:SPAC1565.02c) : GO:0001228
sre1 : has_input(PomBase:SPAC186.05c) : GO:0001228
sre1 : has_input(PomBase:SPAC186.08c) : GO:0001228
sre1 : has_input(PomBase:SPAC18G6.01c) : GO:0001228
sre1 : has_input(PomBase:SPAC18G6.12c) : GO:0001228
sre1 : has_input(PomBase:SPAC1B3.20) : GO:0001228
sre1 : has_input(PomBase:SPAC22E12.02) : GO:0001228
sre1 : has_input(PomBase:SPAC23C11.06c) : GO:0001228
sre1 : has_input(PomBase:SPAC24H6.08) : GO:0001228
sre1 : has_input(PomBase:SPAC26F1.07) : GO:0001228
sre1 : has_input(PomBase:SPAC2E1P3.05c) : GO:0001228
sre1 : has_input(PomBase:SPAC56E4.07) : GO:0001228
sre1 : has_input(PomBase:SPAC56F8.07) : GO:0001228
sre1 : has_input(PomBase:SPAC57A7.07c) : GO:0001228
sre1 : has_input(PomBase:SPAC637.03) : GO:0001228
sre1 : has_input(PomBase:SPAC8F11.08c) : GO:0001228
sre1 : has_input(PomBase:SPAC9.02c) : GO:0001228
sre1 : has_input(PomBase:SPAC9G1.07) : GO:0001228
sre1 : has_input(PomBase:SPBC119.03) : GO:0001228
sre1 : has_input(PomBase:SPBC1683.03c) : GO:0001228
sre1 : has_input(PomBase:SPBC215.11c) : GO:0001228
sre1 : has_input(PomBase:SPBC21C3.15c) : GO:0001228
sre1 : has_input(PomBase:SPBC23G7.10c) : GO:0001228
sre1 : has_input(PomBase:SPBC26H8.11c) : GO:0001228
sre1 : has_input(PomBase:SPBC29B5.04c) : GO:0001228
sre1 : has_input(PomBase:SPBC36B7.02) : GO:0001228
sre1 : has_input(PomBase:SPBC3B8.06) : GO:0001228
sre1 : has_input(PomBase:SPBC428.14) : GO:0001228
sre1 : has_input(PomBase:SPBPB7E8.01) : GO:0001228
sre1 : has_input(PomBase:SPCC4F11.03c) : GO:0001228
sre1 : has_input(PomBase:SPCC613.02) : GO:0001228
sre1 : has_input(PomBase:SPCP31B10.04) : GO:0001228
sre1 : has_input(PomBase:SPCC320.09) : GO:0001228
sre1 : coincident_with(PomBase:SPBC19C2.09),existence_overlaps(GO:0071456) : GO:0000785
sre1 : coincident_with(PomBase:SPAC222.11),existence_overlaps(GO:0071456) : GO:0000785
sre1 : coincident_with(PomBase:SPAC1687.16c),existence_overlaps(GO:0071456) : GO:0000785
sre1 : coincident_with(PomBase:SPAC17A2.05),existence_overlaps(GO:0071456) : GO:0000785
sre1 : coincident_with(SO:0001861),existence_overlaps(GO:0071456) : GO:0000785
sre1 : happens_during(GO:0071456) : GO:0045944
sre1 : has_input(PomBase:SPBC2D10.18) : GO:0001228
sre1 : has_input(PomBase:SPBC16A3.10) : GO:0001228
sre1 : has_input(PomBase:SPAC13G7.05) : GO:0001228
sre1 : has_input(PomBase:SPCP1E11.05c) : GO:0001228
sre1 : has_input(PomBase:SPAC3H8.06) : GO:0001228
sre1 : has_input(PomBase:SPAC631.02) : GO:0001228
sre1 : has_input(PomBase:SPAC23C4.13) : GO:0001228
sre1 : has_input(PomBase:SPCC970.03) : GO:0001228
sre1 : has_input(PomBase:SPAC22E12.04) : GO:0001228
sre1 : has_input(PomBase:SPCC162.05) : GO:0001228
sre1 : has_input(PomBase:SPAC1687.12c) : GO:0001228
sre1 : has_input(PomBase:SPCC4G3.04c) : GO:0001228
sre1 : has_input(PomBase:SPBC146.12) : GO:0001228
sre1 : has_input(PomBase:SPAC589.09) : GO:0001228
sre1 : has_input(PomBase:SPBC32F12.01c) : GO:0001228
sre1 : has_input(PomBase:SPBC23G7.16) : GO:0001228
sre1 : has_input(PomBase:SPAC589.12) : GO:0001228
sre1 : has_input(PomBase:SPAC589.08c) : GO:0001228
sre1 : has_input(PomBase:SPAC25B8.01) : GO:0001228
sre1 : has_input(PomBase:SPBC651.12c) : GO:0001228
sre1 : has_input(PomBase:SPAC13A11.02c) : GO:0001228
sre1 : has_input(PomBase:SPAC20G8.07c) : GO:0001228
sre1 : has_input(PomBase:SPBC16G5.18) : GO:0001228
sre1 : has_input(PomBase:SPAC630.08c) : GO:0001228
sre1 : has_input(PomBase:SPBC1709.07) : GO:0001228
sre1 : has_input(PomBase:SPAC1687.16c) : GO:0001228
sre1 : has_input(PomBase:SPAC19A8.04) : GO:0001228
sre1 : has_input(PomBase:SPBC16E9.05) : GO:0001228
sre1 : has_input(PomBase:SPCC1259.02c) : GO:0001228
sre1 : has_input(PomBase:SPAC26F1.04c) : GO:0001228
sre1 : has_input(PomBase:SPBC1105.05) : GO:0001228
sre1 : has_input(PomBase:SPBC3H7.13) : GO:0001228
sre1 : has_input(PomBase:SPAC22A12.06c) : GO:0001228
sre1 : has_input(PomBase:SPCC4B3.05c) : GO:0001228
sre1 : has_input(PomBase:SPAC222.11) : GO:0001228
sre1 : has_input(PomBase:SPAC1F5.07c) : GO:0001228
sre1 : has_input(PomBase:SPAC23C11.13c) : GO:0001228
sre1 : has_input(PomBase:SPAC22G7.07c) : GO:0001228
sre1 : has_input(PomBase:SPCC4F11.04c) : GO:0001228
sre1 : has_input(PomBase:SPAC26H5.13c) : GO:0001228
sre1 : has_input(PomBase:SPBC3E7.15c) : GO:0001228
sre1 : has_input(PomBase:SPBP4H10.11c) : GO:0001228
sre1 : has_input(PomBase:SPBC725.01) : GO:0001228
sre1 : has_input(PomBase:SPAC30C2.02) : GO:0001228
sre1 : has_input(PomBase:SPAC13C5.06c) : GO:0001228
sre1 : has_input(PomBase:SPAC4G9.07) : GO:0001228
sre1 : has_input(PomBase:SPAC26H5.11) : GO:0001228
sre1 : has_input(PomBase:SPAC1296.04) : GO:0001228
sre1 : has_input(PomBase:SPBC106.07c) : GO:0001228
sre1 : has_input(PomBase:SPCC16A11.10c) : GO:0001228
sre1 : has_input(PomBase:SPBC6B1.08c) : GO:0001228
sre1 : has_input(PomBase:SPAP8A3.02c) : GO:0001228
sre1 : has_input(PomBase:SPAC17A2.05) : GO:0001228
sre1 : has_input(PomBase:SPAC13A11.06) : GO:0001228
sre1 : has_input(PomBase:SPBP4G3.02) : GO:0001228
sre1 : has_input(PomBase:SPAC23C11.08) : GO:0001228
sre1 : has_input(PomBase:SPCC162.10) : GO:0001228
sre1 : has_input(PomBase:SPAC1093.01) : GO:0001228
sre1 : has_input(PomBase:SPBP23A10.09) : GO:0001228
sre1 : has_input(PomBase:SPAC1A6.05c) : GO:0001228
sre1 : has_input(PomBase:SPAC1565.01) : GO:0001228
sre1 : has_input(PomBase:SPBC17F3.01c) : GO:0001228
sre1 : has_input(PomBase:SPBC409.13) : GO:0001228
sre1 : has_input(PomBase:SPCC1450.13c) : GO:0001228
sre1 : has_input(PomBase:SPAC3C7.01c) : GO:0001228
sre1 : has_input(PomBase:SPAC19G12.08) : GO:0001228

rachellyne commented 3 years ago

I think we might be just taking the first annotation extension for each GO term, which isn't correct. Same for Human.

ValWood commented 3 years ago

Note that in an extension there are different separators. These have different meanings.

A pipe separates independent extensions; a comma joins the parts of a compound extension (the same notation is used in other fields).
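
The separator rules can be illustrated with a small parser. This is only a sketch under stated assumptions: the extension field uses `|` between independent extensions and `,` between the parts of a compound extension, each part has the form `relation(target)`, and targets never contain commas or pipes. `parse_extension_field` is a hypothetical helper, not an InterMine or PomBase function.

```python
import re
from typing import List, Tuple

# One relation inside an extension: (relation name, target identifier).
Relation = Tuple[str, str]

# Assumed shape of a single part, e.g. "has_input(PomBase:SPAC222.11)".
_RELATION_RE = re.compile(r"^\s*([A-Za-z_]+)\((.+)\)\s*$")


def parse_extension_field(field: str) -> List[List[Relation]]:
    """Return one list of relations per independent extension.

    '|' separates independent extensions; ',' separates the relations
    that together form one compound extension.
    """
    extensions: List[List[Relation]] = []
    for independent in field.split("|"):
        if not independent.strip():
            continue  # empty field or stray separator
        compound: List[Relation] = []
        for part in independent.split(","):
            match = _RELATION_RE.match(part)
            if match is None:
                raise ValueError(f"unrecognised extension part: {part!r}")
            compound.append((match.group(1), match.group(2)))
        extensions.append(compound)
    return extensions
```

For example, `has_input(PomBase:SPAC17G6.02c),part_of(GO:0071456)` parses as a single compound extension with two relations, whereas the same two parts joined by a pipe would parse as two independent extensions.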

Although, if you are loading the first extension, all GO data (rows) would still be loaded, so this would not account for the difference in numbers? Maybe you are not loading any annotation that has a 'compound extension' at all?

rachellyne commented 3 years ago

It looks like we have modelled the annotation extension incorrectly (maybe based on the data we had at the time, which has now changed). We only allow one annotation extension per GO annotation, while it should actually be a collection of extensions. By one annotation extension, I mean everything in the column, regardless of whether it contains multiple separated items. So in the above example, we load "has_input(PomBase:SPAC17G6.02c),part_of(GO:0071456)" for gene sre1 and GO term GO:0001228, but we do not load all the other rows from the go annotation file that have additional extensions for this gene and GO term. We will have to change the model and loader to fix this. (Val - you are right this might be a separate issue to issue #4 as it should not affect the actual number of GO terms loaded)
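
The difference between "one extension per GO annotation" and "a collection of extensions" can be sketched as follows. This is an illustration only, assuming rows have already been reduced to `(gene, GO term, extension column)` tuples; `collect_extensions` is a hypothetical name and this is not the InterMine loader code. Dropping exact repeats of the same extension string is also an assumption made here for simplicity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical simplified row: (gene symbol, GO term id, extension column).
Row = Tuple[str, str, str]


def collect_extensions(rows: Iterable[Row]) -> Dict[Tuple[str, str], List[str]]:
    """Group every extension string under its (gene, GO term) pair.

    Unlike a first-one-wins loader, later rows for the same gene and
    GO term add to the collection instead of being dropped.
    """
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for gene, go_term, extension in rows:
        extensions = grouped[(gene, go_term)]  # creates the key if missing
        # Keep the whole column value as one extension; skip empty values
        # and exact repeats (an assumption, see the note above).
        if extension and extension not in extensions:
            extensions.append(extension)
    return dict(grouped)
```

With the sre1 rows from this thread, the key `("sre1", "GO:0001228")` would then hold every `has_input(...)` extension rather than only the first one encountered.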

danielabutano commented 3 years ago

@rachellyne we need to include the new class AnnotationExtension in the model. What do you think?

ValWood commented 3 years ago

(Val - you are right this might be a separate issue to issue #4 as it should not affect the actual number of GO terms loaded)

It likely is the same issue IF you only load one extension per term, which I think is what you are saying here:

"GO:0001228, but we do not load all the other rows from the go annotation file that have additional extensions for this gene and GO term."

So hopefully this will sort the number issue.