intermine / pombemine


GO data in pombemine #4

Closed ValWood closed 2 years ago

ValWood commented 2 years ago

As mentioned earlier. When I retrieve all GO data I get

Pombemine 38,481

but in our GO file I get 44238 (after removing any GO term associated with an RNA):

user-32-29:Desktop vw253$ cat gene_association.pombase\ (8) | grep -v NCRNA | grep -v TRNA | grep -v SNRNA | grep -v SNORNA | grep -v RRNA | wc
44238
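For anyone wanting to repeat this count without the shell, the pipeline above can be sketched in Python. This is only a rough equivalent: it filters on the same substrings as the `grep -v` calls, but unlike `wc` it also skips blank lines and `!` header lines, so the totals can differ slightly. The sample rows and their IDs are made up for illustration.

```python
# Substrings excluded by the grep pipeline (case-sensitive, anywhere in the line).
RNA_MARKERS = ("NCRNA", "TRNA", "SNRNA", "SNORNA", "RRNA")


def count_non_rna_annotations(lines):
    """Count annotation lines that contain none of the RNA markers."""
    total = 0
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header comments
        if any(marker in line for marker in RNA_MARKERS):
            continue  # same effect as the chained `grep -v` calls
        total += 1
    return total


# Hypothetical rows, truncated to a few columns; a real run would pass an open file.
sample_lines = [
    "!gaf-version: 2.0\n",
    "PomBase\tSPXX0001.01\tabc1\t\tGO:0005739\n",
    "PomBase\tSPNCRNA.0001\txyz1\t\tGO:0005634\n",
    "PomBase\tSPXTRNAGLY.01\txyz2\t\tGO:0005829\n",
]
print(count_non_rna_annotations(sample_lines))
```

With a real file, something like `with open(path) as handle: count_non_rna_annotations(handle)` would do, where `path` is wherever the association file was saved.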

ValWood commented 2 years ago

Hang on... I forgot to remove tRNAs...

ValWood commented 2 years ago

Oh no, I did filter those.

I didn't look very deeply... Are all evidence codes included? These are the evidence code counts (total dataset, including the non-coding RNAs); they might be helpful for a quick recce.

1157 EXP
6857 HDA
1535 HMP
3353 IBA
1797 IC
8110 IDA
3334 IEA
18 IEP
774 IGI
1 IKR
4709 IMP
2941 IPI
1437 ISM
4865 ISO
1407 ISS
664 NAS
2013 ND
380 TAS
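A per-code tally like the one above can be produced from a GAF file with a few lines of Python. This is a minimal sketch, assuming a tab-separated GAF 2.x file in which the evidence code is the 7th column and the aspect (P, F or C) is the 9th; the sample rows are invented.

```python
from collections import Counter

EVIDENCE_COL = 6  # 7th tab-separated column in GAF 2.x
ASPECT_COL = 8    # 9th column: P, F or C


def tally_column(lines, col):
    """Count how often each value appears in one column of a GAF file."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) > col:
            counts[fields[col]] += 1
    return counts


# Hypothetical rows for illustration only.
sample_lines = [
    "!gaf-version: 2.0\n",
    "PomBase\tSPXX0001.01\tabc1\t\tGO:0005739\tPMID:0\tHDA\t\tC\n",
    "PomBase\tSPXX0001.01\tabc1\t\tGO:0006412\tPMID:0\tISO\t\tP\n",
    "PomBase\tSPXX0002.01\tabc2\t\tGO:0005634\tPMID:0\tHDA\t\tC\n",
]
for code, n in sorted(tally_column(sample_lines, EVIDENCE_COL).items()):
    print(n, code)
```

Passing `ASPECT_COL` instead gives the breakdown by aspect.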

rachellyne commented 2 years ago

We include all the evidence codes. What query are you running in PombeMine? I'm getting a different number from you. If you export the XML (button underneath the query builder), I can import it.

ValWood commented 2 years ago

I did all protein-coding genes (list upload). Then I ran the first of the templates on this list (retrieve GO annotations).

I can look closer later if this doesn't help...

When I do the same query at PomBase on the same "all genes" list I get 42407 (InterMine should be slightly higher because it will include the ND evidence code).

ValWood commented 2 years ago

This is the list I uploaded: all_pombe_proteins.txt

rachellyne commented 2 years ago

When I grab all genes from the GO file we loaded I get 5388 genes. When I run a query for all genes with GO terms I get 5388 genes. When I grab all the GO IDs from the file I get 5087, which matches the number returned by the query. So I don't think it's a problem with the actual data loading. We are loading a file from 15th September. Maybe there have been updates since then?
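The file-side half of that check (distinct genes and distinct GO IDs) could look roughly like this. It is a sketch only, assuming GAF 2.x column order (gene ID in column 2, GO ID in column 5); it is not the script that was actually used, and the sample rows are made up.

```python
def distinct_genes_and_terms(lines):
    """Return (number of distinct gene IDs, number of distinct GO IDs)."""
    genes, terms = set(), set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete annotation row
        genes.add(fields[1])  # column 2: DB object ID
        terms.add(fields[4])  # column 5: GO ID
    return len(genes), len(terms)


# Hypothetical rows for illustration only.
sample_lines = [
    "!gaf-version: 2.0\n",
    "PomBase\tSPXX0001.01\tabc1\t\tGO:0005739\n",
    "PomBase\tSPXX0001.01\tabc1\t\tGO:0006412\n",
    "PomBase\tSPXX0002.01\tabc2\t\tGO:0005739\n",
]
print(distinct_genes_and_terms(sample_lines))
```

Matching distinct counts show that every gene and term made it in, but they say nothing about how many annotation rows were loaded per gene, which is the number in dispute here.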

ValWood commented 2 years ago

There will be updates, but the change in numbers will be minimal (gene coverage is similar, and GO annotations would not usually change by more than ~100 in a month).

ValWood commented 2 years ago

If you let me know how many annotations there are by evidence code, and by aspect (MF, BP, CC), I might be able to figure out what the problem is...

rachellyne commented 2 years ago

cellular_component | 15,352
biological_process | 11,493
molecular_function | 10,161

HDA 6797
IDA 5770
ISO 4844
IMP 3749
IBA 3342
IEA 3303
ND 2014
IC 1777
ISM 1438
ISS 1408
IPI 1263
EXP 815
IGI 700
NAS 666
TAS 376
IEP 18
HMP 5
IKR 1

ValWood commented 2 years ago

OK this is what I have:

1156 EXP
6856 HDA
1535 HMP
3336 IBA
1318 IC
8077 IDA
3287 IEA
18 IEP
774 IGI
1 IKR
4668 IMP
2937 IPI
1189 ISM
4825 ISO
1409 ISS
669 NAS
378 TAS

Based on the evidence code differences, my best guess is that annotations with extensions are failing to load? v
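To make the comparison explicit, the two sets of counts posted above can be subtracted code by code. The numbers below are copied from the two comments; the helper name `largest_gaps` is just for this sketch, and codes that appear in only one list (ND) are left out.

```python
# Counts from the PomBase file (ValWood's comment above).
pombase_file = {
    "EXP": 1156, "HDA": 6856, "HMP": 1535, "IBA": 3336, "IC": 1318,
    "IDA": 8077, "IEA": 3287, "IEP": 18, "IGI": 774, "IKR": 1,
    "IMP": 4668, "IPI": 2937, "ISM": 1189, "ISO": 4825, "ISS": 1409,
    "NAS": 669, "TAS": 378,
}

# Counts from PombeMine (rachellyne's comment above).
pombemine = {
    "HDA": 6797, "IDA": 5770, "ISO": 4844, "IMP": 3749, "IBA": 3342,
    "IEA": 3303, "ND": 2014, "IC": 1777, "ISM": 1438, "ISS": 1408,
    "IPI": 1263, "EXP": 815, "IGI": 700, "NAS": 666, "TAS": 376,
    "IEP": 18, "HMP": 5, "IKR": 1,
}


def largest_gaps(a, b, top=4):
    """Return the codes whose counts differ most (a minus b), largest first."""
    shared = a.keys() & b.keys()
    diffs = {code: a[code] - b[code] for code in shared}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


for code, diff in largest_gaps(pombase_file, pombemine):
    print(code, diff)
```

On these numbers the largest gaps are in IDA, IPI, HMP and IMP, all experimental evidence codes, with the file count higher in each case.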

rachellyne commented 2 years ago

Yes, I just noticed that annotation extensions are not loading. I just talked with Daniela about it. As we load them for FlyMine and HumanMine it should be an easy fix - not sure why they wouldn't have loaded first time.
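One way to check the extension guess against the source file is to split the per-code tally by whether the annotation extension column is filled in. This is a sketch under assumptions: it expects GAF 2.x layout (evidence code in column 7, annotation extension in column 16), and `make_row` and its values are hypothetical test data, not real annotations.

```python
from collections import Counter

EVIDENCE_COL = 6    # 7th column in GAF 2.x
EXTENSION_COL = 15  # 16th column: annotation extension


def extension_counts(lines):
    """Tally evidence codes separately for rows with and without an extension."""
    with_ext, without_ext = Counter(), Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= EVIDENCE_COL:
            continue  # not a complete annotation row
        has_ext = len(fields) > EXTENSION_COL and fields[EXTENSION_COL].strip() != ""
        target = with_ext if has_ext else without_ext
        target[fields[EVIDENCE_COL]] += 1
    return with_ext, without_ext


def make_row(evidence, extension=""):
    """Build a made-up 17-column GAF row for illustration."""
    fields = ["PomBase", "SPXX0001.01", "abc1", "", "GO:0005634", "PMID:0",
              evidence, "", "C", "", "", "protein", "taxon:4896",
              "20200101", "PomBase", extension, ""]
    return "\t".join(fields) + "\n"


sample_lines = [
    make_row("IDA", "part_of(GO:0000001)"),
    make_row("IDA"),
    make_row("IPI", "part_of(GO:0000001)"),
    make_row("ISO"),
]
with_ext, without_ext = extension_counts(sample_lines)
print(dict(with_ext), dict(without_ext))
```

If the per-code counts for rows with an extension come out close to the per-code gaps between the file and PombeMine, that would support the guess; this sketch alone does not show how the loader treats those rows.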

ValWood commented 2 years ago

Extensions solved