Closed ValWood closed 3 months ago
Migrate the total annotations graph from old stats to new stats - "Cumulative annotations by type and year"
I've added code to query the same stats so we can put them on the main web site. After doing that I spent a bit of time checking the numbers. There are some oddities that we should chat about.
The current graph includes all types of evidence except IEA. Is that still correct? For the current graph we ignore "cat_act", "subunit_composition" and "pathway". But there are others that we don't ignore but maybe we should like "kegg_pombe_pathway"?
Here are the current annotation totals that we have in the graph:
annotation_type | count
--------------------------------------+--------
EC numbers | 411
PSI-MOD | 65442
PomBase family or domain | 1859
PomBase gene characterisation status | 5145
PomBase gene products | 10041
PomGeneExProt | 1306
PomGeneExRD | 731
PomGeneExRNA | 8783
biological_process | 10492
cellular_component | 16619
complementation | 458
ex_tools | 22
fission_yeast_phenotype | 185805
gene_ex | 33603
genome_org | 163
interacts_genetically | 4017
interacts_physically | 3213
kegg_pombe_pathway | 4096
m_f_g | 76
misc | 423
molecular_function | 12991
mondo | 4504
name_description | 1000
sequence | 881
species_dist | 25244
warning | 1673
Currently the "PomBase gene products", "PomGeneExProt", "PomGeneExRD" and "PomGeneExRNA" counts are added to the "other" count. Maybe we should be ignoring "PomBase gene products"? It's included because they are stored in the same table as normal annotations.
Because the "PomBase gene products" don't have an annotation date they all appear in the "<2004" count. That seems wrong too.
Also I think it's an oversight that the "PomGeneEx" counts are added to the "other" count rather than the "Gene expression" count.
Yes, we should definitely exclude KEGG because that is an import, not our annotation.
We can include these: "cat_act", "subunit_composition".
I'm not sure if we display "pathway" could you check?
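For reference, a minimal Python sketch of the grouping discussed above: KEGG is excluded, the "PomGeneEx" types are counted as gene expression rather than "other", and anything unmapped falls into "Other". The category names and the `CATEGORY_BY_TYPE` mapping are assumptions for illustration, not the actual website code. The sample counts are taken from the table above.

```python
from collections import Counter

# Assumption: only the KEGG exclusion is agreed in this thread;
# the treatment of "pathway" is still an open question.
EXCLUDED_TYPES = {"kegg_pombe_pathway"}

# Hypothetical mapping from annotation type to graph category.
CATEGORY_BY_TYPE = {
    "biological_process": "GO",
    "cellular_component": "GO",
    "molecular_function": "GO",
    "fission_yeast_phenotype": "Phenotype",
    "interacts_genetically": "Interactions",
    "interacts_physically": "Interactions",
    "gene_ex": "Gene expression",
}


def graph_category(annotation_type):
    """Return the graph category for a type, or None if it is excluded."""
    if annotation_type in EXCLUDED_TYPES:
        return None
    if annotation_type.startswith("PomGeneEx"):
        return "Gene expression"
    return CATEGORY_BY_TYPE.get(annotation_type, "Other")


def totals_by_category(counts):
    """Sum per-type counts into per-category totals, skipping excluded types."""
    totals = Counter()
    for annotation_type, count in counts.items():
        category = graph_category(annotation_type)
        if category is not None:
            totals[category] += count
    return dict(totals)


sample_counts = {
    "PomGeneExProt": 1306,
    "PomGeneExRD": 731,
    "PomGeneExRNA": 8783,
    "gene_ex": 33603,
    "kegg_pombe_pathway": 4096,
    "misc": 423,
}
print(totals_by_category(sample_counts))
```

With this sample the KEGG rows are dropped entirely and the three "PomGeneEx" counts are added to the "gene_ex" count.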
It's taken a bit of refactoring but here's the work in progress. Please ignore the value for "Other" as that needs fixing.
The graph from Canto should be easy to implement now that all the code and data has been moved to the main website.
I'm not sure if we display "pathway" could you check?
Do you mean on the gene pages? It doesn't look like it. There are only 27 pathway annotations, all from the contig files. eg.
FT CDS complement(4252403..4252918)
FT /primary_name="pcr1"
FT /product="DNA-binding transcription factor Pcr1"
FT /synonym="mts2"
FT /systematic_id="SPAC21E11.03c"
...
FT /controlled_curation="term=pathway, links stress-activated
FT MAPK (Sty1) pathway to cAMP-dependent protein kinase
FT (Pka1) pathway; db_xref=PMID:15448137; date=20090104"
We don't need to, they should all be covered by GO pathway annotation. If you could put the list in the curation tracker, I can check this and delete them.
Migrate the total annotations graph from old stats to new stats
That's mostly done for tonight's load. There is a bit of tuning to do.
It's on my desktop only for now: https://desktop.kmr.nz/curation_stats
From the last Zoom: remove the white space between the vertical segments of the graph
I've done that, but I think we could tweak the colours a bit so that the interaction bars are easier to see.
I've made a quick fix to the palette so the interaction bars stand out more (from tomorrow). Sorry, it's not the best mix of colours. I'll work on that.
Rename old stats to "community curation stats" in menu (eventually we will migrate all to "Full curation statistics")
Is it time to add the new stats page to the menu?
Yes, but cumulative annotations by year doesn't appear to be showing 398196 annotations?
Part of that is probably because only 375798 have a date as part of the annotation in Chado.
Here is a summary of the annotations in Chado with no date:
count | base_cv_name
-------+--------------------------------------
411 | EC numbers
24 | PSI-MOD
5145 | PomBase gene characterisation status
68221 | PomBase gene products
28 | PomGeneExRNA
2 | cat_act
22 | complementation
6 | external_link
77 | genome_org
12 | m_f_g
32 | misc
2851 | mondo
50 | name_description
15 | pathway
1389 | pombase_family_or_domain
709 | sequence
718 | species_dist
5 | subunit_composition
225 | warning
I think in the old stats we put "5145 | PomBase gene characterisation status" as pre 2004 because it always existed.
I can't see anything like that in the old code.
This is a bug though:
68221 | PomBase gene products
When I query Chado manually I see 10041 gene products. I'll fix that.
I noticed because this graph is close to 400,000: https://curation.pombase.org/pombe/stats/annotation and the front page number is 398196 annotations, and we have a block of stuff that we class as pre 2004.
Sorry, I mis-read the Canto code. Everything without a date is put in the "<2004" bucket.
The new graph doesn't do that - I'll fix it.
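As a rough illustration of the bucketing described above, here is a Python sketch that puts every annotation without a date into the "<2004" bucket and then builds cumulative totals per year. The function name and the input format (a list of years, with `None` for undated annotations) are assumptions. Putting dated annotations from before 2004 in the same bucket is also an assumption, based only on the bucket's name.

```python
from collections import Counter
from itertools import accumulate

FIRST_YEAR = 2004
UNDATED_BUCKET = "<2004"


def cumulative_by_year(annotation_years, last_year):
    """Return (bucket, cumulative count) pairs from "<2004" up to last_year.

    annotation_years: iterable of int years, or None for annotations
    that have no date. Years after last_year are not plotted.
    """
    per_bucket = Counter()
    for year in annotation_years:
        if year is None or year < FIRST_YEAR:
            # Undated annotations (and, by assumption, any dated before
            # 2004) all go in the first bucket.
            per_bucket[UNDATED_BUCKET] += 1
        else:
            per_bucket[year] += 1

    buckets = [UNDATED_BUCKET] + list(range(FIRST_YEAR, last_year + 1))
    running = accumulate(per_bucket[bucket] for bucket in buckets)
    return list(zip(buckets, running))


print(cumulative_by_year([None, None, 2004, 2006, 2006], last_year=2006))
```

Because the undated annotations sit in the first bucket, they contribute to every later cumulative total, so the final bar should match the overall annotation count.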
Done for Saturday morning.
The total annotations in the graph still don't quite match the summary table (395620 vs 398367). I'll fix that too.
After some confusion I realised that the total annotations count (398367) was wrong in the summary section of the stats page. It's more like 395620. The total is lower than the Canto stats page because it doesn't include the KEGG annotations.
I've checked in a fix for Friday night's load.
We shouldn't include the KEGG annotations as we didn't create them
Hi Val.
I haven't implemented these tables on the new pombase.org stats page. Unless you think it isn't worthwhile, I'll do that before swapping the links in the menus.
we don't have these tables on the old pages. I think the graphs are sufficient.
From the new page, we should link to the Canto page too, because that's where I get community metrics (but we can remove any duplicate graphs)
we don't have these tables on the old pages.
But that's where I got the screenshots?
Ah I see, the data is there under "view table data". I never look at that. I only look at the graphs, and the graphs are there. Maybe we don't need the raw data?
OK, I'll leave it out for now. Let me know if it becomes useful in future.
I forgot about the help text above the graphs. I think we can trim it down a little bit and make it available via a help icon or mouse-over.
Rename to "Literature and curation metrics". Change URL to /metrics.
That's done.
New URL from tomorrow: https://www.pombase.org/metrics
Re labels, agreed. Here is a shorter version for the cumulative totals:
Cumulative totals of manually curated annotations over time.
Totals include annotations made by PomBase curators and fission yeast community but exclude annotations imported from other sources, and GO annotations based on computational methods (IEA evidence). All annotation types are described in the [PomBase documentation]
I don't even think we need the other help text, it's in the titles?
I've added an information icon with a mouse-over:
I don't even think we need the other help text, it's in the titles?
Yep, that makes sense. I'll leave those as-is.
I've now changed the links over to the new metrics page and removed the graphs from the Canto stats page that are available on the new page.
I've kept a link to the Canto stats page in the About menu but let's chat about that on the call. Maybe it would be better under the "Community" menu.
Actions from Zoom:
Latest graph after grouping evidence codes, fixing the duplicate data and using only every second gaf file from 2018-2024 (because we have monthly gaf files for those years):
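For reference, the thinning of the GAF snapshots mentioned above could be sketched in Python roughly as follows. This keeps every second snapshot dated 2018-2024 and every snapshot outside that range. The function name and the use of plain `date` values to stand in for GAF files are assumptions, not the script actually used.

```python
from datetime import date


def thin_snapshots(snapshot_dates, first_year=2018, last_year=2024):
    """Keep every second snapshot within the given years, and all others."""
    kept = []
    seen_in_range = 0
    for snapshot in sorted(snapshot_dates):
        if first_year <= snapshot.year <= last_year:
            # Monthly files in this range: keep the 1st, 3rd, 5th, ...
            if seen_in_range % 2 == 0:
                kept.append(snapshot)
            seen_in_range += 1
        else:
            kept.append(snapshot)
    return kept


snapshots = [
    date(2017, 6, 1),
    date(2018, 1, 1),
    date(2018, 2, 1),
    date(2018, 3, 1),
    date(2018, 4, 1),
    date(2025, 1, 1),
]
for kept_date in thin_snapshots(snapshots):
    print(kept_date.isoformat())
```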
Move "396125 annotations" under "4527 curated publications ..." table
I did it like this because it looked better than having a legend underneath:
Actions from Zoom:
- Move annotation type counts into a legend for the annotation types graphs.
- Move "396125 annotations" under "4527 curated publications ..." table.
- Make "Cumulative annotation type ..." graph narrower.
All done now. The changes will be visible on Wednesday morning.
great!
I will do an announcement for the new metrics with brief explanations
closing. There are lots of things to announce but I pick them up from closed tickets with "announce"
I've added the 2024-09-01 data to the graph: https://www.pombase.org/assets/pombase_history_go_ev_codes.svg https://www.pombase.org/metrics
Thanks, a lot of IEA filtering for a tiny drop ....