pombase / website

PomBase website v2
MIT License

statistics pages #2142

Closed ValWood closed 2 months ago

ValWood commented 7 months ago
  1. [x] Migrate the total annotations graph from old stats to new stats - "Cumulative annotations by type and year"
  2. [x] Remove genes and annotations per paper from old stats - "Annotations/genes for low throughput publication in 5-year intervals"
  3. [x] Rename old stats to "community curation stats" in menu (eventually we will migrate all to "Full curation statistics")
  4. [ ] Announce the new page
kimrutherford commented 7 months ago

Migrate the total annotations graph from old stats to new stats - "Cumulative annotations by type and year"

I've added code to query the same stats to put them on main web site. After doing that I spent a bit of time checking the numbers. There are some oddities that we should chat about.

The current graph includes all types of evidence except IEA. Is that still correct? For the current graph we ignore "cat_act", "subunit_composition" and "pathway". There are others that we don't ignore but perhaps should, such as "kegg_pombe_pathway".

Here are the current annotation totals that we have in the graph:

           annotation_type            | count  
--------------------------------------+--------
 EC numbers                           |    411
 PSI-MOD                              |  65442
 PomBase family or domain             |   1859
 PomBase gene characterisation status |   5145
 PomBase gene products                |  10041
 PomGeneExProt                        |   1306
 PomGeneExRD                          |    731
 PomGeneExRNA                         |   8783
 biological_process                   |  10492
 cellular_component                   |  16619
 complementation                      |    458
 ex_tools                             |     22
 fission_yeast_phenotype              | 185805
 gene_ex                              |  33603
 genome_org                           |    163
 interacts_genetically                |   4017
 interacts_physically                 |   3213
 kegg_pombe_pathway                   |   4096
 m_f_g                                |     76
 misc                                 |    423
 molecular_function                   |  12991
 mondo                                |   4504
 name_description                     |   1000
 sequence                             |    881
 species_dist                         |  25244
 warning                              |   1673

Currently the "PomBase gene products", "PomGeneExProt", "PomGeneExRD" and "PomGeneExRNA" counts are added to the "other" count. Maybe we should be ignoring "PomBase gene products"? It's included because they are stored in the same table as normal annotations.

Because the "PomBase gene products" don't have an annotation date they all appear in the "<2004" count. That seems wrong too.

Also I think it's an oversight that the "PomGeneEx" counts are added to the "other" count rather than the "Gene expression" count.
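
A minimal sketch of the filtering and grouping discussed in this comment, not the actual PomBase code. It assumes the rows arrive as `(annotation_type, evidence_code, count)` tuples; the names `IGNORED_TYPES`, `DISPLAY_GROUP` and `group_counts` are hypothetical, and placing "gene_ex" in the "Gene expression" group is an assumption.

```python
from collections import Counter

# Types the thread says the current graph ignores.
IGNORED_TYPES = {"cat_act", "subunit_composition", "pathway"}

# Hypothetical grouping table: only the PomGeneEx* move is proposed in the
# thread; placing "gene_ex" in the same group is an assumption.
DISPLAY_GROUP = {
    "PomGeneExProt": "Gene expression",
    "PomGeneExRD": "Gene expression",
    "PomGeneExRNA": "Gene expression",
    "gene_ex": "Gene expression",
}


def group_counts(rows):
    """Sum counts per display group.

    rows: iterable of (annotation_type, evidence_code, count) tuples.
    IEA rows and ignored types are skipped; unmapped types go to "Other".
    """
    totals = Counter()
    for annotation_type, evidence_code, count in rows:
        if evidence_code == "IEA" or annotation_type in IGNORED_TYPES:
            continue
        totals[DISPLAY_GROUP.get(annotation_type, "Other")] += count
    return totals
```

With a lookup table like this, moving a type between groups (for example the "PomGeneEx" counts out of "Other") is a one-line change rather than a change to the counting logic.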

ValWood commented 7 months ago

Yes, we should definitely exclude KEGG because that is an import, not our annotation.

We can include these: For the current graph we ignore "cat_act", "subunit_composition"

I'm not sure if we display "pathway". Could you check?

kimrutherford commented 7 months ago

It's taken a bit of refactoring but here's the work in progress. Please ignore the value for "Other" as that needs fixing.

The graph from Canto should be easy to implement now that all the code and data has been moved to the main website.

image

kimrutherford commented 7 months ago

I'm not sure if we display "pathway". Could you check?

Do you mean on the gene pages? It doesn't look like it. There are only 27 pathway annotations, all from the contig files. eg.

FT   CDS             complement(4252403..4252918)
FT                   /primary_name="pcr1"
FT                   /product="DNA-binding transcription factor Pcr1"
FT                   /synonym="mts2"
FT                   /systematic_id="SPAC21E11.03c"
...
FT                   /controlled_curation="term=pathway, links stress-activated
FT                   MAPK (Sty1) pathway to cAMP-dependent protein kinase
FT                   (Pka1) pathway; db_xref=PMID:15448137; date=20090104"
ValWood commented 7 months ago

We don't need to; they should be covered by GO pathway annotation. If you could put the list in the curation tracker, I can check this and delete them.

kimrutherford commented 7 months ago

Migrate the total annotations graph from old stats to new stats

That's mostly done for tonight's load. There is a bit of tuning to do.

It's on my desktop only for now: https://desktop.kmr.nz/curation_stats

image

kimrutherford commented 7 months ago

From the last Zoom: remove the white space between the vertical segments of the graph

kimrutherford commented 7 months ago

remove the white space between the vertical segments of the graph

I've done that, but I think we could tweak the colours a bit so that the interaction bars are easier to see.

https://www.pombase.org/curation_stats

image

kimrutherford commented 6 months ago

I've made a quick fix to the palette so the interaction bars stand out more (from tomorrow). Sorry, it's not the best mix of colours. I'll work on that.

image

kimrutherford commented 6 months ago

Rename old stats to "community curation stats" in menu (eventually we will migrate all to "Full curation statistics")

Is it time to add the new stats page to the menu?

https://www.pombase.org/curation_stats

ValWood commented 6 months ago

Yes, but cumulative annotations by year doesn't appear to be showing 398196 annotations?

kimrutherford commented 6 months ago

Yes, but cumulative annotations by year doesn't appear to be showing 398196 annotations?

Part of that is probably because only 375798 have a date as part of the annotation in Chado.

kimrutherford commented 6 months ago

Here is a summary of the annotations in Chado with no date:

 count |             base_cv_name             
-------+--------------------------------------
   411 | EC numbers
    24 | PSI-MOD
  5145 | PomBase gene characterisation status
 68221 | PomBase gene products
    28 | PomGeneExRNA
     2 | cat_act
    22 | complementation
     6 | external_link
    77 | genome_org
    12 | m_f_g
    32 | misc
  2851 | mondo
    50 | name_description
    15 | pathway
  1389 | pombase_family_or_domain
   709 | sequence
   718 | species_dist
     5 | subunit_composition
   225 | warning
ValWood commented 6 months ago

I think in the old stats we put

5145 | PomBase gene characterisation status as pre 2004 because it always existed.

kimrutherford commented 6 months ago

I think in the old stats we put 5145 | PomBase gene characterisation status as pre 2004 because it always existed.

I can't see anything like that in the old code.

This is a bug though:

68221 | PomBase gene products

When I query Chado manually I see 10041 gene products. I'll fix that.

ValWood commented 6 months ago

I noticed because this graph is close to 400,000: https://curation.pombase.org/pombe/stats/annotation and the front page number is 398196 annotations, and we have a block of stuff that we class as pre-2004.

kimrutherford commented 6 months ago

I think in the old stats we put 5145 | PomBase gene characterisation status as pre 2004 because it always existed.

I can't see anything like that in the old code.

Sorry, I mis-read the Canto code. Everything without a date is put in the "<2004" bucket.

The new graph doesn't do that - I'll fix it.
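
A minimal sketch of the bucketing rule described here, assuming each annotation is reduced to its year (or `None` when Chado has no date). The function names are hypothetical, and treating dated annotations from before 2004 the same way as undated ones is an assumption, not something the thread confirms.

```python
from collections import Counter
from itertools import accumulate

FIRST_YEAR = 2004  # first year with its own bar; earlier rows share "<2004"


def bucket_label(year):
    """Return the graph bucket for one annotation year (None = no date)."""
    if year is None or year < FIRST_YEAR:
        return "<2004"
    return str(year)


def cumulative_by_year(years, last_year):
    """Return [(label, running_total), ...] from "<2004" up to last_year."""
    counts = Counter(bucket_label(y) for y in years)
    labels = ["<2004"] + [str(y) for y in range(FIRST_YEAR, last_year + 1)]
    # Counter returns 0 for years with no annotations, so gaps stay flat.
    running = accumulate(counts[label] for label in labels)
    return list(zip(labels, running))
```

Because undated rows are counted rather than dropped, the final running total equals the number of input annotations, which is the property the graph total was being checked against.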

kimrutherford commented 6 months ago

Sorry, I mis-read the Canto code. Everything without a date is put in the "<2004" bucket.

The new graph doesn't do that - I'll fix it.

Done for Saturday morning.

The total annotations in the graph still don't quite match the summary table (395620 vs 398367). I'll fix that too.

image

kimrutherford commented 6 months ago

After some confusion I realised that the total annotations count (398367) was wrong in the summary section of the stats page. It's more like 395620. The total is lower than the Canto stats page because it doesn't include the KEGG annotations.

I've checked in a fix for Friday night's load.

ValWood commented 6 months ago

We shouldn't include the KEGG annotations as we didn't create them

kimrutherford commented 6 months ago

Hi Val.

I haven't implemented these tables on the new pombase.org stats page. Unless you think it isn't worthwhile, I'll do that before swapping the links in the menus.

image

image

ValWood commented 6 months ago

we don't have these tables on the old pages. I think the graphs are sufficient.

From the new page, we should link to the Canto page too, because that's where I get community metrics (but we can remove any duplicate graphs)

kimrutherford commented 6 months ago

we don't have these tables on the old pages.

But that's where I got the screenshots?

ValWood commented 6 months ago

Ah I see, the data is there under "view table data". I never look at that. I only look at the graphs, and the graphs are there. Maybe we don't need the raw data?

kimrutherford commented 6 months ago

Maybe we don't need the raw data?

OK, I'll leave it out for now. Let me know if it becomes useful in future.

kimrutherford commented 6 months ago

I forgot about the help text above the graphs. I think we can trim it down a little and make it available via a help icon or mouse-over.

image

image

kimrutherford commented 5 months ago

Rename to "Literature and curation metrics". Change URL to /metrics.

That's done.

New URL from tomorrow: https://www.pombase.org/metrics

ValWood commented 5 months ago

Re labels, agreed, Here is a shorter version for the cumulative totals:

Cumulative totals of manually curated annotations over time.

Totals include annotations made by PomBase curators and the fission yeast community but exclude annotations imported from other sources and GO annotations based on computational methods (IEA evidence). All annotation types are described in the [PomBase documentation]

ValWood commented 5 months ago

I don't even think we need the other help text; it's in the titles?

kimrutherford commented 5 months ago

I've added an information icon with a mouse-over:

image

I don't even think we need the other help text; it's in the titles?

Yep, that makes sense. I'll leave those as-is.

kimrutherford commented 5 months ago

I've now changed the links over to the new metrics page and removed the graphs from the Canto stats page that are available on the new page.

I've kept a link to the Canto stats page in the About menu but let's chat about that on the call. Maybe it would be better under the "Community" menu.

image

kimrutherford commented 5 months ago

Actions from Zoom:

kimrutherford commented 5 months ago

Latest graph after grouping evidence codes, fixing the duplicate data and using only every second gaf file from 2018-2024 (because we have monthly gaf files for those years):

figure
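
A minimal sketch of the "every second gaf file" thinning mentioned above, assuming each GAF snapshot is identified by its release date. The function name `thin_snapshots` and the choice to keep the first snapshot of each pair are assumptions; the thread only states that every second file from 2018-2024 was used.

```python
from datetime import date


def thin_snapshots(dates, start_year=2018, end_year=2024):
    """Keep every second snapshot inside [start_year, end_year], all others."""
    kept = []
    seen_in_range = 0
    for d in sorted(dates):
        if start_year <= d.year <= end_year:
            # Keep the 1st, 3rd, 5th ... snapshot within the dense period.
            if seen_in_range % 2 == 0:
                kept.append(d)
            seen_in_range += 1
        else:
            kept.append(d)
    return kept
```

Thinning only the monthly period keeps the spacing between plotted points roughly comparable with the earlier, sparser releases.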

kimrutherford commented 5 months ago

Move "396125 annotations" under "4527 curated publications ..." table

I did it like this because it looked better than having a legend underneath:

image

kimrutherford commented 5 months ago

Actions from Zoom:

  • Move annotation type counts into a legend for the annotation types graphs.
  • Move "396125 annotations" under "4527 curated publications ..." table.
  • Make "Cumulative annotation type ..." graph narrower.

All done now. The changes will be visible on Wednesday morning.

ValWood commented 5 months ago

great!

ValWood commented 3 months ago

I will do an announcement for the new metrics with brief explanations

ValWood commented 2 months ago

closing. There are lots of things to announce but I pick them up from closed tickets with "announce"

kimrutherford commented 1 month ago

I've added the 2024-09-01 data to the graph: https://www.pombase.org/assets/pombase_history_go_ev_codes.svg https://www.pombase.org/metrics

ValWood commented 1 month ago

Thanks, a lot of IEA filtering for a tiny drop ....