pombase / pombase-chado

PomBase code for accessing Chado
MIT License

stats required for sustainability issues #1122

Closed ValWood closed 9 months ago

ValWood commented 10 months ago

This is related to the GBCR (and was a question at my departmental talk last week!). When will we reach sustainability (i.e. be able to add data to the database as soon as it is published)?

I don't fully know but I think we would have definitely achieved it during this grant if we had 3 curators.

I'd like to see "The difference between the number of ‘curatable’ (6695) and ‘curated’ (4423) publications (2272) provides an alternative measure of curation completion (66% for PomBase)."

over time, because it is definitely going up more slowly (mainly since Antonia left), and probably now decreasing since Midori left and I began GO work. There are just too many other tasks to do for me to focus fully on curation.

This information would enable us to gauge what curator capacity would be required for sustainability.
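For reference, the completion measure quoted above is simple arithmetic on two counts. A minimal sketch in Python, using the figures from the quoted sentence; the function name `curation_completion` is made up for illustration and is not part of any PomBase code:

```python
def curation_completion(curated: int, curatable: int) -> tuple[int, float]:
    """Return (uncurated backlog, percent curated) for one point in time."""
    if curatable <= 0:
        raise ValueError("curatable must be a positive count")
    backlog = curatable - curated          # publications still to curate
    percent = 100.0 * curated / curatable  # curation completion
    return backlog, percent


# Figures taken from the sentence quoted in this comment.
backlog, percent = curation_completion(curated=4423, curatable=6695)
print(backlog, round(percent, 1))  # 2272 66.1
```

Tracking these two values at regular intervals gives the "over time" view asked for here.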

ValWood commented 10 months ago

Year curatable curated (curatable, submitted date)
2011 2000 200 (10%)
2012 2500 400 (20)
2013 2800
2015 68
2023 5000 3600 (66%)

kimrutherford commented 10 months ago

Here's a one-off set of stats of curated vs curatable. I'm working on generating these numbers as part of the nightly pipeline.

Here are the numbers per year: stats-year.tsv.txt and per month: stats.tsv.txt

date curated curatable uncurated percent_curated
1969 0 1 1 0.0
1970 0 2 2 0.0
1971 0 4 4 0.0
1972 0 4 4 0.0
1973 0 11 11 0.0
1974 0 18 18 0.0
1975 0 22 22 0.0
1976 0 28 28 0.0
1977 0 43 43 0.0
1978 0 48 48 0.0
1979 0 55 55 0.0
1980 0 60 60 0.0
1981 0 72 72 0.0
1982 0 82 82 0.0
1983 0 89 89 0.0
1984 0 104 104 0.0
1985 0 119 119 0.0
1986 0 152 152 0.0
1987 0 182 182 0.0
1988 0 215 215 0.0
1989 0 258 258 0.0
1990 0 318 318 0.0
1991 0 395 395 0.0
1992 0 485 485 0.0
1993 0 582 582 0.0
1994 0 706 706 0.0
1995 0 843 843 0.0
1996 0 1004 1004 0.0
1997 0 1160 1160 0.0
1998 0 1395 1395 0.0
1999 0 1600 1600 0.0
2000 0 1813 1813 0.0
2001 0 2021 2021 0.0
2002 0 2253 2253 0.0
2003 0 2486 2486 0.0
2004 0 2725 2725 0.0
2005 0 2958 2958 0.0
2006 0 3159 3159 0.0
2007 0 3385 3385 0.0
2008 0 3634 3634 0.0
2009 0 3862 3862 0.0
2010 0 4086 4086 0.0
2011 0 4300 4300 0.0
2012 371 4559 4188 8.1
2013 827 4805 3978 17.2
2014 1729 5043 3314 34.3
2015 2309 5268 2959 43.8
2016 2735 5476 2741 49.9
2017 3023 5686 2663 53.2
2018 3336 5881 2545 56.7
2019 3683 6084 2401 60.5
2020 3915 6243 2328 62.7
2021 4119 6434 2315 64.0
2022 4254 6594 2340 64.5
2023 4434 6708 2274 66.1
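
One way to read the table against the sustainability question is to compare, year by year, how many publications were newly curated with how many became newly curatable. The sketch below assumes the table is cumulative (each row is the running total at that date), which the steadily increasing counts suggest but the thread does not state. The name `yearly_changes` is hypothetical, and only the rows from 2011 onwards are copied in.

```python
# (year, cumulative curated, cumulative curatable), copied from the table above.
ROWS = [
    (2011, 0, 4300),
    (2012, 371, 4559),
    (2013, 827, 4805),
    (2014, 1729, 5043),
    (2015, 2309, 5268),
    (2016, 2735, 5476),
    (2017, 3023, 5686),
    (2018, 3336, 5881),
    (2019, 3683, 6084),
    (2020, 3915, 6243),
    (2021, 4119, 6434),
    (2022, 4254, 6594),
    (2023, 4434, 6708),
]


def yearly_changes(rows):
    """Return (year, newly curated, newly curatable, backlog change) per year.

    Assumes each row holds cumulative totals, so consecutive rows are
    subtracted. A positive backlog change means the uncurated pile grew.
    """
    changes = []
    for (_, prev_cur, prev_able), (year, cur, able) in zip(rows, rows[1:]):
        d_cur = cur - prev_cur
        d_able = able - prev_able
        changes.append((year, d_cur, d_able, d_able - d_cur))
    return changes


print("year\tcurated\tcuratable\tbacklog_change")
for year, d_cur, d_able, d_backlog in yearly_changes(ROWS):
    print(f"{year}\t{d_cur}\t{d_able}\t{d_backlog:+d}")
```

Under that assumption, the backlog shrinks in any year where the newly curated count exceeds the newly curatable count. If the final row covers only part of a year, its changes will understate the full-year figures.
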
kimrutherford commented 10 months ago

Percentage curated:

image

kimrutherford commented 10 months ago

image

kimrutherford commented 10 months ago

image

ValWood commented 10 months ago

This is interesting. I'm glad it isn't dropping, but you can clearly see the change from 3 to 2 to 1 curators.

kimrutherford commented 10 months ago

I've started a stats page that's updated nightly: https://pombase.org/curation_stats

There are no links to it yet.

ValWood commented 10 months ago

We could also put the number of annotations/genes per paper on this page, because in addition to the changes in numbers that follow staff changes, we also need to explain why fewer papers are curated overall as time goes on.

kimrutherford commented 10 months ago

We could also put the number of annotations/genes per paper on this page.

I've started work on that: https://www.pombase.org/curation_stats

Please ignore the annotations per publication graphs for now as they are broken.

The numbers plotted in the "Average annotated genes per publications by year range" graph are slightly different to the Canto stats page. I did quite a bit of head scratching over that but I think the new graph is correct and the Canto graph is slightly wrong.

For the Canto graph, genes are only counted if they have annotation in a Canto session. But there are some publications that have approved sessions with no Canto annotation. Examples:

The new graph includes those genes. The min and max numbers are slightly more extreme in the new graph:

kimrutherford commented 10 months ago

Please ignore the annotations per publication graphs for now as they are broken.

Those graphs are much improved now. The numbers don't match the Canto graphs but they're in the right ballpark. I'll investigate the differences later.

https://www.pombase.org/curation_stats

image

ValWood commented 10 months ago

It looks good though, this will be very useful

kimrutherford commented 10 months ago

The 2021-2025 bucket on the "Average high throughput annotations per publication by year range" graph is very large thanks to that big paper from Jurg's lab, and there haven't been many other papers in that range to bring the average down. I wonder if there is a different way to present these numbers?
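
One possible option, offered only as an illustration and not something decided in this thread, would be to plot the median per bucket rather than the mean, since the median is far less sensitive to a single very large paper. The counts below are invented for the example and are not PomBase data:

```python
from statistics import mean, median

# Hypothetical annotation counts per publication for one year-range bucket.
# The single large value stands in for one very big high-throughput paper.
counts = [120, 80, 200, 150, 50000]

print(mean(counts))    # dominated by the one large paper
print(median(counts))  # reflects a typical paper in the bucket
```

The trade-off is that a median hides the large paper entirely, so it may need a note on the page.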

ValWood commented 10 months ago

Yes, this is a problem. If I were talking through the data I could explain it. We can just leave out the HTP graph, because the thing I am trying to demo is LTP (that is where most of the effort is).

ValWood commented 9 months ago

Is this ticket finished?

kimrutherford commented 9 months ago

Yep, let's close this. When I get back to this, I'll open a new issue for new graphs.