pombase / pombase-chado

PomBase code for accessing Chado

community curation response rate #651

Closed ValWood closed 7 months ago

ValWood commented 6 years ago

In the stats, we report the response rate as a percentage (currently around 42%). It goes up, but very slowly. It would be nice to have a cumulative graph showing the growth over time eventually (the only way is up)

ValWood commented 6 years ago

I mentioned in today's group meeting that this had gone up to 43.6% last week... I think it's statistically significant because a) it's a complete dataset, not a sample, in which case you don't need statistics to explain the increase? and b) the number is just the ratio of curated vs. non-curated out of the sessions sent out? And it is continually increasing...

If we plotted the response rate I'm sure it would show a continuously upward trajectory... which is basically what we are interested in... I want to get to 50% this year...

@kimrutherford is it easy to include this as a graph in the stats? It would be much nicer than the number. It's not urgent but it might be a nice quick task if you want something "alternative" to the big browser elephant....

Does that all make sense?

kimrutherford commented 6 years ago

is it easy to include this as a graph in the stats?

All the data is available so it wouldn't be too hard.

There are some edge cases to think about, like this session, which was sent out twice in different years: https://curation.pombase.org/pombe/view/object/curs/4315?model=track Should that count towards 2016 or 2017?

ValWood commented 6 years ago

I envisaged that we would just use the ratio of the ones which are sent out vs. the ones sent back.

So, the numbers

To date 1361 publications have been assigned to community members for curation. 597 are finished and are either in the main PomBase database or are currently being checked by the PomBase curators. That's a response rate of 43.8%.

So it's always the first date sent out (things which are sent out multiple times are just reminders).
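As an aside, the ratio described above can be sketched in a few lines. This is only an illustration: the function name `response_rate` is hypothetical, and truncating to one decimal place is an assumption (it happens to reproduce the quoted 43.8%, whereas rounding would give 43.9%).

```python
def response_rate(submitted: int, sent: int) -> float:
    """Percentage of sent sessions that came back, truncated to 1 decimal.

    Assumption: truncation (not rounding) is used for display.
    """
    if sent <= 0:
        raise ValueError("sent must be a positive number of sessions")
    # Integer arithmetic on tenths of a percent avoids float surprises.
    tenths = (1000 * submitted) // sent
    return tenths / 10


# Figures quoted from the stats page in this comment.
print(response_rate(597, 1361))  # 43.8
```

Each publication is counted once in `sent`, at its first send date, so reminders change neither number.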

I envisage that the graph will look like this:

20180130_145240_resized

i.e. it goes up continually but very slowly.

I keep it going up by sending out enough reminders to sustain an increase. I don't send out too many at once as we would be swamped...

Eventually it will plateau when we are just left with the people who will never do any. We are a long way from that yet.... I'm still getting lots of "sorry I will do it" and a good uptake when I send reminders, even for old sessions...

ValWood commented 6 years ago

y axis is %

ValWood commented 6 years ago

I might be wrong because I don't know what the graph would look like at the start when the number of sessions was low! Actually I think it may begin at about 30%. Certainly for the past few years it has been going up slowly (this is partially due to the fact that the uptake on new papers is usually more immediate, it's old ones that are stagnating....)

ValWood commented 6 years ago

44.1%. .....we will get to 50% by the end of the year I'm sure.....

ValWood commented 6 years ago

44.3%.....

ValWood commented 6 years ago

It was 32% when I did this presentation: https://www.slideshare.net/ValerieWood/community-curation-at-pombase (I can't remember when, I think it was about 18 months ago)

ValWood commented 6 years ago

44.4%. I do wish I hadn't sent out so many reminders at once... I want it to stop..... No more until these dry up.....

ValWood commented 6 years ago

Hi @kimrutherford what's your question here. I should be able to describe better.

kimrutherford commented 6 years ago

Hi @kimrutherford what's your question here. I should be able to describe better.

I think this answers my question:

so it's always the first date sent out (things which are sent out multiple times are just reminders).

I think I mis-read it and then added "discuss".

ValWood commented 5 years ago

Will keep this open, it would be nice to see the cumulative increase on the stats page: https://curation.pombase.org/pombe/stats/annotation

kimrutherford commented 4 years ago

It would be nice to have a cumulative graph showing the growth over time eventually (the only way is up)

Is that true? If you sent out a bunch of sessions won't the response rate (temporarily) drop?

ValWood commented 4 years ago

The drop is usually only a fraction of a percentage point, so it won't show in the plot.
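To put a rough number on that, here is a small sketch of what happens to the rate when a batch of new sessions goes out. The starting figures (789 of 1578) are quoted elsewhere in this thread; the batch size of 10 is purely hypothetical.

```python
import math


def rate(submitted: int, sent: int) -> float:
    """Response rate as a percentage (no rounding)."""
    return 100.0 * submitted / sent


submitted, sent = 789, 1578   # figures quoted in this thread
batch = 10                    # hypothetical number of newly sent sessions

before = rate(submitted, sent)
after = rate(submitted, sent + batch)
drop = before - after

# Submissions needed to get back to the earlier rate of 50%.
target = 50.0
needed = math.ceil(target * (sent + batch) / 100) - submitted

print(f"drop: {drop:.2f} percentage points, submissions to recover: {needed}")
```

With these numbers the dip is about a third of a percentage point and takes five new submissions to recover.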

Screenshot 2020-02-27 at 10 28 53

if it ever dropped I would send out more reminders ;)

ValWood commented 4 years ago

Actually, that isn't the response rate graph, it's the other one (2B), they look similar.

I would upload it but I need to swap laptops and mail it to myself because I can't upload to GitHub on the other laptop. I really need to sort my environment!

kimrutherford commented 4 years ago

I've done some querying in Chado. I think the numbers don't match up with the 50% response rate shown in Canto because not all of the publications in Canto are exported to Chado. There are community sessions triaged as "Erratum" and "Wrong organism" for example which aren't exported.

I've made a new report "uncuratable publications with a community session" to help work this out: https://curation.pombase.org/pombe/view/list/uncuratable_publications_with_a_community_session?model=track

If a session is approved, the Canto details are exported to Chado regardless of the triage status.

This publication is an Erratum, but has an approved session: https://curation.pombase.org/pombe/view/object/pub/11918?model=track

Here are the numbers from Chado:

 year | submitted | sent_sessions | response_rate 
------+-----------+---------------+---------------
 2013 |        91 |           927 |           9.8
 2014 |       174 |          1055 |          16.4
 2015 |       260 |          1171 |          22.2
 2016 |       403 |          1280 |          31.4
 2017 |       502 |          1378 |          36.4
 2018 |       641 |          1475 |          43.4
 2019 |       771 |          1579 |          48.8
 2020 |       800 |          1593 |          50.2

Note to self, query with:

WITH counts AS (
  SELECT year,
    (SELECT COUNT(*)
     FROM pombase_publication_curation_summary
     WHERE canto_curator_role = 'community'
       AND canto_annotation_status IN ('NEEDS_APPROVAL', 'APPROVAL_IN_PROGRESS', 'APPROVED')
       AND canto_session_submitted_date IS NOT NULL
       AND canto_session_submitted_date <= (year || '-12-30')::date) AS submitted,
    (SELECT COUNT(*)
     FROM pombase_publication_curation_summary
     WHERE canto_curator_role = 'community'
       AND (canto_approved_date IS NOT NULL
            OR (canto_first_sent_to_curator_year IS NOT NULL
                AND canto_first_sent_to_curator_year <= year))) AS sent_sessions
  FROM generate_series(2013, (SELECT extract(YEAR FROM CURRENT_DATE))::integer) AS year
)
SELECT year, submitted, sent_sessions,
       trunc(100.0 * submitted / sent_sessions, 1) AS response_rate
FROM counts;

ValWood commented 4 years ago

Ah OK.

PMID:31579888 is the one which had 2 PMIDs. This ID will be deleted.

Some are methods papers. Occasionally people get annotations from methods papers. We want to class these as "methods" & "curated"

One day we need to sort the classification so the "publication type" and "curation status" are separate.

ValWood commented 4 years ago

I removed the sessions. I'm guessing we don't include any session that no longer exists? The numbers should not be affected much. There were similar numbers of "IN PROGRESS" and "APPROVED"

ValWood commented 4 years ago

Phew, I promise I did not "fix" this:

To date 1578 publications have been assigned to community members for curation. 789 are finished and are either in the main PomBase database or are currently being checked by the PomBase curators. That's a response rate of 50%.

It's still 50%!

kimrutherford commented 4 years ago

I removed the sessions.

Thanks.

I'm guessing we don't include any session that no longer exists?

Yep, they will disappear from Chado in tonight's load. I'll run that response rate query again tomorrow.

kimrutherford commented 4 years ago

The query seemed to update itself anyway a short while after I deleted the sessions ?

The response rate on the Canto stats page is queried straight from Canto's database. There is a delay of up to 10 minutes before changes show up because the page contents are cached for speed.

ValWood commented 4 years ago

good, so we are still at 50%

kimrutherford commented 4 years ago

The numbers almost match now:

 year | submitted | sent_sessions | response_rate
------+-----------+---------------+---------------
 2013 |        90 |           917 |           9.8
 2014 |       172 |          1042 |          16.5
 2015 |       258 |          1157 |          22.2
 2016 |       400 |          1265 |          31.6
 2017 |       497 |          1363 |          36.4
 2018 |       633 |          1460 |          43.3
 2019 |       760 |          1563 |          48.6
 2020 |       789 |          1577 |          50.0

ValWood commented 4 years ago

Removed next. Would be nice to add this visual to the stats page, but no urgency.

ValWood commented 3 years ago

All papers are triaged and assigned out up to yesterday so the response rate has dropped a little to 50.5% (it was 51% yesterday)

Anyway, this item is very non-urgent (it predated the CC paper and we included such a graph). I'm putting it as future. Should it be on the website tracker instead?

ValWood commented 9 months ago

53.9%, still increasing. It seems that this is largely done, so a graph could be added to this page: https://curation.pombase.org/pombe/stats/annotation

kimrutherford commented 7 months ago

Latest query result:

 year | submitted | sent_sessions | response_rate 
------+-----------+---------------+---------------
 2013 |        88 |          1233 |           7.1
 2014 |       169 |          1330 |          12.7
 2015 |       253 |          1430 |          17.6
 2016 |       392 |          1513 |          25.9
 2017 |       483 |          1593 |          30.3
 2018 |       615 |          1673 |          36.7
 2019 |       740 |          1748 |          42.3
 2020 |       862 |          1828 |          47.1
 2021 |       982 |          1929 |          50.9
 2022 |      1050 |          1990 |          52.7
 2023 |      1132 |          2072 |          54.6
 2024 |      1136 |          2083 |          54.5

kimrutherford commented 7 months ago

I had the query wrong and it was making a mess of the older sessions.

 year | submitted | sent_sessions | response_rate 
------+-----------+---------------+---------------
 2013 |        88 |           284 |          30.9
 2014 |       169 |           481 |          35.1
 2015 |       253 |           693 |          36.5
 2016 |       392 |           896 |          43.7
 2017 |       483 |          1088 |          44.3
 2018 |       615 |          1272 |          48.3
 2019 |       740 |          1448 |          51.1
 2020 |       862 |          1624 |            53
 2021 |       982 |          1817 |            54
 2022 |      1050 |          1930 |          54.4
 2023 |      1132 |          2068 |          54.7
 2024 |      1136 |          2081 |          54.5

kimrutherford commented 7 months ago

I've added a curation response rate graph. Hopefully it will be on the main site in the morning but I've just had to restart the load so we'll see.

In the meantime it's available on my desktop version: https://desktop.kmr.nz/curation_stats

image

kimrutherford commented 7 months ago

Hopefully it will be on the main site in the morning but I've just had to restart the load so we'll see.

The load finished after a few false starts. GitHub was returning errors when the load script was trying to check for the latest Mondo.

https://pombase.org/curation_stats

I had the query wrong and it was making a mess of the older sessions.

I'm still not 100% sure I have it right so I plan to check it again tomorrow after a good sleep. :-)

ValWood commented 7 months ago

Great! We are really flatlining. I'll get this going again when Pascal starts.

Can we make the graph start earlier? (2012)

Also, the early years in the graph don't match this one (30% is high for 2013). Is this definitely 1st submission, or 1st approval data?

Screenshot 2024-02-12 at 07 51 31

kimrutherford commented 7 months ago

Can we make the graph start earlier ? (2012)

Unfortunately the date stamps needed from Canto only go back to mid 2013.

is this definitely 1st submission, or 1st approval data?

It's calculated using the submitted date. It does that so that it matches the Canto stats page which uses the number of submitted sessions.

kimrutherford commented 7 months ago

I'm going to look at this again in the morning because I've just spotted another problem. Currently it counts submitted sessions up to a given year and then divides by sessions sent out up to the same year. But it's going to get this wrong for sessions that were submitted in a different year to the year they were sent out. There are quite a few of those. Whoops.
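For anyone following along, the cumulative calculation described above can be sketched on made-up records. Everything here is hypothetical (the `Session` type, the field names and the four sample sessions are not PomBase data); it only shows how a session sent in one year and submitted in a later one enters the denominator before it enters the numerator.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Session(NamedTuple):
    sent_year: int
    submitted_year: Optional[int]  # None if never submitted


# Hypothetical records, for illustration only.
sessions: List[Session] = [
    Session(2016, 2016),
    Session(2016, 2018),  # submitted two years after it was sent
    Session(2017, None),  # never submitted
    Session(2017, 2017),
]


def cumulative_counts(
    sessions: List[Session], years: Iterable[int]
) -> Dict[int, Tuple[int, int]]:
    """Map each year to (submitted up to that year, sent up to that year)."""
    counts = {}
    for year in years:
        sent = sum(1 for s in sessions if s.sent_year <= year)
        submitted = sum(
            1
            for s in sessions
            if s.submitted_year is not None and s.submitted_year <= year
        )
        counts[year] = (submitted, sent)
    return counts


counts = cumulative_counts(sessions, range(2016, 2019))
for year, (submitted, sent) in counts.items():
    print(year, submitted, sent, f"{100.0 * submitted / sent:.1f}%")
```

In this toy data the 2016 session that came back in 2018 pulls the 2016 and 2017 rates down and only lifts the 2018 one, which is the kind of cross-year effect in question.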

kimrutherford commented 7 months ago

Should the years in the graph be the year sent out or the year submitted? Or year approved?

ValWood commented 7 months ago

submitted I think (the gap between submission and 1st approval should be less than a week 90% of the time so these numbers should be very similar)

ValWood commented 7 months ago

Unfortunately the date stamps needed from Canto only go back to mid 2013.

OK, the numbers are definitely different from the curation paper graph.

kimrutherford commented 7 months ago

the numbers are definitely different from the curation paper graph

I think the graph from the paper might be wrong but let's have a chat about this on the next call.

I've double checked the query that generates the current graph and I think it's correct. But it could be that it's not asking the right question. https://pombase.org/curation_stats

kimrutherford commented 7 months ago

For Kim: find backup from December 2012 to add response rate for that year

kimrutherford commented 7 months ago

find backup from December 2012 to add response rate for that year

After a bit of digging, the response rate for 2012 was 91.6%

There were 12 community sessions sent out and 11 were submitted. Did you send them to people you knew would respond?

 year | submitted_for_approval_count | sent_or_accepted_count | response_rate 
------+------------------------------+------------------------+---------------
 2012 |                           11 |                     12 |          91.6
 2013 |                           90 |                    280 |          32.1
 2014 |                          171 |                    480 |          35.6
 2015 |                          255 |                    695 |          36.6
 2016 |                          392 |                    899 |          43.6
 2017 |                          483 |                   1092 |          44.2
 2018 |                          616 |                   1276 |          48.2
 2019 |                          745 |                   1452 |          51.3
 2020 |                          869 |                   1628 |          53.3
 2021 |                          990 |                   1821 |          54.3
 2022 |                         1058 |                   1934 |          54.7
 2023 |                         1141 |                   2072 |            55
 2024 |                         1144 |                   2085 |          54.8

ValWood commented 7 months ago

Yes, I think that was probably the pilot project sessions. I put them all through later as community curated (or we changed them to community curated), I don't quite remember. Maybe we begin with 2013 when we started properly

kimrutherford commented 7 months ago

I'll close this as it's getting long and I think it's done.