pombase / website

PomBase website v2
MIT License

total annotation metric #1080

Closed ValWood closed 5 years ago

ValWood commented 5 years ago

To help to explain the different data-types https://github.com/pombase/pombase-chado/issues/708

we need to add another graph to the metrics page:

we have "Annotations/genes for low throughput publication in 5-year intervals"

but we also need annotations/genes for high-throughput (or total annotations) publications in 5-year intervals.

It will be much easier to explain the difference in growth if small scale and large scale can be viewed side by side.

kimrutherford commented 5 years ago

Does this need doing before the next grant application? I just want to be clear that is what the label "before next grant" means rather than "before next grant starts".

ValWood commented 5 years ago

Before the application (the main one in April, not the preliminary one in January).

kimrutherford commented 5 years ago

Now that the annotations in Chado (almost) all have a HTP/LTP flag I've been able to add a HTP annotation graph. I haven't worked out how to put LTP and HTP side by side on the same graph yet so it's separate for now.

The average annotations per publication is lower in the new LTP annotation graph.

I think there are two reasons. The main one is that the current graph on the stats page only includes annotations from Canto and assumes all Canto annotations are LTP. The new LTP graph below includes annotations from the contig files. Maybe including the contig file annotations would lower the numbers?

The second reason is that some interaction annotations from Canto have been merged during the load with identical annotations from BioGRID. BioGRID has some of them as HTP so they have been moved to the HTP graph.

I'm still checking the numbers so this isn't on the main site yet:

[Graph: per-pub-annotations-1]

mah11 commented 5 years ago

Maybe including the contig file annotations would lower the numbers?

I bet it would, because that would add a whole bunch of publications that have annotations, but that we would now regard as incompletely curated. Almost all of the papers in the contig files get more annotations when we revisit them in Canto.

ValWood commented 5 years ago

I'm impressed that the drop was not larger (it is a substantial drop but it isn't an order of magnitude!)

Including contig data will only lower the pre-2010 number though, so that's OK (i.e. the current numbers will stay the same).

Interesting that HTP is now dropping. I guess most of the genome-wide experiments that will give annotatable results were done over 5 years ago (deletion collection etc.), and now there are lots of conditional screens giving smaller result sets. I thought this would be the case, but not so clearly, since I thought some expression sets would swamp this effect out (maybe most of those are also over 5 years old now too! ...time flies)

Note in the graph labelling we should make it clear that this is "genes providing annotation, per publication" (more genes would often be included in the experiment).

mah11 commented 5 years ago

I thought some expression sets would swamp this effect out

also, maybe more recent genome-wide work is generating browser tracks rather than Canto-able data - it wouldn't surprise me if that's where the expression datasets are going

kimrutherford commented 5 years ago

Including contig data will only lower the pre-2010 number though, so that's OK (i.e. the current numbers will stay the same).

The 2017-2019 dropped quite a lot too.

ValWood commented 5 years ago

I'm only thinking about the qualitative/quantitative expression....

ValWood commented 5 years ago

The 2017-2019 dropped quite a lot too.

Ah I thought you meant you still needed to include the contig data in the figures above?

kimrutherford commented 5 years ago

Ah I thought you meant you still needed to include the contig data in the figures above?

The graphs include all contig, Canto and GOA annotation. I've fixed a few HTP vs LTP issues so I'll see what the graphs look like tomorrow morning.

ValWood commented 5 years ago

OK, looks fine then. I expected a big drop because if a paper was in Canto and had HTP it was classed as HTP (I think)

kimrutherford commented 5 years ago

OK, looks fine then. I expected a big drop because if a paper was in Canto and had HTP it was classed as HTP (I think)

The current graphs on the Canto pages show everything from Canto. Because the annotations are now marked as LTP or HTP in Chado we're able to take the HTP Canto annotations out of the LTP graph. It's only quite a small number of HTP annotations from Canto though, all of which are interactions.

kimrutherford commented 5 years ago

I've fixed a few HTP vs LTP issues so I'll see what the graphs look like tomorrow morning.

I forgot the code to add a throughput flag to all the contig file annotations; it was only doing FYPO and GO. That's fixed for tomorrow's load.

Unfortunately I broke last night's load with yesterday's changes. I've fixed the problems and it works for me locally so I think tonight's load should be fine.

kimrutherford commented 5 years ago

I think almost all the annotations from all the sources now have a throughput flag (high, low or non-experimental).

The last holdouts are 153 species_dist annotations that I can't find. I'm not too worried as I was only going to flag them as "non-experimental", so they won't be included in the stats.

kimrutherford commented 5 years ago

The split HTP/LTP annotations per paper graphs are now on the main site at the bottom of the stats page: https://curation.pombase.org/pombe/stats/annotation

There are now only 179 out of 364,500 annotations that don't have a throughput type attached to them in Chado. I'll chip away at those but I think the graphs are accurate now.

kimrutherford commented 5 years ago

I've edited the paragraph above the graphs to mention the HTP/LTP split. Let me know what changes you'd like or edit it directly here: https://github.com/pombase/canto/blob/master/root/stats/annotation.mhtml#L438

This is what I've changed the text to:

"Mean number of manually curated annotations and genes annotated in PomBase per peer-reviewed paper in 5-year intervals. Annotations are split into separate graphs for annotations from low-throughput experiments and high-throughput experiments."
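For anyone wanting to reproduce the numbers behind that caption, here is a minimal sketch of the per-paper means in 5-year intervals. It is not the code used for the stats page: the record layout `(pmid, year, gene)`, the function names, and the choice to start intervals on years divisible by 5 are all assumptions made for illustration. To get the LTP or HTP graph, the records would be filtered by throughput before calling it.

```python
from collections import defaultdict


def interval_start(year, width=5):
    """First year of the interval containing `year` (assumed bin alignment)."""
    return year - year % width


def per_paper_means(records, width=5):
    """Mean annotations and mean distinct genes per paper, per interval.

    `records` is an iterable of (pmid, publication_year, gene) tuples,
    one tuple per annotation. This layout is hypothetical.
    Returns {interval_start: (mean_annotations, mean_genes)}.
    """
    annotation_count = defaultdict(int)  # (interval, pmid) -> annotations
    genes = defaultdict(set)             # (interval, pmid) -> distinct genes
    for pmid, year, gene in records:
        key = (interval_start(year, width), pmid)
        annotation_count[key] += 1
        genes[key].add(gene)

    # Collect one (annotations, genes) row per paper in each interval.
    rows_by_interval = defaultdict(list)
    for (start, pmid), n in annotation_count.items():
        rows_by_interval[start].append((n, len(genes[(start, pmid)])))

    return {
        start: (
            sum(n for n, _ in rows) / len(rows),
            sum(g for _, g in rows) / len(rows),
        )
        for start, rows in sorted(rows_by_interval.items())
    }
```

Only papers with at least one annotation in the input appear in the denominator, so the result depends on which annotations are passed in.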

ValWood commented 5 years ago

Average low-throughput annotations per publication:

Average high-throughput annotations per publication:

Is also not quite right. This isn't across ALL publications.

So it's Average low-throughput annotations per publication from publications which have any LTP annotation (and the same for HTP).

Can we think of a succinct way to say this?

ValWood commented 5 years ago

Average annotations per publication, publications containing low-throughput data ?

ValWood commented 5 years ago

we should also check if the LTP pre 1970 is really LTP!

kimrutherford commented 5 years ago

This isn't across ALL publications.

So it's Average low-throughput annotations per publication from publications which have any LTP annotation (and the same for HTP).

Currently the graphs show the averages for all publications that have any annotation.

Do we need to exclude annotations from HTP-only publications from both graphs?
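To make the difference between the two denominators concrete, here is a small sketch. It assumes annotations arrive as `(pmid, throughput)` pairs with throughput values "LTP" or "HTP"; the function name and that layout are made up for illustration and are not taken from the Canto or Chado code.

```python
def throughput_means(annotations, throughput="LTP"):
    """Compare two ways of averaging annotations per publication.

    `annotations` is an iterable of (pmid, throughput) pairs (hypothetical
    layout). Returns a tuple:
      (mean over all publications that have any annotation,
       mean over publications that have at least one `throughput` annotation)
    """
    all_pubs = set()
    counts = {}  # pmid -> number of annotations of the requested throughput
    for pmid, kind in annotations:
        all_pubs.add(pmid)
        if kind == throughput:
            counts[pmid] = counts.get(pmid, 0) + 1

    total = sum(counts.values())
    mean_all = total / len(all_pubs) if all_pubs else 0.0
    mean_matching = total / len(counts) if counts else 0.0
    return mean_all, mean_matching
```

With one paper holding 4 LTP annotations, one holding 2 LTP and 1 HTP, and one HTP-only paper, the LTP mean is 2.0 across all annotated papers but 3.0 across papers that have any LTP annotation, so HTP-only papers pull the first figure down.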

ValWood commented 5 years ago

Ah OK then the graphs are correct if across all publications... will have a think... Keep as it is for now... I'm off today.

ValWood commented 5 years ago

Any thoughts anyone? Are the current graphs the best way to show this?

@kimrutherford which publications are pre 2000 HTP data?

kimrutherford commented 5 years ago

which publications are pre 2000 HTP data?

They are all phenotype annotations. The ones I checked are from PHAF files in SVN so they might change to LTP when we split the PHAF directory into LTP and HTP sections.

     pmid      | annotation_count 
---------------+------------------
 PMID:8390662  |               41
 PMID:9003295  |               71
 PMID:9649519  |              211
 PMID:8065904  |               45
 PMID:166019   |               44
 PMID:8665408  |               39
 PMID:10449724 |               79
 PMID:1332977  |               56
 PMID:10079327 |               68
 PMID:9658208  |              147
 PMID:1315954  |              145
 PMID:8663159  |               37
 PMID:2657742  |               54
 PMID:7969124  |               29
 PMID:9917066  |               39
 PMID:9563836  |              179
 PMID:8382769  |               17

ValWood commented 5 years ago

Most seem to be correctly classified; even the early ones are "genome-wide screens".

kimrutherford commented 5 years ago

Most seem to be correctly classified; even the early ones are "genome-wide screens".

Now that the PHAF files are split into LTP and HTP directories, all of the early phenotype annotations are classed as LTP. So the HTP graph has changed a bit:

[Graph: ltp-htp-graphs-1]

ValWood commented 5 years ago

Interesting!

ValWood commented 5 years ago

Can we clearly label the complementary "genes" bar chart above these in the stats page as "LTP"?

kimrutherford commented 5 years ago

Can we clearly label the complementary "genes" bar chart above these in the stats page as "LTP"?

Is this too wordy?: "Average genes from low throughput experiments per publication"

ValWood commented 5 years ago

I think it's OK.

Maybe Average genes from low throughput experiments/per publication

Makes it clearer how it translates graphically?

kimrutherford commented 5 years ago

Average genes from low throughput experiments/per publication

OK, I've changed it to that.

ValWood commented 5 years ago

One thing I still need is the number of papers with HTP data. So the total identified as HTP here, and I'll add this to the "sequence browser hosted" ones. It would be useful to have this number in the stats list (at least the ones in this graph).

I will use the number through Canto for LTP.

kimrutherford commented 5 years ago

One thing I still need is the number of papers with HTP data.

Here are the counts of HTP files that we load:

quantitative gene ex: 5
qualitative gene ex: 1
modifications: 6
phenotype: 41
total: 53

I also queried Chado for any publication that has any HTP annotations. That didn't work out well. There are a bunch of publications where just a handful of annotations are marked as HTP. An example is: https://www.pombase.org/reference/PMID:21436456 all the interactions from that publication are marked as LTP in Chado except the interaction with the "Two-hybrid" evidence code. That interaction comes from BioGRID and they've marked it as HTP. I'm not sure it makes sense to count that paper as a HTP paper.

So instead I did some querying for publications with at least a certain number of HTP annotations. There are 86 publications in Chado with at least 50 HTP annotations. And there are 66 with at least 100 HTP annotations.
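The threshold approach can be sketched as below. This is an illustration only, not the Chado query that produced the 86 and 66 figures: it assumes the annotations have already been exported as `(pmid, throughput)` pairs, and the function name is hypothetical.

```python
from collections import Counter


def pubs_with_min_htp(annotations, minimum):
    """Return the sorted PMIDs that have at least `minimum` HTP annotations.

    `annotations` is an iterable of (pmid, throughput) pairs
    (hypothetical export layout, throughput is "HTP" or "LTP").
    """
    htp_counts = Counter(
        pmid for pmid, throughput in annotations if throughput == "HTP"
    )
    return sorted(pmid for pmid, n in htp_counts.items() if n >= minimum)
```

Raising `minimum` drops papers where only a handful of annotations carry the HTP flag, which is the case described above for interactions that BioGRID marks as HTP in an otherwise LTP paper.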

ValWood commented 5 years ago

Do you have any more examples like this? https://www.pombase.org/reference/PMID:21436456 I will ask BioGRID to fix.

I think for our stats for the grant 53 is a good number.

kimrutherford commented 5 years ago

Do you have any more examples like this?

PMID:24497846 has a bunch of LTP annotations and only two HTP annotations.

PMID:15809031 has 17 interactions but only three HTP interactions

PMID:21767457 has four interactions, two LTP, one HTP, and one is both.

ValWood commented 5 years ago

These might be correct. PMID:24497846 has a genetic interaction screen, but the screen may have identified only a couple of results. Difficult to know without looking in detail. For our purposes I think we would classify it as LTP though...

ValWood commented 5 years ago

I think we can close this and open new tickets. I scanned and I can't see anything outstanding.