Closed by ValWood 5 years ago
Does this need doing before the next grant application? I just want to be clear that is what the label "before next grant" means rather than "before next grant starts".
Before the application (the main one in April, not the preliminary one in January).
Now that the annotations in Chado (almost) all have a HTP/LTP flag I've been able to add a HTP annotation graph. I haven't worked out how to put LTP and HTP side by side on the same graph yet so it's separate for now.
The average annotations per publication is lower in the new LTP annotation graph.
I think there are two reasons. The main one is that the current graph on the stats page only includes annotations from Canto and assumes all Canto annotations are LTP. The new LTP graph below includes annotations from the contig files. Maybe including the contig file annotations would lower the numbers?
The second reason is that some interaction annotations from Canto have been merged during the load with identical annotations from BioGRID. BioGRID has some of them as HTP so they have been moved to the HTP graph.
I'm still checking the numbers so this isn't on the main site yet:
> Maybe including the contig file annotations would lower the numbers?
I bet it would, because that would add a whole bunch of publications that have annotations, but that we would now regard as incompletely curated. Almost all of the papers in the contig files get more annotations when we revisit them in Canto.
I'm impressed that the drop was not larger (it is a substantial drop but it isn't an order of magnitude!)
Including contig data will only lower the pre-2010 number though so that's OK (i.e. the current numbers will stay the same).
Interesting that HTP is now dropping. I guess most of the genome-wide experiments that will give annotatable results were done over 5 years ago (deletion collection etc.), and now there are lots of conditional screens giving smaller result sets. I thought this would be the case, but not so clearly, since I thought some expression sets would swamp this effect out (maybe most of those are also over 5 years old now too! Time flies).
Note in the graph labelling we should make it clear that this is "genes providing annotation, per publication" (more genes would often be included in the experiment).
> I thought some expression sets would swamp this effect out
also, maybe more recent genome-wide work is generating browser tracks rather than Canto-able data - it wouldn't surprise me if that's where the expression datasets are going
> Including contig data will only lower the pre-2010 number though so that's OK (i.e. the current numbers will stay the same).
The 2017-2019 dropped quite a lot too.
I'm only thinking about the qualitative/quantitative expression...
> The 2017-2019 dropped quite a lot too.
Ah I thought you meant you still needed to include the contig data in the figures above?
The graphs include all contig, Canto and GOA annotation. I've fixed a few HTP vs LTP issues so I'll see what the graphs look like tomorrow morning.
OK, looks fine then. I expected a big drop because if a paper was in Canto and had HTP it was classed as HTP (I think).
The current graphs on the Canto pages show everything from Canto. Because the annotations are now marked as LTP or HTP in Chado we're able to take the HTP Canto annotations out of the LTP graph. It's only quite a small number of HTP annotations from Canto though, all of which are interactions.
> I've fixed a few HTP vs LTP issues so I'll see what the graphs look like tomorrow morning.
I forgot the code to add a throughput flag to all the contig file annotations; it was only doing FYPO and GO. That's fixed for tomorrow's load.
Unfortunately I broke last night's load with yesterday's changes. I've fixed the problems and it works for me locally so I think tonight's load should be fine.
I think almost all the annotations from all the sources now have a throughput flag (high, low or non-experimental).
The last holdouts are 153 species_dist annotations that I can't find. I'm not too worried as I was only going to flag them as "non-experimental", so they won't be included in the stats.
The split HTP/LTP annotations per paper graphs are now on the main site at the bottom of the stats page: https://curation.pombase.org/pombe/stats/annotation
There are now only 179 out of 364,500 annotations that don't have a throughput type attached to them in Chado. I'll chip away at those but I think the graphs are accurate now.
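As a quick check of how complete the flagging is, the two figures quoted above can be turned into a coverage fraction. This is only arithmetic on the numbers in this comment; the variable names are made up for illustration.

```python
# Figures quoted in the comment above.
unflagged = 179
total = 364_500

# Fraction of annotations that do have a throughput type attached.
flagged_fraction = (total - unflagged) / total
print(f"{flagged_fraction:.3%} of annotations have a throughput type")
```

So the unflagged annotations are well under a tenth of a percent of the total.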
I've edited the paragraph above the graphs to mention the HTP/LTP split. Let me know what changes you'd like or edit it directly here: https://github.com/pombase/canto/blob/master/root/stats/annotation.mhtml#L438
This is what I've changed the text to:
"Mean number of manually curated annotations and genes annotated in PomBase per peer-reviewed paper in 5-year intervals. Annotations are split into separate graphs for annotations from low-throughput experiments and high-throughput experiments."
The graph titles "Average low-throughput annotations per publication" and "Average high-throughput annotations per publication" are also not quite right. This isn't across ALL publications.
So it's Average low-throughput annotations per publication from publications which have any LTP annotation (and the same for HTP).
Can we think of a succinct way to say this?
"Average annotations per publication, publications containing low-throughput data"?
we should also check if the LTP pre 1970 is really LTP!
> This isn't across ALL publications.
> So it's Average low-throughput annotations per publication from publications which have any LTP annotation (and the same for HTP).
Currently the graphs show the averages for all publications that have any annotation.
Do we need to exclude annotations from HTP-only publications from both graphs?
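To make the two possible denominators concrete, here is a minimal Python sketch. The record layout (`pmid`, `year`, `throughput`), the sample data and the function names are all hypothetical; this is not the Chado schema or the actual Canto stats code, and the 5-year bins are assumed to start on years divisible by 5.

```python
from collections import defaultdict

# Hypothetical annotation records: (pmid, publication year, throughput flag).
# Made-up values for illustration only, not real PomBase data.
ANNOTATIONS = [
    ("PMID:A", 2011, "low"),
    ("PMID:A", 2011, "low"),
    ("PMID:A", 2011, "high"),
    ("PMID:B", 2012, "high"),
    ("PMID:B", 2012, "high"),
    ("PMID:C", 2016, "low"),
]


def bin_start(year, width=5):
    """Return the first year of the interval containing `year` (assumed binning)."""
    return year - year % width


def mean_per_publication(annotations, throughput, denominator="any"):
    """Mean `throughput` annotations per publication for each 5-year bin.

    denominator="any":  divide by publications with any annotation at all.
    denominator="same": divide by publications with at least one annotation
                        of the requested throughput type.
    """
    counts = defaultdict(int)  # bin -> number of matching annotations
    pubs = defaultdict(set)    # bin -> publications counted in the denominator
    for pmid, year, flag in annotations:
        b = bin_start(year)
        if flag == throughput:
            counts[b] += 1
        if denominator == "any" or flag == throughput:
            pubs[b].add(pmid)
    return {b: counts[b] / len(pubs[b]) for b in sorted(pubs)}


ltp_any = mean_per_publication(ANNOTATIONS, "low", "any")
ltp_same = mean_per_publication(ANNOTATIONS, "low", "same")
print(ltp_any)   # {2010: 1.0, 2015: 1.0}
print(ltp_same)  # {2010: 2.0, 2015: 1.0}
```

In the 2010 bin the HTP-only paper `PMID:B` halves the LTP average when every annotated publication is counted, which is the effect the question about excluding HTP-only publications is getting at.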
Ah OK then the graphs are correct if across all publications... will have a think... Keep as it is for now... I'm off today.
Any thoughts anyone? Are the current graphs the best way to show this?
@kimrutherford which publications are pre-2000 HTP data?
They are all phenotype annotations. The ones I checked are from PHAF files in SVN so they might change to LTP when we split the PHAF directory into LTP and HTP sections.
pmid          | annotation_count
--------------+-----------------
PMID:8390662  | 41
PMID:9003295  | 71
PMID:9649519  | 211
PMID:8065904  | 45
PMID:166019   | 44
PMID:8665408  | 39
PMID:10449724 | 79
PMID:1332977  | 56
PMID:10079327 | 68
PMID:9658208  | 147
PMID:1315954  | 145
PMID:8663159  | 37
PMID:2657742  | 54
PMID:7969124  | 29
PMID:9917066  | 39
PMID:9563836  | 179
PMID:8382769  | 17
Most seem to be correctly classified; even the early ones are "genome-wide screens".
Now that the PHAF files are split into LTP and HTP directories, all of the early phenotype annotations are now classed as LTP. So the HTP graph has changed a bit:
Interesting!
Can we clearly label the complementary "genes" bar chart above these in the stats page as "LTP"?
Is this too wordy?: "Average genes from low throughput experiments per publication"
I think it's OK.
Maybe Average genes from low throughput experiments/per publication
Makes it clearer how it translates graphically?
> Average genes from low throughput experiments/per publication
OK, I've changed it to that.
One thing I still need is the number of papers with HTP data. That is, the total identified as HTP here, which I'll add to the "sequence browser hosted" ones. It would be useful to have this number in the stats list (at least the ones in this graph).
I will use the number through Canto for LTP.
> One thing I still need is the number of papers with HTP data.
Here are the counts of HTP files that we load:
quantitative gene ex: 5
qualitative gene ex: 1
modifications: 6
phenotype: 41
total: 53
I also queried Chado for any publication that has any HTP annotations. That didn't work out well. There are a bunch of publications where just a handful of annotations are marked as HTP. An example is https://www.pombase.org/reference/PMID:21436456 where all the interactions from that publication are marked as LTP in Chado except the interaction with the "Two-hybrid" evidence code. That interaction comes from BioGRID and they've marked it as HTP. I'm not sure it makes sense to count that paper as a HTP paper.
So instead I did some querying for publications with at least a certain number of HTP annotations. There are 86 publications in Chado with at least 50 HTP annotations. And there are 66 with at least 100 HTP annotations.
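The thresholding described above can be sketched in a few lines of Python. The record layout, the sample data and the function name `htp_publications` are assumptions for illustration; the real counts came from queries against Chado, which are not shown in this thread.

```python
from collections import Counter

# Hypothetical (pmid, throughput flag) pairs; made-up data, not from Chado.
ANNOTATIONS = (
    [("PMID:X", "high")] * 120
    + [("PMID:Y", "high")] * 60
    + [("PMID:Z", "low")] * 16
    + [("PMID:Z", "high")] * 1  # a single HTP interaction among LTP ones
)


def htp_publications(annotations, min_htp=1):
    """Return the sorted publications with at least `min_htp` HTP annotations."""
    htp_counts = Counter(pmid for pmid, flag in annotations if flag == "high")
    return sorted(pmid for pmid, n in htp_counts.items() if n >= min_htp)


for threshold in (1, 50, 100):
    print(threshold, htp_publications(ANNOTATIONS, threshold))
```

With a threshold of 1 the mostly-LTP paper `PMID:Z` is counted as HTP because of its single HTP interaction; raising the threshold drops it, which is why a minimum count gives a more sensible number of "HTP papers" than "any HTP annotation".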
Do you have any more examples like this? https://www.pombase.org/reference/PMID:21436456 I will ask BioGRID to fix them.
I think for our stats for the grant 53 is a good number.
> Do you have any more examples like this?
PMID:24497846 has a bunch of LTP annotations and only two HTP annotations.
PMID:15809031 has 17 interactions but only three HTP interactions
PMID:21767457 has four interactions, two LTP, one HTP, and one is both.
These might be correct. PMID:24497846 has a genetic interaction screen, but the screen may have identified only a couple of results. It's difficult to know without looking in detail. For our purposes I think we would classify it as LTP though.
I think we can close this and open new tickets. I scanned and I can't see anything outstanding.
To help to explain the different data-types https://github.com/pombase/pombase-chado/issues/708
we need to add another graph to the metrics page:
we have "Annotations/genes for low throughput publication in 5-year intervals"
but we also need "Annotations/genes for high-throughput publications (or total annotations) in 5-year intervals".
It will be much easier to explain the difference in growth if small scale and large scale can be viewed side by side.