Closed ValWood closed 3 years ago
so 10,000 less
There are about 10,000 annotations that don't have dates. The log file includes those but the stats page doesn't. Does that explain it?
yep!
Annotations without a date could be given the earliest date as a proxy, I guess, so that they are included in the total?
Add a pre-2000 column containing everything that does not have a date.
Just to double check, is this in the "Cumulative annotations by type and year" section of the stats page?
That's it!
There are about 10,000 annotations that don't have dates
11218 currently. Here's the breakdown:
cat_act | 2
subunit_composition | 5
external_link | 6
m_f_g | 16
pathway | 18
complementation | 22
PSI-MOD | 25
misc | 33
gene_ex | 37
interacts_physically | 54
name_description | 62
genome_org | 78
interacts_genetically | 158
fission_yeast_phenotype | 226
warning | 286
species_dist | 753
sequence | 787
EC numbers | 835
disease_ontology | 1283
PomBase family or domain | 1394
PomBase gene characterisation status | 5138
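A breakdown like the one above amounts to grouping the undated annotations by type. A minimal Python sketch, assuming each annotation is a dict with hypothetical `type` and `date` keys (the real numbers come from the database, whose schema is not shown in this thread):

```python
from collections import Counter

def undated_counts_by_type(annotations):
    """Count annotations whose date is missing, grouped by annotation type."""
    counts = Counter(a["type"] for a in annotations if not a.get("date"))
    # Sort ascending by count (then by name) to match the table layout above.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))

# Made-up sample records, for illustration only.
sample = [
    {"type": "cat_act", "date": None},
    {"type": "cat_act", "date": None},
    {"type": "pathway", "date": "2012-03-01"},
    {"type": "warning", "date": None},
]

for name, count in undated_counts_by_type(sample):
    print(f"{name} | {count}")
```

The dated `pathway` record is skipped, so only the two undated types appear in the output.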
Are there any that should be included in the graphs?
It looks like this with the undated annotations added:
Maybe the graph should start at 2004 like this:
Yes, I think starting the graph in 2004 is sensible
For the stats I think we can ignore these, but could you put them in a curation tracker ticket so we can delete them if appropriate, or figure out what they should be?
cat_act | 2
subunit_composition | 5
external_link | 6
pathway | 18
For these, I suggest we just put the year before the start of the use of "date"? above, so 1999?
complementation | 22
PSI-MOD | 25
m_f_g | 16
misc | 33
gene_ex | 37
~interacts_physically | 54 WHAT IS THIS ONE? WE DONT DO THESE IN ART?~
name_description | 62
genome_org | 78
~interacts_genetically | 158 WHAT IS THIS ONE? WE DONT DO THESE IN ART?~
~fission_yeast_phenotype | 226~
warning | 286
species_dist | 753
~sequence | 787 WHAT IS THIS ONE?~
EC numbers | 835
~disease_ontology | 1283 WE HAVE A SEPARATE TABLE WITH THESE RIGHT? SO THESE WOULD BE INCLUDED TWICE IF COUNTED HERE?~
PomBase family or domain | 1394
PomBase gene characterisation status | 5138
For these, I suggest we just put the year before the start of the use of "date"? above, so 1999?
Do you mean put "1999" instead of "<2000" under the bar?
for the stats I think we can ignore these
I've changed the code to ignore the types you listed.
interacts_physically | 54 WHAT IS THIS ONE? WE DONT DO THESE IN ART?
interacts_genetically | 158 WHAT IS THIS ONE? WE DONT DO THESE IN ART?
It looks like these come from two publications, PMID:19111658 and PMID:25795664 which are loaded from flat files from Subversion:
pombe-embl/external_data/interactions/PMID_19111658_interactions.tab2.txt
pombe-embl/external_data/interactions/PMID_25795664_scored_interactions.tab2.txt
The files are in the BioGRID format which doesn't have a column for annotation date.
sequence | 787 WHAT IS THIS ONE?
That's the sequence ontology from annotations like:
FT /controlled_curation="term=sequence feature, N-terminal
FT signal sequence; qualifier=predicted; cv=sequence_feature;
FT date=19700101"
Should we filter those out?
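For reference, a qualifier like the one above can be split into fields and its date checked. This is only a sketch: it assumes the `FT` line prefixes and line wrapping have already been removed, and that `19700101` is a placeholder meaning "no real date" (an assumption based on these annotations being counted as undated). The function names are hypothetical, not taken from the loader.

```python
import re
from datetime import date

# Assumption: 19700101 marks an annotation with no real date.
PLACEHOLDER_DATE = date(1970, 1, 1)

def parse_controlled_curation(value):
    """Split a /controlled_curation value into a dict of key=value fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.partition("=")
        if sep:
            # Collapse any whitespace left over from line wrapping.
            fields[key.strip()] = re.sub(r"\s+", " ", val.strip())
    return fields

def annotation_date(fields):
    """Return the date field as a date, or None if absent or a placeholder."""
    raw = fields.get("date")
    if raw is None:
        return None
    parsed = date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))
    return None if parsed == PLACEHOLDER_DATE else parsed

value = ("term=sequence feature, N-terminal signal sequence; "
         "qualifier=predicted; cv=sequence_feature; date=19700101")
fields = parse_controlled_curation(value)
print(fields["cv"], annotation_date(fields))
```

Treating the placeholder as "undated" keeps these annotations in the data while letting the stats code decide separately whether to plot them.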
disease_ontology | 1283 WE HAVE A SEPARATE TABLE WITH THESE RIGHT? SO THESE WOULD BE INCLUDED TWICE IF COUNTED HERE?
They are only included once in the Cumulative annotations numbers.
interacts_physically | 54
interacts_genetically | 158
It looks like these come from two publications, PMID:19111658 and PMID:25795664 which are loaded from flat files from Subversion:
pombe-embl/external_data/interactions/PMID_19111658_interactions.tab2.txt
pombe-embl/external_data/interactions/PMID_25795664_scored_interactions.tab2.txt
In that case, we can assign them the dates the files were committed in svn. (I leave the question of technical feasibility to Kim :P )
In that case, we can assign them the dates the files were committed in svn. (I leave the question of technical feasibility to Kim :P )
I'll work something out.
Yes, I think starting the graph in 2004 is sensible
I'll do that.
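One way to compute the cumulative numbers for a graph that starts in 2004 is sketched below in Python. It assumes one year (or `None` for undated) per annotation, skips undated annotations, and folds anything dated before 2004 into the first plotted year so the running total stays correct. Whether the stats page handles pre-2004 annotations this way is an assumption, and `cumulative_by_year` is a hypothetical name.

```python
from collections import Counter
from itertools import accumulate

def cumulative_by_year(years, start_year=2004, end_year=None):
    """Cumulative annotation counts per year, from start_year onward.

    `years` holds one entry per annotation: an int year, or None if undated.
    Undated annotations are skipped; dated ones before start_year are folded
    into the first plotted year.
    """
    dated = [y for y in years if y is not None]
    if not dated:
        return []
    end_year = end_year if end_year is not None else max(dated)
    per_year = Counter(max(y, start_year) for y in dated)
    axis = range(start_year, end_year + 1)
    totals = accumulate(per_year.get(y, 0) for y in axis)
    return list(zip(axis, totals))

# Made-up sample: two pre-2004 annotations, three later ones, two undated.
sample_years = [2001, 2003, 2004, 2004, 2006, None, None]
print(cumulative_by_year(sample_years))
```

With the sample input, the two undated entries do not contribute to any bar.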
For these, I suggest we just put the year before the start of the use of "date"? above, so 1999?
Do you mean put "1999" instead of "<2000" under the bar?
No. Sorry, I am confusing things here. I meant lump all of the non-date-stamped annotations remaining in the contig files at an early date. Maybe the PomBase start date (2010), because none should post-date this. Would that be terrible?
That's the sequence ontology from annotations like:
FT /controlled_curation="term=sequence feature, N-terminal
FT signal sequence; qualifier=predicted; cv=sequence_feature;
FT date=19700101"
Should we filter those out?
They are valid and we still use them so we should keep them in.
Actually, do we display those protein sequence features? Maybe we don't. I haven't seen one for ages. We should...
disease_ontology | 1283 WE HAVE A SEPARATE TABLE WITH THESE RIGHT? SO THESE WOULD BE INCLUDED TWICE IF COUNTED HERE?
They are only included once in the Cumulative annotations numbers.
But basically the ones in Artemis can be ignored, because we now store them externally. Correct? If this is the case we could probably delete these from Artemis, if I can remember how...
Actually do we display those protein sequence features?
They get converted to SO annotations based on a mapping file I maintain in svn. The example you pasted turns into "SO:0000418 signal_peptide ISS" on the gene page.
I forgot this. @Antonialock you can get pombe signal peptides this way. Might be useful for a FungiDB comparative exercise.
In that case, we can assign them the dates the files were committed in svn.
Getting the date from SVN is proving difficult because I can't access SVN. So for now I've changed the loader to use the date stamp from the file.
In that case, we can assign them the dates the [interaction] files were committed in svn.
Getting the date from SVN is proving difficult because I can't access SVN. So for now I've changed the loader to use the date stamp from the file.
I am such a hopeless pack-rat that I have saved enough log info for this:
PMID:19111658 was in r5078 on 2018-09-21.
For PMID:25795664 I would use 2016-08-05 (r3562) because that's when I first put these interactions into any file. They didn't go live until we put them into a more correct BioGRID format on 2018-08-01 (r4941), but they really got curated on the earlier date.
Impressive! Saving it is one thing, but I still can't figure out how you always manage to find the damn stuff...
I have saved enough log info for this:
Thanks. I've added those dates to the load script.
This issue might be solved.
Yes, I think starting the graph in 2004 is sensible
That was done a while ago.
Thanks. I've added those dates to the load script.
The interactions that were missing a date now use Midori's saved dates.
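The approach described above (fixed dates per publication, with the file's date stamp as a fallback) could look roughly like this in Python. The dates and revision numbers are the ones quoted earlier in this thread; the function name is hypothetical, reading "date stamp from the file" as the file modification time is an assumption, and the actual load script may be structured quite differently.

```python
import os
from datetime import date, datetime

# Dates from the saved svn log details quoted in this thread.
INTERACTION_DATES = {
    "PMID:19111658": date(2018, 9, 21),  # r5078
    "PMID:25795664": date(2016, 8, 5),   # r3562, when first curated
}

def interaction_annotation_date(pmid, file_path=None):
    """Pick a date for interactions loaded from a dateless BioGRID file."""
    if pmid in INTERACTION_DATES:
        return INTERACTION_DATES[pmid]
    if file_path is not None and os.path.exists(file_path):
        # Fallback (assumption): use the file's modification time stamp.
        return datetime.fromtimestamp(os.path.getmtime(file_path)).date()
    return None

print(interaction_annotation_date("PMID:25795664"))
```

Keeping the known dates in one mapping means a newly added dateless file still gets a fallback date rather than ending up in the undated counts.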
But basically the ones in Artemis can be ignored, because we now store them externally. Correct? If this is the case we could probably delete these from Artemis, if I can remember how...
Are there still disease annotations in the contig files?
Here's the breakdown:
Here's an updated table of annotations without dates:
count | annotation_type
-------+--------------------------------------
2 | cat_act
5 | subunit_composition
6 | external_link
16 | m_f_g
18 | pathway
22 | complementation
25 | PSI-MOD
34 | misc
37 | gene_ex
62 | name_description
78 | genome_org
209 | fission_yeast_phenotype
284 | warning
748 | species_dist
787 | sequence
833 | EC numbers
1242 | disease_ontology
1394 | PomBase family or domain
5136 | PomBase gene characterisation status
Do we need to investigate why there are fission_yeast_phenotype and disease_ontology annotations without dates?
Do we need to investigate why there are fission_yeast_phenotype and disease_ontology annotations without dates?
Which are the phenotype annotations?
I am guessing the disease associations will all get fixed when we migrate to Mondo. What is the source of the dateless ones?
Which are the phenotype annotations?
I guess they come from the contig files? I haven't dug into it yet.
I am guessing the disease associations will all get fixed when we migrate to Mondo.
I think so. We should check again after that's done.
What is the source of the dateless ones?
Maybe from MalaCards?
Which are the phenotype annotations?
I guess they come from the contig files? I haven't dug into it yet.
I checked and all the phenotype annotations with no date come from the contig files.
OK all makes sense.
I checked and all the phenotype annotations with no date come from the contig files.
OK, I thought they had a date field. When I annotate the paper I delete them in Art and remake them in Canto so that they are editable. They should all disappear eventually...
I wonder if they are all the same phenotype?
Mostly just two. Here are the names and counts of the terms in the annotations:
name | count
------------------------------------------------------------------+-------
inviable vegetative cell population | 48
mutator | 1
abolished DNA damage checkpoint override in response to caffeine | 1
sensitive to methyl methanesulfonate | 1
elongated multiseptate vegetative cell | 1
viable vegetative cell population | 157
Oh, right, these came from lots of different papers. Jacky and I dug them all out because we wanted to compare the deletions to all the known viability data. This means I did them some time in 2009... We can look at these again in a year or so; hopefully most will have disappeared...
Not urgent, so assigned to myself to summarize.
closing...
I wonder if we are still not capturing ALL annotation in the stats page.
In the Canto logs here: https://curation.pombase.org/dumps/latest_build/logs/log.2016-09-24-13-40-13.annotation_counts_by_cv we have 210994 annotations.
On the stats page (https://curation.pombase.org/pombe/stats/annotation) we have around 200,000 in total, so about 10,000 fewer.
So the Chado log numbers (if I understood correctly) only include ontology annotations, and so they exclude the genetic and physical interactions. If so, this would make the difference even larger.
The Chado log numbers DO include the GO IEA annotations, which are excluded from the stats page, but this is only around 3000 annotations. So I can't reconcile the numbers from the 2 sources...
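The arithmetic behind this comparison, as a short Python sketch. The log total is the figure quoted above; the stats page total and the GO IEA count are the rounded estimates from this comment, so the result is only indicative.

```python
log_total = 210_994    # from the annotation_counts_by_cv log
stats_total = 200_000  # approximate ("around 200,000")
go_iea = 3_000         # approximate; in the log but not on the stats page

def unexplained_gap(log_total, stats_total, excluded_from_stats):
    """Annotations in the log that the stats page does not account for."""
    return log_total - excluded_from_stats - stats_total

gap = unexplained_gap(log_total, stats_total, go_iea)
print(gap)
```

If the log really does omit the genetic and physical interactions while the stats page includes them, the true gap would be larger than this figure.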