pombase / pombase-chado

PomBase code for accessing Chado
MIT License

stats page numbers query #587

Closed ValWood closed 3 years ago

ValWood commented 7 years ago

I wonder if we are still not capturing ALL annotations in the stats page.

In the canto logs here: https://curation.pombase.org/dumps/latest_build/logs/log.2016-09-24-13-40-13.annotation_counts_by_cv we have 210994 annotations

On the stats page we have around 200,000 total https://curation.pombase.org/pombe/stats/annotation so 10,000 less

So the Chado log numbers (if I understood correctly) only include ontology annotations, and so they exclude the genetic and physical interactions. If so, this would make the difference even larger.

The Chado log numbers DO include the GO IEA annotations, which are excluded from the stats page, but this is only around 3000 annotations. So I can't reconcile the numbers from the 2 sources...

fypoadmin commented 7 years ago

so 10,000 less

There are about 10,000 annotations that don't have dates. The log file includes those but the stats page doesn't. Does that explain it?

ValWood commented 7 years ago

yep!

ValWood commented 6 years ago

Annotations without a date could be given the earliest date as a proxy, I guess, so they are included in the total?

ValWood commented 6 years ago

Add a pre 2000 column which has everything which does not have a date.

kimrutherford commented 5 years ago

Just to double check, is this in the "Cumulative annotations by type and year" section of the stats page?

ValWood commented 5 years ago

That's it!

kimrutherford commented 5 years ago

There are about 10,000 annotations that don't have dates

11218 currently. Here's the breakdown:

 cat_act                              |     2
 subunit_composition                  |     5
 external_link                        |     6
 m_f_g                                |    16
 pathway                              |    18
 complementation                      |    22
 PSI-MOD                              |    25
 misc                                 |    33
 gene_ex                              |    37
 interacts_physically                 |    54
 name_description                     |    62
 genome_org                           |    78
 interacts_genetically                |   158
 fission_yeast_phenotype              |   226
 warning                              |   286
 species_dist                         |   753
 sequence                             |   787
 EC numbers                           |   835
 disease_ontology                     |  1283
 PomBase family or domain             |  1394
 PomBase gene characterisation status |  5138

Are there any that should be included in the graphs?

It looks like this with the undated annotations added: cumulative-annotations-1

Maybe the graph should start at 2004 like this: cumulative-annotations-2
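As a quick sanity check on the breakdown above, the per-type counts can be summed and compared with the quoted total of 11218. This is only a sketch: the rows are pasted from this comment, and `parse_counts` is a hypothetical helper name, not something from the PomBase code.

```python
# Sketch: sum the per-type counts pasted from the breakdown above.
BREAKDOWN = """
 cat_act                              |     2
 subunit_composition                  |     5
 external_link                        |     6
 m_f_g                                |    16
 pathway                              |    18
 complementation                      |    22
 PSI-MOD                              |    25
 misc                                 |    33
 gene_ex                              |    37
 interacts_physically                 |    54
 name_description                     |    62
 genome_org                           |    78
 interacts_genetically                |   158
 fission_yeast_phenotype              |   226
 warning                              |   286
 species_dist                         |   753
 sequence                             |   787
 EC numbers                           |   835
 disease_ontology                     |  1283
 PomBase family or domain             |  1394
 PomBase gene characterisation status |  5138
"""


def parse_counts(text):
    """Parse 'name | count' rows into a dict (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines
        name, _, value = line.rpartition("|")
        counts[name.strip()] = int(value)
    return counts


counts = parse_counts(BREAKDOWN)
total = sum(counts.values())
print(len(counts), "annotation types,", total, "undated annotations")
```

The rows do add up to 11218, so the breakdown accounts for every undated annotation.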

ValWood commented 5 years ago

Yes, I think starting the graph in 2004 is sensible

ValWood commented 5 years ago

For the stats I think we can ignore these, but could you put them in a curation tracker ticket so we can delete them if appropriate, or figure out what they should be?

cat_act | 2
subunit_composition | 5
external_link | 6
pathway | 18

For these, I suggest we just put the year before the start of the use of "date"? above, so 1999?

complementation | 22
PSI-MOD | 25
m_f_g | 16
misc | 33
gene_ex | 37
~interacts_physically | 54 WHAT IS THIS ONE? WE DONT DO THESE IN ART?~
name_description | 62
genome_org | 78
~interacts_genetically | 158 WHAT IS THIS ONE? WE DONT DO THESE IN ART?~
~fission_yeast_phenotype | 226~
warning | 286
species_dist | 753
~sequence | 787 WHAT IS THIS ONE?~
EC numbers | 835
~disease_ontology | 1283 WE HAVE A SEPARATE TABLE WITH THESE RIGHT? SO THESE WOULD BE INCLUDED TWICE ID COUNTED HERE?~
PomBase family or domain | 1394
PomBase gene characterisation status | 5138

kimrutherford commented 5 years ago

For these, I suggest we just put the year before the start of the use of "date"? above, so 1999?

Do you mean put "1999" instead of "<2000" under the bar?

kimrutherford commented 5 years ago

for the stats I think we can ignore these

I've changed the code to ignore the types you listed.

interacts_physically | 54 WHAT IS THIS ONE? WE DONT DO THESE IN ART?
interacts_genetically | 158 WHAT IS THIS ONE? WE DONT DO THESE IN ART?

It looks like these come from two publications, PMID:19111658 and PMID:25795664 which are loaded from flat files from Subversion:

pombe-embl/external_data/interactions/PMID_19111658_interactions.tab2.txt
pombe-embl/external_data/interactions/PMID_25795664_scored_interactions.tab2.txt

The files are in the BioGRID format which doesn't have a column for annotation date.

sequence | 787 WHAT IS THIS ONE?

That's the sequence ontology from annotations like:

FT                   /controlled_curation="term=sequence feature, N-terminal
FT                   signal sequence; qualifier=predicted; cv=sequence_feature;
FT                   date=19700101"

Should we filter those out?
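To make the structure of that qualifier concrete, here is a minimal Python sketch that joins the wrapped FT lines and splits the `/controlled_curation` value into key/value pairs. The function names are hypothetical, and treating `19700101` as "no real date" is an assumption based on this annotation being counted among the undated ones; this is not the actual loader code.

```python
import re
from datetime import date, datetime

# Assumption: 19700101 is a placeholder meaning "no real date".
PLACEHOLDER = date(1970, 1, 1)


def parse_controlled_curation(ft_lines):
    """Join wrapped FT lines and split the qualifier value into a dict."""
    # Drop the leading "FT" and indentation, then join continuation lines.
    joined = " ".join(re.sub(r"^FT\s+", "", line).strip() for line in ft_lines)
    match = re.search(r'/controlled_curation="([^"]*)"', joined)
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # ignore fragments without "key=value" shape
            fields[key] = value.strip()
    return fields


def annotation_date(fields):
    """Return the annotation date, or None if missing or a placeholder."""
    raw = fields.get("date")
    if raw is None:
        return None
    parsed = datetime.strptime(raw, "%Y%m%d").date()
    return None if parsed == PLACEHOLDER else parsed


example = [
    'FT                   /controlled_curation="term=sequence feature, N-terminal',
    "FT                   signal sequence; qualifier=predicted; cv=sequence_feature;",
    'FT                   date=19700101"',
]
fields = parse_controlled_curation(example)
print(fields["cv"], annotation_date(fields))
```

Under that assumption the example annotation parses cleanly but comes back with no usable date, which would explain why it lands in the undated bucket.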

disease_ontology | 1283 WE HAVE A SEPARATE TABLE WITH THESE RIGHT? SO THESE WOULD BE INCLUDED TWICE ID COUNTED HERE?

They are only included once in the Cumulative annotations numbers.

mah11 commented 5 years ago

interacts_physically | 54
interacts_genetically | 158

It looks like these come from two publications, PMID:19111658 and PMID:25795664 which are loaded from flat files from Subversion:

pombe-embl/external_data/interactions/PMID_19111658_interactions.tab2.txt
pombe-embl/external_data/interactions/PMID_25795664_scored_interactions.tab2.txt

In that case, we can assign them the dates the files were committed in svn. (I leave the question of technical feasibility to Kim :P )

kimrutherford commented 5 years ago

In that case, we can assign them the dates the files were committed in svn. (I leave the question of technical feasibility to Kim :P )

I'll work something out.

Yes, I think starting the graph in 2004 is sensible

I'll do that.

ValWood commented 5 years ago

For these, I suggest we just put the year before the start of the use of "date"? above, so 1999?

Do you mean put "1999" instead of "<2000" under the bar?

No. Sorry, I am confusing things here. I meant lump all of the remaining non-date-stamped annotations in the contig files at an early date. Maybe the PomBase start date (2010), because none should post-date this. Would that be terrible?
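The effect of lumping undated annotations at a proxy year on the cumulative totals can be sketched like this. The function name and the toy data are made up for illustration; only the 2010 proxy year comes from the suggestion above.

```python
from collections import Counter
from itertools import accumulate

PROXY_YEAR = 2010  # the PomBase start date suggested above


def cumulative_by_year(years, proxy_year=PROXY_YEAR):
    """Map each year to a running total; None (no date) counts as proxy_year."""
    per_year = Counter(proxy_year if year is None else year for year in years)
    ordered = sorted(per_year)
    running = accumulate(per_year[year] for year in ordered)
    return dict(zip(ordered, running))


# Made-up toy data, not real PomBase counts: two annotations have no date.
toy_years = [2008, 2009, None, None, 2011, 2011, 2012]
totals = cumulative_by_year(toy_years)
print(totals)
```

With this approach the final total includes every annotation, and the undated ones show up as a step in the proxy year rather than being dropped.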

ValWood commented 5 years ago

That's the sequence ontology from annotations like:

FT                   /controlled_curation="term=sequence feature, N-terminal
FT                   signal sequence; qualifier=predicted; cv=sequence_feature;
FT                   date=19700101"

Should we filter those out?

They are valid and we still use them so we should keep them in.

ValWood commented 5 years ago

Actually, do we display those protein sequence features? Maybe we don't. I haven't seen one for ages. We should...

ValWood commented 5 years ago

disease_ontology | 1283 WE HAVE A SEPARATE TABLE WITH THESE RIGHT? SO THESE WOULD BE INCLUDED TWICE ID COUNTED HERE?

They are only included once in the Cumulative annotations numbers.

But basically the ones in Artemis can be ignored, because we now store them externally. Correct? If this is the case we could probably delete these from Artemis, if I can remember how.....

mah11 commented 5 years ago

Actually do we display those protein sequence features?

They get converted to SO annotations based on a mapping file I maintain in svn. The example you pasted turns into "SO:0000418 signal_peptide ISS" on the gene page.

ValWood commented 5 years ago

I forgot this. @Antonialock you can get pombe signal peptides this way. Might be useful for a FungiDB comparative exercise.

kimrutherford commented 5 years ago

In that case, we can assign them the dates the files were committed in svn.

Getting the date from SVN is proving difficult because I can't access SVN. So for now I've changed the loader to use the date stamp from the file.
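For illustration, reading a date stamp from a file might look like the sketch below. It assumes "date stamp" means the file's modification time, which the comment above does not spell out, and `file_date_stamp` is a hypothetical name rather than the loader's real function.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_date_stamp(path):
    """Return the file's modification date (UTC) as YYYY-MM-DD."""
    mtime = os.path.getmtime(path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")


# Demo on a throwaway file with a known modification time.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
os.utime(demo_path, (1420070400, 1420070400))  # 2015-01-01 00:00:00 UTC
print(file_date_stamp(demo_path))
os.remove(demo_path)
```

One caveat with this approach: the modification time of a working copy can reflect when the file was checked out rather than when it was committed, so it is only a rough proxy for the commit date.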

mah11 commented 5 years ago

In that case, we can assign them the dates the [interaction] files were committed in svn.

Getting the date from SVN is proving difficult because I can't access SVN. So for now I've changed the loader to use the date stamp from the file.

I am such a hopeless pack-rat that I have saved enough log info for this:

ValWood commented 5 years ago

Impressive! Saving it is one thing, but I still can't figure out how you always manage to find the damn stuff.....

kimrutherford commented 5 years ago

I have saved enough log info for this:

Thanks. I've added those dates to the load script.

kimrutherford commented 4 years ago

This issue might be solved.

Yes, I think starting the graph in 2004 is sensible

That was done a while ago.

Thanks. I've added those dates to the load script.

The interactions that were missing a date now use Midori's saved dates.

But basically the ones in Artemis can be ignored, because we now store them externally. Correct? If this is the case we could probably delete these from Artemis, if I can remember how.....

Are there still disease annotations in the contig files?

Here's the breakdown:

Here's an updated table of annotations without dates:

 count |           annotation_type            
-------+--------------------------------------
     2 | cat_act
     5 | subunit_composition
     6 | external_link
    16 | m_f_g
    18 | pathway
    22 | complementation
    25 | PSI-MOD
    34 | misc
    37 | gene_ex
    62 | name_description
    78 | genome_org
   209 | fission_yeast_phenotype
   284 | warning
   748 | species_dist
   787 | sequence
   833 | EC numbers
  1242 | disease_ontology
  1394 | PomBase family or domain
  5136 | PomBase gene characterisation status

Do we need to investigate why there are fission_yeast_phenotype and disease_ontology annotations without dates?
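To see what changed between the earlier breakdown and the updated table, the two sets of counts can be compared directly. This is a sketch with the numbers copied from the two tables in this thread; `diff_counts` is a hypothetical helper name.

```python
# Counts copied from the earlier breakdown in this thread.
earlier = {
    "cat_act": 2, "subunit_composition": 5, "external_link": 6,
    "m_f_g": 16, "pathway": 18, "complementation": 22, "PSI-MOD": 25,
    "misc": 33, "gene_ex": 37, "interacts_physically": 54,
    "name_description": 62, "genome_org": 78, "interacts_genetically": 158,
    "fission_yeast_phenotype": 226, "warning": 286, "species_dist": 753,
    "sequence": 787, "EC numbers": 835, "disease_ontology": 1283,
    "PomBase family or domain": 1394,
    "PomBase gene characterisation status": 5138,
}

# Counts copied from the updated table above.
later = {
    "cat_act": 2, "subunit_composition": 5, "external_link": 6,
    "m_f_g": 16, "pathway": 18, "complementation": 22, "PSI-MOD": 25,
    "misc": 34, "gene_ex": 37, "name_description": 62, "genome_org": 78,
    "fission_yeast_phenotype": 209, "warning": 284, "species_dist": 748,
    "sequence": 787, "EC numbers": 833, "disease_ontology": 1242,
    "PomBase family or domain": 1394,
    "PomBase gene characterisation status": 5136,
}


def diff_counts(old, new):
    """Return {type: (old, new)} for every type whose count changed."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, 0), new.get(name, 0)
        if before != after:
            changed[name] = (before, after)
    return changed


changes = diff_counts(earlier, later)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
print("total:", sum(earlier.values()), "->", sum(later.values()))
```

The total drops from 11218 to 10938. Most of that comes from the two interaction types, which no longer appear now that they have dates, plus smaller decreases in fission_yeast_phenotype and disease_ontology.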

ValWood commented 4 years ago

Do we need to investigate why there are fission_yeast_phenotype and disease_ontology annotations without dates?

Which are the phenotype annotations?

I am guessing the disease associations will all get fixed when we migrate to Mondo. What is the source of the dateless ones?

kimrutherford commented 4 years ago

Which are the phenotype annotations?

I guess they come from the contig files? I haven't dug into it yet.

I am guessing the disease associations will all get fixed when we migrate to Mondo.

I think so. We should check again after that's done.

What is the source of the dateless ones?

Maybe from MalaCards?

kimrutherford commented 4 years ago

Which are the phenotype annotations?

I guess they come from the contig files? I haven't dug into it yet.

I checked, and all the phenotype annotations with no date come from the contig files.

ValWood commented 4 years ago

OK all makes sense.

ValWood commented 4 years ago

I checked, and all the phenotype annotations with no date come from the contig files.

OK, I thought they had a date field. When I annotate the paper I delete them in Art and remake them in Canto so that they are editable. They should all disappear eventually...

kimrutherford commented 4 years ago

I wonder if they are all the same phenotype?

Mostly just two. Here are the names and counts of the terms in the annotations:

                               name                               | count 
------------------------------------------------------------------+-------
 inviable vegetative cell population                              |    48
 mutator                                                          |     1
 abolished DNA damage checkpoint override in response to caffeine |     1
 sensitive to methyl methanesulfonate                             |     1
 elongated multiseptate vegetative cell                           |     1
 viable vegetative cell population                                |   157

ValWood commented 4 years ago

Oh, right, these came from lots of different papers. Me and Jacky dug them all out because we wanted to compare the deletions to all the known viability data. This means I did them some time in 2009..... We can look at these again in a year or so; hopefully most will have disappeared...

ValWood commented 4 years ago

Not urgent, so assigned to myself to summarize.

ValWood commented 3 years ago

closing...