pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Check and log species distribution typos #705

Closed kimrutherford closed 2 years ago

kimrutherford commented 5 years ago

Typos in the species distribution CV annotations need to be logged.

We'll need to configure a list of the valid term names. Here's the current number of annotations that use each term name:

                  name                  | count 
----------------------------------------+-------
 conserved in archaea                   |   300
 conserved in bacteria                  |  1121
 conserved in eukaryotes                |  4547
 conserved in eukaryotes only           |  2587
 conserved in eukaryoties only          |     1
 conserved in fungi                     |  4640
 conserved in fungi only                |   548
 conserved in metazoa                   |  3549
 conserved in metozoa                   |     4
 conserved in vertebrates               |  3531
 conserved in vertebtates               |     1
 faster evolving duplicate              |    22
 human CETN2 ortholog                   |     1
 no apparent S. cerevisiae ortholog     |   604
 orthologs cannot be distinguished      |   109
 predominantly single copy (one to one) |  3118
 Schizosaccharomyces pombe specific     |   148
 Schizosaccharomyces specific           |   219

ValWood commented 5 years ago

fixed except human CETN2 ortholog

looking into this anyway because chromosome1.contig:FT CETN2 ortholog; date=20170902" chromosome3.contig:FT /controlled_curation="term=human CETN1 and CETN2 and CETN3

ValWood commented 5 years ago

Could you re-run this query so that I can check that all are fixed? The ticket can stay as low priority for now.

kimrutherford commented 5 years ago

This is from the 2019-01-23 nightly load:

                  name                  | count 
----------------------------------------+-------
 conserved in archaea                   |   300
 conserved in bacteria                  |  1121
 conserved in eukaryotes                |  4546
 conserved in eukaryotes only           |  2588
 conserved in fungi                     |  4640
 conserved in fungi only                |   547
 conserved in metazoa                   |  3554
 conserved in vertebrates               |  3533
 faster evolving duplicate              |    22
 metazoa                                |     1
 no apparent S. cerevisiae ortholog     |   604
 orthologs cannot be distinguished      |   109
 predominantly single copy (one to one) |  3117
 Schizosaccharomyces pombe specific     |   152
 Schizosaccharomyces specific           |   224
 vertebrates                            |     1
(16 rows)

ValWood commented 5 years ago

I fixed the odd ones. I'll check again in a few months if the check isn't implemented by then.

ValWood commented 5 years ago

Could you re-run this query for me so I can check that no errors crept in...

kimrutherford commented 5 years ago

> Could you re-run this query for me so I can check that no errors crept in...

Looking good:

                  name                  | count 
----------------------------------------+-------
 conserved in archaea                   |   300
 conserved in bacteria                  |  1119
 conserved in eukaryotes                |  4545
 conserved in eukaryotes only           |  2590
 conserved in fungi                     |  4639
 conserved in fungi only                |   545
 conserved in metazoa                   |  3557
 conserved in vertebrates               |  3537
 faster evolving duplicate              |    22
 no apparent S. cerevisiae ortholog     |   610
 orthologs cannot be distinguished      |   104
 predominantly single copy (one to one) |  3118
 Schizosaccharomyces pombe specific     |   152
 Schizosaccharomyces specific           |   224
(14 rows)

ValWood commented 5 years ago

Probably because I hardly changed anything....

ok, this can remain on back-burner

ValWood commented 3 years ago

@kimrutherford could you rerun this for me to see if I need to do any fixes?

kimrutherford commented 3 years ago

                  name                  | count 
----------------------------------------+-------
 conserved in archaea                   |   299
 conserved in bacteria                  |  1119
 conserved in eukaryotes                |  4544
 conserved in eukaryotes only           |  2592
 conserved in fungi                     |  4640
 conserved in fungi only                |   542
 conserved in metazoa                   |  3560
 conserved in vertebrates               |  3540
 faster evolving duplicate              |    22
 no apparent S. cerevisiae ortholog     |   608
 orthologs cannot be distinguished      |   104
 predominantly single copy (one to one) |  3122
 Schizosaccharomyces pombe specific     |   153
 Schizosaccharomyces specific           |   224
(14 rows)

Note to self:

SELECT t.name,
       count(fc.feature_cvterm_id)
FROM cvterm t
JOIN feature_cvterm fc ON fc.cvterm_id = t.cvterm_id
JOIN cv ON t.cv_id = cv.cv_id
WHERE cv.name = 'species_dist'
GROUP BY t.name ORDER BY t.name;

ValWood commented 3 years ago

Great, I'll move this to future. I should usually spot anomalies, and these hardly change.

kimrutherford commented 2 years ago

I've added a check for this in the logs. Look out for a log file ending in .species_dist_term_name_typos.
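
For anyone curious what such a check might look like, here is a minimal sketch in Python. It is not the actual PomBase implementation, which isn't shown in this thread. The names `VALID_TERM_NAMES` and `find_term_name_typos` are hypothetical. The valid list is assumed to be the 14 names from the most recent query output above, and the sample counts are taken from the first table in this issue. Each term name returned by the query is compared with the configured list, and any name not on it is reported along with the closest valid name, if there is one.

```python
from difflib import get_close_matches

# Hypothetical configured list of valid species_dist term names,
# copied from the latest query output in this thread.
VALID_TERM_NAMES = [
    "conserved in archaea",
    "conserved in bacteria",
    "conserved in eukaryotes",
    "conserved in eukaryotes only",
    "conserved in fungi",
    "conserved in fungi only",
    "conserved in metazoa",
    "conserved in vertebrates",
    "faster evolving duplicate",
    "no apparent S. cerevisiae ortholog",
    "orthologs cannot be distinguished",
    "predominantly single copy (one to one)",
    "Schizosaccharomyces pombe specific",
    "Schizosaccharomyces specific",
]


def find_term_name_typos(term_counts, valid_names=VALID_TERM_NAMES):
    """Return (name, count, suggestion) for every name not in valid_names.

    term_counts maps a term name to its annotation count, i.e. the rows
    returned by the species_dist query. suggestion is the closest valid
    name, or None when nothing is similar enough.
    """
    valid = set(valid_names)
    problems = []
    for name, count in sorted(term_counts.items()):
        if name in valid:
            continue
        close = get_close_matches(name, valid_names, n=1)
        problems.append((name, count, close[0] if close else None))
    return problems


# A few rows from the first table in this issue.
counts = {
    "conserved in metazoa": 3549,
    "conserved in metozoa": 4,
    "conserved in vertebrates": 3531,
    "conserved in vertebtates": 1,
    "human CETN2 ortholog": 1,
}

for name, count, suggestion in find_term_name_typos(counts):
    print(f"{name}\t{count}\tsuggested: {suggestion}")
```

An exact-match check against a configured list catches both misspellings and unexpected free-text terms such as "human CETN2 ortholog". The closest-match suggestion is only a convenience for whoever reads the log.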