tarekmed / gbif-ecat

Automatically exported from code.google.com/p/gbif-ecat

Many taxa duplicated in new release #103

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
E.g. many entire genera (with their species) occur twice. This is a new problem 
with the latest release. Here is a sampling:

Pezizella
Phyllachora
Sphaeropsis
Mycena
Nemertes
Bremia
Helophorus
Notophthiracarus
Phacus

but I suspect there are dozens of them. (My list of GBIF-related suspect 
homonyms in the opentree taxonomy has 1600 entries, but some of those could be 
due to bugs in my code.)

Original issue reported on code.google.com by jonathan...@gmail.com on 17 Jul 2013 at 2:58

GoogleCodeExporter commented 8 years ago
Thanks Jonathan. I was sure we had too many homonyms in our taxonomy due to 
extensive numbers in IRMNG, but I was not aware of some that come from 
different sources. Pezizella, for example, comes from both the Catalogue of 
Life and Index Fungorum, and the two records differ only in their order and 
family names. That should not have happened.

Btw, check out our new species pages in the upcoming new portal:
http://uat.gbif.org/species/5952828
http://uat.gbif.org/species/7245629

Do you have any good ideas on how to discover real homonyms apart from keeping 
a manual list as we do with IRMNG? We do not make use of authorship so far, and 
in some of the examples you listed this would have told us clearly that it is 
the same name. But in general it's pretty tough to deal with irregular 
authorship spellings, so we decided not to rely on that.

Also, when we cannot decide, GBIF prefers to keep duplicate taxa rather than 
merge 2 real ones into a single taxon. A false merge causes more trouble for us 
than a false duplicate.

Original comment by wixner@gmail.com on 19 Jul 2013 at 12:49

GoogleCodeExporter commented 8 years ago

Original comment by wixner@gmail.com on 19 Jul 2013 at 12:50

GoogleCodeExporter commented 8 years ago
Pezizella is listed as a homonym in IRMNG, and Index Fungorum lists it as one too:

Pezizella P. Karsten, 1872 for Thelebolus
GENUS SYNONYM from IRMNG Homonym List
Fungi Ascomycota Leotiomycetes Thelebolales Thelebolaceae

Pezizella Fuckel, 1870 for Calycina Nees ex Gray, 1821
GENUS SYNONYM from IRMNG Homonym List
Fungi Ascomycota Leotiomycetes Helotiales Hyaloscyphaceae Calycina

See http://bit.ly/1axaLhs

I am just surprised to see both of them accepted in our backbone, as (most) 
sources list them as synonyms.

Original comment by wixner@gmail.com on 19 Jul 2013 at 2:08

GoogleCodeExporter commented 8 years ago
Re "Do you have any good ideas how to discover real homonyms" - the question 
is, when do two taxon records (from different sources, or not) refer to the 
same taxon, and when do they not. I spent quite a bit of time on this question, 
and my code for deciding this is based on what names occur in the ancestor 
chains of each, and in the sets of descendants of each. Overlap among 
descendants is a pretty good (but not 100% reliable) indicator, while having 
the parent of one occur in the ancestor chain of the other is also pretty good. 
The code is on GitHub but is pretty much uncommented, so it probably wouldn't 
be of much use.
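
To make the two heuristics above concrete, here is a minimal Python sketch. It 
is not the uncommented code mentioned above; the Taxon class, the function 
names, and the 0.5 overlap threshold are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Taxon:
    """Hypothetical in-memory taxon node; not taken from any real codebase."""
    name: str
    parent: Taxon | None = None
    children: list[Taxon] = field(default_factory=list)

    def add_child(self, name: str) -> Taxon:
        child = Taxon(name, parent=self)
        self.children.append(child)
        return child


def ancestor_names(taxon: Taxon) -> set[str]:
    """Names of all proper ancestors, from the parent up to the root."""
    names = set()
    node = taxon.parent
    while node is not None:
        names.add(node.name)
        node = node.parent
    return names


def descendant_names(taxon: Taxon) -> set[str]:
    """Names of all descendants at any depth."""
    names = set()
    stack = list(taxon.children)
    while stack:
        node = stack.pop()
        names.add(node.name)
        stack.extend(node.children)
    return names


def probably_same_taxon(a: Taxon, b: Taxon, min_overlap: float = 0.5) -> bool:
    """Guess whether two same-named records refer to one taxon.

    Heuristic 1: enough overlap among descendant names.
    Heuristic 2: the parent of one occurs in the ancestor chain of the other.
    The threshold is an arbitrary assumption, not a tuned value.
    """
    if a.name != b.name:
        return False
    desc_a, desc_b = descendant_names(a), descendant_names(b)
    if desc_a and desc_b:
        overlap = len(desc_a & desc_b) / min(len(desc_a), len(desc_b))
        if overlap >= min_overlap:
            return True
    if a.parent is not None and a.parent.name in ancestor_names(b):
        return True
    if b.parent is not None and b.parent.name in ancestor_names(a):
        return True
    return False
```

Neither signal is conclusive on its own, so a sketch like this is probably 
best used to flag candidate pairs for review rather than to merge them 
automatically.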

I understand about false matches being much more dangerous than false 
mismatches; we have the same policy. However, what's notable about the homonyms 
I'm reporting here is that (a) they did not exist in the previous version of 
your taxonomy and (b) they are extreme, in the sense that a genus is duplicated 
together with many of its species (e.g. 20 species in Phyllachora are all 
duplicated together). This could be due either to changes in the source 
taxonomies or to a newly introduced bug in your merge algorithm. In either case 
I would think the repair is to make your taxon identity detector more 
intelligent.
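
The pattern in (b), a genus duplicated along with its species, could be 
screened for with something like the Python sketch below. The flat record 
layout (id, rank, name, parent_id) and the function name are assumptions for 
illustration, not the actual format of the GBIF backbone export.

```python
from collections import defaultdict
from itertools import combinations


def duplicated_genera(records, min_shared=1):
    """Report genus names that occur more than once with shared species.

    `records` is an iterable of dicts with the hypothetical keys
    "id", "rank", "name" and "parent_id". Returns a dict mapping each
    suspicious genus name to the set of species names that appear under
    at least two records of that genus.
    """
    genus_ids_by_name = defaultdict(list)
    species_by_genus_id = defaultdict(set)
    for rec in records:
        if rec["rank"] == "genus":
            genus_ids_by_name[rec["name"]].append(rec["id"])
        elif rec["rank"] == "species":
            species_by_genus_id[rec["parent_id"]].add(rec["name"])

    report = {}
    for name, ids in genus_ids_by_name.items():
        # Compare every pair of genus records that share a name.
        for a, b in combinations(ids, 2):
            shared = species_by_genus_id[a] & species_by_genus_id[b]
            if len(shared) >= min_shared:
                report.setdefault(name, set()).update(shared)
    return report
```

Raising min_shared would restrict the report to the extreme cases, where many 
species are duplicated together, and leave out genuine homonyms that share no 
species names.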

If you'd like to work together I'd be very happy to!

Original comment by jonathan...@gmail.com on 19 Jul 2013 at 3:26

GoogleCodeExporter commented 8 years ago
Tony has just published a new 3.1 version of IRMNG complete and the homonym 
list. Importing it now, going to see if things have changed.

Original comment by wixner@gmail.com on 29 Aug 2013 at 3:29