Open roger-mahler opened 1 year ago
Sample duplicate data from sead_production_201912
:
author_name = '(L.)', genus_name = 'Elaphrus', species = 'riparius'
select taxon_id, t.date_updated, author_id, genus_id, author_name, genus_name, species, bt.bugs_trace_id
from tbl_taxa_tree_master t
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
left join bugs_import.bugs_trace as bt
on bt.sead_table = 'tbl_taxa_tree_master'
and bt.sead_reference_id = taxon_id
where author_name = '(L.)'
and genus_name = 'Elaphrus'
and species = 'riparius'
taxon_id | date_updated | author_id | genus_id | bugs_trace_id | comment |
---|---|---|---|---|---|
28965 | 2014-04-17 06:02:04 | 3103 | 12656 | existing record | |
40204 | 2019-12-20 23:26:18 | 3103 | 12656 | 150283 | added by import |
Imported data seems to be assigned to existing taxon:
select *
from tbl_abundances
join tbl_analysis_entities using (analysis_entity_id)
join tbl_datasets using (dataset_id)
where taxon_id in (28965, 40204)
No item assigned to 40204.
Same data from BugsCEP:
select *
from "INDEX"
where "AUTHORITY" = '(L.)'
and "GENUS" = 'Elaphrus'
and "SPECIES" = 'riparius'
CODE | FAMILY | GENUS | SPECIES | AUTHORITY |
---|---|---|---|---|
1.0120030000000000 | CARABIDAE | Elaphrus | riparius | (L.) |
This specie has a synonym:
select *
from "TSynonym"
where "SynAuthority" = '(L.)'
and "SynGenus" = 'Elaphrus'
and "SynSpecies" = 'riparius'
CODE | SynGenus | SynSpecies | SynAuthority | Ref | Notes | SynonymCODE |
---|---|---|---|---|---|---|
1.0120045000000000 | Elaphrus | riparius | (L.) | Lindroth 1985 | regards this as a synonym. | 31 |
select *
from "INDEX"
where "CODE" in ('1.0120030000000000', '1.0120045000000000')
CODE | FAMILY | GENUS | SPECIES | AUTHORITY |
---|---|---|---|---|
1.0120030000000000 | CARABIDAE | Elaphrus | riparius | (L.) |
1.0120045000000000 | CARABIDAE | Elaphrus | tuberculatus | Mäklin |
Motsvarande poster i sead_production
:
select taxon_id, family_name, genus_name, species, author_name, taxonomic_code
from tbl_taxa_tree_master
join tbl_taxonomic_order using (taxon_id)
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
join tbl_taxa_tree_families using (family_id)
where taxonomic_code in (1.0120030000, 1.0120045000)
taxon_id | family_name | genus_name | species | author_name | taxonomic_code |
---|---|---|---|---|---|
28965 | CARABIDAE | Elaphrus | riparius | (L.) | 1.0120030000 |
28966 | CARABIDAE | Elaphrus | tuberculatus | Mäklin | 1.0120045000 |
select taxon_id, family_name, genus_name, species, author_name, taxonomic_code
from tbl_taxa_tree_master
left join tbl_taxonomic_order using (taxon_id)
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
join tbl_taxa_tree_families using (family_id)
where family_name = 'CARABIDAE'
and genus_name = 'Elaphrus'
and author_name in ('(L.)', 'Mäklin')
taxon_id | family_name | genus_name | species | author_name | taxonomic_code |
---|---|---|---|---|---|
28965 | CARABIDAE | Elaphrus | riparius | (L.) | 1.0120030000 |
28966 | CARABIDAE | Elaphrus | tuberculatus | Mäklin | 1.0120045000 |
40204 | CARABIDAE | Elaphrus | riparius | (L.) |
select bugs_table, sead_table, sead_reference_id, bugs_data, manipulation_type
from bugs_import.bugs_trace
where translated_compressed_data like '%Elaphrus%'
and bugs_table = 'TSynonym'
bugs_table | sead_table | sead_reference_id | bugs_data | manipulation_type |
---|---|---|---|---|
TSynonym | tbl_taxa_tree_master | 40204 | {1.0120045,Elaphrus,riparius,(L.),Lindroth 1985,regards this as a synonym.} | INSERT |
TSynonym | tbl_species_associations | 2650 | {1.0120045,Elaphrus,riparius,(L.),Lindroth 1985,regards this as a synonym.} | INSERT |
The bug can be reproduced with the following small BugsCEP database. The data is created using Jailer
for sample SAMP000546
, with a few data points from TFossil
. A record with CODE 1.0120045 has been added to INDEX
and TSynonym
(these two records might be sufficient to reproduce the error).
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('1.0120045000000000', 'CARABIDAE', 'Elaphrus', 'tuberculatus', 'Mäklin');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('1.0120030000000000', 'CARABIDAE', 'Elaphrus', 'riparius', '(L.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('23.028001499999998', 'STAPHYLINIDAE', 'Eucnecosum', 'brachypterum (grp)', '(Grav.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('93.015056999999999', 'CURCULIONIDAE', 'Otiorhynchus', 'nodosus', '(Möll.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('40.002102000000001', 'SCIRTIDAE', 'Ptilodactyla', 'exotica', 'Chapin');
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Bell & Walker (2005)', 'Bell, M. & Walker M.J.C. (1992)', 'Late Quaternary Environmental Change - Physical and Human Perspectives (Second Edition). Longman, Essex.', NULL);
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Bohme 2005', 'Böhme, J. (2005)', 'Die Köfer Mitteleuropas. K. Katalog (Faunistiche Übersicht) (2nd ed.). Spektrum Academic, Munich.', '(revised version of Lucht 1987)');
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Morris 2006', 'Morris, M. (2006)', 'Checklist of beetles of the British Isles, Curculionidae. <www.coleopterist.org.uk/curculionidae-list.htm>.', NULL);
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Strand 1946', 'Strand, A. (1946)', 'Nord Norges Coleoptera. Tromsö Museums Arshefter, Naturhistorisk Avd. Nr. 34, 67(1). (629pp.)', NULL);
insert into "TSite" ("SiteCODE", "SiteName", "Region", "Country", "NGR", "LatDD", "LongDD", "Alt", "IDBy", "Interp", "Specimens") values ('SITE000253', 'Håkulls Mosse, Kullaberg', 'Skåne', 'Sweden', NULL, 56.2999992, 12.5333338, 125, 'Lemdahl', 'Kullen Peninsula, see also Björkeröds mosse.', NULL);
insert into "TCountsheet" ("CountsheetCODE", "CountsheetName", "SiteCODE", "SheetContext", "SheetType") values ('COUN000144', 'Hakullsmosse_bugsdata.XLS', 'SITE000253', 'Stratigraphic sequence', 'Abundances');
insert into "TSample" ("SampleCODE", "SiteCODE", "X", "Y", "ZorDepthTop", "ZorDepthBot", "RefNrContext", "CountsheetCODE") values ('SAMP000546', 'SITE000253', NULL, NULL, NULL, NULL, 'B8:6/6', 'COUN000144');
insert into "TDatesMethods" ("Abbrev", "Method", "Type", "SortOrder") values ('GeolPer', 'Geological period', 'Period', 2);
insert into "TDatesPeriod" ("PeriodDateCODE", "SampleCODE", "Uncertainty", "PeriodCODE", "DatingMethod", "Notes") values ('PERI005175', 'SAMP000546', NULL, 'LG', 'GeolPer', NULL);
insert into "TFossil" ("FossilBugsCODE", "CODE", "SampleCODE", "Abundance") values ('FOSS014299', '93.015056999999999', 'SAMP000546', 3);
insert into "TFossil" ("FossilBugsCODE", "CODE", "SampleCODE", "Abundance") values ('FOSS144182', '23.028001499999998', 'SAMP000546', 30);
insert into "TLookupCountsheetContext" ("SheetContext", "SortOrder") values ('Stratigraphic sequence', 3);
insert into "TLookupCountsheetTypes" ("CountsheetType", "SortOrder") values ('Abundances', 1);
insert into "TPeriods" ("PeriodCODE", "PeriodName", "PeriodType", "PeriodDesc", "PeriodRef", "PeriodGeog", "Begin", "BeginBCAD", "End", "EndBCAD", "YearsType") values ('LG', 'Lateglacial', 'Geological', 'Cold period after the last Glaciation. Pollen Zones I-III', 'Bell & Walker (2005)', 'Europe', 13500, 'BP', 10000, 'BP', 'C14');
insert into "TSiteOtherProxies" ("OtherProxyID", "SiteCODE", "HasPollen", "HasPlantMacro", "HasDiatoms", "HasChironomids", "HasSoilChemistry", "HasIsotopes", "HasAnimalBones", "HasArchaeology", "HasMolluscs") values (1, 'SITE000253', 1, 0, 0, 0, 0, 0, 0, 0, 0);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('93.015056999999999', 'Otiorhynchus', 'dubius', '(Ström.)', 'Bohme 2005', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('93.015056999999999', 'Otiorhynchus', 'maurus', '(Gyllenhal) non (Marsham)', 'Morris 2006', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('23.028001499999998', 'Arpedium', NULL, NULL, 'Strand 1946', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('1.0120045000000000', 'Elaphrus', 'riparius', '(L.)', 'Lindroth 1985', 'regards this as a synonym.');
insert into "TTaxoNotes" ("CODE", "Ref", "Data") values ('40.002102000000001', 'Bohme 2005', 'Genus not listed.');
This is a minimal BugsCEP database that reproduces the error:
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values
('1.0120045000000000', 'CARABIDAE', 'Elaphrus', 'tuberculatus', 'Mäklin');
('1.0120030000000000', 'CARABIDAE', 'Elaphrus', 'riparius', '(L.)');
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values
('1.0120045000000000', 'Elaphrus', 'riparius', '(L.)', 'Lindroth 1985', 'regards this as a synonym.');
The mapping between items in BugsCEP INDEX
and SEAD taxon_id
is kept in table tbl_taxonomic_order
i.e. the first step is to find taxon_id
where taxonomic_code
equals INDEX.CODE
.
The items already exists in the SEAD database, so no new items are inserted.
select *
from tbl_taxonomic_order
where taxonomic_code in (1.0120030000, 1.0120045000)
taxonomic_order_id | date_updated | taxon_id | taxonomic_code | taxonomic_order_system_id |
---|---|---|---|---|
20245 | 2014-04-17 06:02:04 | 28965 | 1.0120030000 | 1 |
20246 | 2014-04-17 06:02:04 | 28966 | 1.0120045000 | 1 |
Each BugsCEP synonym item should have a corresponding record in SEAD table tbl_species_associations
. The minimal test database has the following synonym item:
CODE | SynGenus | SynSpecies | SynAuthority | Ref | Notes |
---|---|---|---|---|---|
1.01200450 | Elaphrus | riparius | (L.) | Lindroth 1985 | regards this as a synonym. |
If a synonym is previously imported (exists in the import trace) then nothing needs to be done. Otherwise a new species association is created comprising of:
taxon_id
)associated_taxon_id
)biblio_id
)The target species (tbl_taxa_tree_master
) is found by searching for synonym's CODE in tbl_taxa_tree_order
. The source species (tbl_taxa_tree_master
) is found by searching for synonym's species, genus and author in species repository (tbl_taxa_tree_master
).
Fetch target species (tbl_taxa_tree_master
) using taxonomic_code
in tbl_taxa_tree_order
. This item should always exists - if not there is an inconsistency in the BugsCEP database.
select taxon_id
from tbl_taxonomic_order
where taxonomic_code in (1.0120045000)
taxonomic_order_id | date_updated | taxon_id | taxonomic_code | taxonomic_order_system_id |
---|---|---|---|---|
20246 | 2014-04-17 06:02:04 | 28966 |
1.0120045000 | 1 |
select taxon_id, author_name, genus_name, species
from tbl_taxa_tree_master
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
where taxon_id = 28966
taxon_id | author_name | genus_name | species |
---|---|---|---|
28966 | Mäklin | Elaphrus | tuberculatus |
Finding the source species is basically a simple search in SEAD of the given species name (riparius
), genus name (Elaphrus
) and authority name (Mäklin
).
select taxon_id, author_name, genus_name, species
from tbl_taxa_tree_master
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
where species = 'riparius'
and genus_name = 'Elaphrus'
and author_name = '(L.)'
Note that this species exists in the SEAD database. The bugs import, however, uses the authority name given by the target species ((L.)
) instead of the authority given by the synonym item (Mäklin
). This can be seen on line 149 in the following code snippet.
Since the combination (riparius
, Elaphrus
, (L.)
) is missing in the database the system adds a new species.This species is however assigned the authority of the synonym (Mäklin
), and not the authority used when searching for an existing species ((L.)
).
This results in a duplicate species record (riparius
, Elaphrus
, Mäklin
) in the database.
We have two options to correct the issue:
SynAuthor
).
Duplicates can be identified by this query: