humlab-sead / sead_bugs_import

SEAD bugs import
2 stars 0 forks source link

Bugs import adds duplicate entries in taxa tree master #52

Open roger-mahler opened 1 year ago

roger-mahler commented 1 year ago

Duplicates can be identified by this query:

with duplicates as (
    select distinct author_name, genus_name, species
    from tbl_taxa_tree_master
    join tbl_taxa_tree_genera using (genus_id)
    join tbl_taxa_tree_authors using (author_id)
    group by author_name, genus_name, species
    having count(*) > 1
), taxa as (
    select taxon_id, genus_id, author_id, author_name, genus_name, species
    from tbl_taxa_tree_master
    join tbl_taxa_tree_genera using (genus_id)
    join tbl_taxa_tree_authors using (author_id)
)
    select taxon_id, author_name, genus_name, species, bt.*
    from duplicates
    join taxa using ( author_name, genus_name, species)
    left join bugs_import.bugs_trace as bt
      on bt.sead_table = 'tbl_taxa_tree_master'
     and bt.sead_reference_id = taxon_id
     -- and bt.bugs_table = 'INDEX'
    order by taxon_id
Bugs import Records Comment
Clean (no bugs import) 0 duplicates
sead_production_201912 78 duplicates After import of bugsdata_20190303
sead_production_201912 1636 duplicates After incremental import
roger-mahler commented 1 year ago

Sample duplicate data from sead_production_201912: author_name = '(L.)', genus_name = 'Elaphrus', species = 'riparius'

select taxon_id, t.date_updated, author_id, genus_id, author_name, genus_name, species, bt.bugs_trace_id
from tbl_taxa_tree_master t
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
left join bugs_import.bugs_trace as bt
  on bt.sead_table = 'tbl_taxa_tree_master'
 and bt.sead_reference_id = taxon_id
where author_name = '(L.)'
  and genus_name = 'Elaphrus'
  and species = 'riparius'
taxon_id date_updated author_id genus_id bugs_trace_id comment
28965 2014-04-17 06:02:04 3103 12656 existing record
40204 2019-12-20 23:26:18 3103 12656 150283 added by import

Imported data seems to be assigned to existing taxon:

select *
from tbl_abundances
join tbl_analysis_entities using (analysis_entity_id)
join tbl_datasets using (dataset_id)
where taxon_id in (28965, 40204)

No item assigned to 40204.

roger-mahler commented 1 year ago

Same data from BugsCEP:

select *
from "INDEX"
where "AUTHORITY" = '(L.)'
  and "GENUS" = 'Elaphrus'
  and "SPECIES" = 'riparius'
CODE FAMILY GENUS SPECIES AUTHORITY
1.0120030000000000 CARABIDAE Elaphrus riparius (L.)

This specie has a synonym:

select *
from "TSynonym"
where "SynAuthority" = '(L.)'
  and "SynGenus" = 'Elaphrus'
  and "SynSpecies" = 'riparius'
CODE SynGenus SynSpecies SynAuthority Ref Notes SynonymCODE
1.0120045000000000 Elaphrus riparius (L.) Lindroth 1985 regards this as a synonym. 31
select *
from "INDEX"
where "CODE" in ('1.0120030000000000', '1.0120045000000000')
CODE FAMILY GENUS SPECIES AUTHORITY
1.0120030000000000 CARABIDAE Elaphrus riparius (L.)
1.0120045000000000 CARABIDAE Elaphrus tuberculatus Mäklin

Motsvarande poster i sead_production:

select taxon_id, family_name, genus_name, species, author_name, taxonomic_code 
from tbl_taxa_tree_master
join tbl_taxonomic_order using (taxon_id)
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
join tbl_taxa_tree_families using (family_id)
where taxonomic_code in (1.0120030000, 1.0120045000)
taxon_id family_name genus_name species author_name taxonomic_code
28965 CARABIDAE Elaphrus riparius (L.) 1.0120030000
28966 CARABIDAE Elaphrus tuberculatus Mäklin 1.0120045000
select taxon_id, family_name, genus_name, species, author_name, taxonomic_code 
from tbl_taxa_tree_master
left join tbl_taxonomic_order using (taxon_id)
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
join tbl_taxa_tree_families using (family_id)
where family_name = 'CARABIDAE'
  and genus_name = 'Elaphrus'
  and author_name in ('(L.)', 'Mäklin')
taxon_id family_name genus_name species author_name taxonomic_code
28965 CARABIDAE Elaphrus riparius (L.) 1.0120030000
28966 CARABIDAE Elaphrus tuberculatus Mäklin 1.0120045000
40204 CARABIDAE Elaphrus riparius (L.)
select bugs_table, sead_table, sead_reference_id, bugs_data, manipulation_type
from bugs_import.bugs_trace
where translated_compressed_data like '%Elaphrus%'
  and bugs_table = 'TSynonym' 
bugs_table sead_table sead_reference_id bugs_data manipulation_type
TSynonym tbl_taxa_tree_master 40204 {1.0120045,Elaphrus,riparius,(L.),Lindroth 1985,regards this as a synonym.} INSERT
TSynonym tbl_species_associations 2650 {1.0120045,Elaphrus,riparius,(L.),Lindroth 1985,regards this as a synonym.} INSERT
roger-mahler commented 1 year ago

The bug can be reproduced with the following small BugsCEP database. The data is created using Jailer for sample SAMP000546, with a few data points from TFossil. A record with CODE 1.0120045 has been added to INDEX and TSynonym (these two records might be sufficient to reproduce the error).


insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('1.0120045000000000', 'CARABIDAE', 'Elaphrus', 'tuberculatus', 'Mäklin');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('1.0120030000000000', 'CARABIDAE', 'Elaphrus', 'riparius', '(L.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('23.028001499999998', 'STAPHYLINIDAE', 'Eucnecosum', 'brachypterum (grp)', '(Grav.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('93.015056999999999', 'CURCULIONIDAE', 'Otiorhynchus', 'nodosus', '(Möll.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('40.002102000000001', 'SCIRTIDAE', 'Ptilodactyla', 'exotica', 'Chapin');

insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Bell & Walker (2005)', 'Bell, M. & Walker M.J.C. (1992)', 'Late Quaternary Environmental Change - Physical and Human Perspectives (Second Edition). Longman, Essex.', NULL);
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Bohme 2005', 'Böhme, J. (2005)', 'Die Köfer Mitteleuropas. K. Katalog (Faunistiche Übersicht) (2nd ed.). Spektrum Academic, Munich.', '(revised version of Lucht 1987)');
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Morris 2006', 'Morris, M. (2006)', 'Checklist of beetles of the British Isles, Curculionidae. <www.coleopterist.org.uk/curculionidae-list.htm>.', NULL);
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Strand 1946', 'Strand, A. (1946)', 'Nord Norges Coleoptera. Tromsö Museums Arshefter, Naturhistorisk Avd. Nr. 34, 67(1). (629pp.)', NULL);

insert into "TSite" ("SiteCODE", "SiteName", "Region", "Country", "NGR", "LatDD", "LongDD", "Alt", "IDBy", "Interp", "Specimens") values ('SITE000253', 'Håkulls Mosse, Kullaberg', 'Skåne', 'Sweden', NULL, 56.2999992, 12.5333338, 125, 'Lemdahl', 'Kullen Peninsula, see also Björkeröds mosse.', NULL);

insert into "TCountsheet" ("CountsheetCODE", "CountsheetName", "SiteCODE", "SheetContext", "SheetType") values ('COUN000144', 'Hakullsmosse_bugsdata.XLS', 'SITE000253', 'Stratigraphic sequence', 'Abundances');

insert into "TSample" ("SampleCODE", "SiteCODE", "X", "Y", "ZorDepthTop", "ZorDepthBot", "RefNrContext", "CountsheetCODE") values ('SAMP000546', 'SITE000253', NULL, NULL, NULL, NULL, 'B8:6/6', 'COUN000144');

insert into "TDatesMethods" ("Abbrev", "Method", "Type", "SortOrder") values ('GeolPer', 'Geological period', 'Period', 2);

insert into "TDatesPeriod" ("PeriodDateCODE", "SampleCODE", "Uncertainty", "PeriodCODE", "DatingMethod", "Notes") values ('PERI005175', 'SAMP000546', NULL, 'LG', 'GeolPer', NULL);

insert into "TFossil" ("FossilBugsCODE", "CODE", "SampleCODE", "Abundance") values ('FOSS014299', '93.015056999999999', 'SAMP000546', 3);
insert into "TFossil" ("FossilBugsCODE", "CODE", "SampleCODE", "Abundance") values ('FOSS144182', '23.028001499999998', 'SAMP000546', 30);

insert into "TLookupCountsheetContext" ("SheetContext", "SortOrder") values ('Stratigraphic sequence', 3);

insert into "TLookupCountsheetTypes" ("CountsheetType", "SortOrder") values ('Abundances', 1);

insert into "TPeriods" ("PeriodCODE", "PeriodName", "PeriodType", "PeriodDesc", "PeriodRef", "PeriodGeog", "Begin", "BeginBCAD", "End", "EndBCAD", "YearsType") values ('LG', 'Lateglacial', 'Geological', 'Cold period after the last Glaciation. Pollen Zones I-III', 'Bell & Walker (2005)', 'Europe', 13500, 'BP', 10000, 'BP', 'C14');

insert into "TSiteOtherProxies" ("OtherProxyID", "SiteCODE", "HasPollen", "HasPlantMacro", "HasDiatoms", "HasChironomids", "HasSoilChemistry", "HasIsotopes", "HasAnimalBones", "HasArchaeology", "HasMolluscs") values (1, 'SITE000253', 1, 0, 0, 0, 0, 0, 0, 0, 0);

insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('93.015056999999999', 'Otiorhynchus', 'dubius', '(Ström.)', 'Bohme 2005', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('93.015056999999999', 'Otiorhynchus', 'maurus', '(Gyllenhal) non (Marsham)', 'Morris 2006', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('23.028001499999998', 'Arpedium', NULL, NULL, 'Strand 1946', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('1.0120045000000000', 'Elaphrus', 'riparius', '(L.)', 'Lindroth 1985', 'regards this as a synonym.');

insert into "TTaxoNotes" ("CODE", "Ref", "Data") values ('40.002102000000001', 'Bohme 2005', 'Genus not listed.');
roger-mahler commented 1 year ago

This is a minimal BugsCEP database that reproduces the error:

insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values
    ('1.0120045000000000', 'CARABIDAE', 'Elaphrus', 'tuberculatus', 'Mäklin');
    ('1.0120030000000000', 'CARABIDAE', 'Elaphrus', 'riparius', '(L.)');

insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values
    ('1.0120045000000000', 'Elaphrus', 'riparius', '(L.)', 'Lindroth 1985', 'regards this as a synonym.');
roger-mahler commented 1 year ago

Import INDEX items

The mapping between items in BugsCEP INDEXand SEAD taxon_idis kept in table tbl_taxonomic_order i.e. the first step is to find taxon_id where taxonomic_code equals INDEX.CODE.

The items already exists in the SEAD database, so no new items are inserted.

select *
from tbl_taxonomic_order
where taxonomic_code in (1.0120030000, 1.0120045000)
taxonomic_order_id date_updated taxon_id taxonomic_code taxonomic_order_system_id
20245 2014-04-17 06:02:04 28965 1.0120030000 1
20246 2014-04-17 06:02:04 28966 1.0120045000 1
roger-mahler commented 1 year ago

Import Synonym items

Each BugsCEP synonym item should have a corresponding record in SEAD table tbl_species_associations. The minimal test database has the following synonym item:

CODE SynGenus SynSpecies SynAuthority Ref Notes
1.01200450 Elaphrus riparius (L.) Lindroth 1985 regards this as a synonym.

If a synonym is previously imported (exists in the import trace) then nothing needs to be done. Otherwise a new species association is created comprising of:

The target species (tbl_taxa_tree_master ) is found by searching for synonym's CODE in tbl_taxa_tree_order. The source species (tbl_taxa_tree_master ) is found by searching for synonym's species, genus and author in species repository (tbl_taxa_tree_master ).

roger-mahler commented 1 year ago

Find or create association's target species

Fetch target species (tbl_taxa_tree_master ) using taxonomic_code in tbl_taxa_tree_order. This item should always exists - if not there is an inconsistency in the BugsCEP database.

select taxon_id
from tbl_taxonomic_order
where taxonomic_code in (1.0120045000)
taxonomic_order_id date_updated taxon_id taxonomic_code taxonomic_order_system_id
20246 2014-04-17 06:02:04 28966 1.0120045000 1

https://github.com/humlab-sead/sead_bugs_import/blob/62a882ca073791bfb82f77ebd40d1aa5a3991029/src/main/java/se/sead/bugsimport/speciessynonyms/SynonymCreator.java#L47-L54

select taxon_id, author_name, genus_name, species
from tbl_taxa_tree_master
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
where taxon_id = 28966    
taxon_id author_name genus_name species
28966 Mäklin Elaphrus tuberculatus
roger-mahler commented 1 year ago

Find or create association's source species

Finding the source species is basically a simple search in SEAD of the given species name (riparius), genus name (Elaphrus) and authority name (Mäklin).

select taxon_id, author_name, genus_name, species
from tbl_taxa_tree_master
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
where species = 'riparius'
  and genus_name = 'Elaphrus'
  and author_name = '(L.)'

Note that this species exists in the SEAD database. The bugs import, however, uses the authority name given by the target species ((L.)) instead of the authority given by the synonym item (Mäklin). This can be seen on line 149 in the following code snippet.

https://github.com/humlab-sead/sead_bugs_import/blob/62a882ca073791bfb82f77ebd40d1aa5a3991029/src/main/java/se/sead/bugsimport/speciessynonyms/SynonymSpeciesManager.java#L147-L156

Since the combination (riparius, Elaphrus, (L.)) is missing in the database the system adds a new species.This species is however assigned the authority of the synonym (Mäklin), and not the authority used when searching for an existing species ((L.)).

This results in a duplicate species record (riparius, Elaphrus, Mäklin) in the database.

roger-mahler commented 1 year ago

We have two options to correct the issue:

  1. When searching for existing source species the system should use the same autority as specified in the synonym (SynAuthor).
  2. When creating a new source species then it should be assigned the same authority as the target species.