Improve the preparation for the name match (given name, family, particle, etc.)

infinite-dao commented 1 year ago

After name parsing, the composed name should ideally reflect the real-life names as they appear on Wikidata. So the question is how to concatenate the compare-name after dwc_agent parsing. For instance, the name particle can also contain a name part (in 3.0.16.0):

family	given	…	particle
Aguilar	M.L.	…	Reyna de

… in Wikidata, e.g. “Reinaldo Aguilar” (https://www.wikidata.org/wiki/Q33661023) could be considered a closer match to “M. L. Reyna de Aguilar” than to “M. L. Aguilar” without the particle. So the name particle should be concatenated as well in those cases where the particle contains multiple words, like “Reyna de” or perhaps also “van der”, and so on.
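A minimal sketch of how such a compare-name could be composed from the parsed fields. The function name `compose_compare_name` is hypothetical, and it simply includes every non-empty particle; whether single-word particles should be treated differently is left open here.

```python
def compose_compare_name(parsed: dict) -> str:
    """Join given, particle and family into one string for comparison."""
    parts = [parsed.get("given"), parsed.get("particle"), parsed.get("family")]
    # skip missing or empty parts and collapse repeated whitespace inside each part
    return " ".join(" ".join(p.split()) for p in parts if p)


# parsed fields as in the table above
parsed = {"family": "Aguilar", "given": "M.L.", "particle": "Reyna de"}
print(compose_compare_name(parsed))
```

With the particle included, the composed string keeps “Reyna de” between the initials and the family name instead of dropping it.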

infinite-dao commented 1 year ago

Note that the particle is parsed differently with and without a comma:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

infinite-dao commented 11 months ago

One could also run both name lists through dwcagent with the same method: the collector names (source) are run through dwcagent anyway, and the WikiData names could be subjected to the same method, as a uniform standardisation of both name lists, which are then compared.
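The idea can be sketched as follows: both lists pass through one and the same standardising function, and names are compared only on the resulting key. `match_by_key` and `toy_standardise` are hypothetical names; `toy_standardise` is only a stand-in for the dwcagent parsing step, not its actual behaviour.

```python
from typing import Callable, Dict, Iterable, List


def match_by_key(source: Iterable[str], wikidata: Iterable[str],
                 standardise: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each source name to the WikiData names sharing its standardised key."""
    index: Dict[str, List[str]] = {}
    for name in wikidata:
        index.setdefault(standardise(name), []).append(name)
    return {name: index.get(standardise(name), []) for name in source}


def toy_standardise(name: str) -> str:
    # stand-in for dwcagent (assumption): lower-case, drop dots and commas, sort tokens
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(sorted(tokens))


print(match_by_key(["Eichler, Gustav"], ["Gustav Eichler", "Helge Buen"], toy_standardise))
```

Passing the standardiser in as a parameter keeps the comparison independent of which parser is used, so the stand-in could be swapped for a real dwcagent call.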

infinite-dao commented 11 months ago

WikiData names with parentheses

To prepare itemLabel as a name field, and to compile real name strings, it makes sense to check the contents in parentheses to see if they contain years or similar and remove them, for example as Python code:

import re

# df: pandas DataFrame of the WikiData query results, with an itemLabel column
for i, item in df.iterrows():
    thisItemLabel = item['itemLabel']
    # remove life dates, e.g. "… (c. 1534)" or "… (1748-1801)"
    thisItemLabel = re.sub(r" +\([c. ]*\d+[-–—]*\d+\) *", "", thisItemLabel)
    # remove noble designations, e.g. “Sir James Nasmyth, 2nd Baronet” → “Sir James Nasmyth”
    thisItemLabel = re.sub(r" *, +[2]*(1st|2nd|3rd|[4-9]th|[1][0-9]th)[^,]+$", "", thisItemLabel)
    # split into name words at spaces and dots
    namewords = re.split('[ .]', thisItemLabel)
    # … do further things with "namewords"

To clean up and prepare person names as well as possible, we can check which WikiData names contain parentheses in itemLabel and what the parentheses contain. For this we can search the query data as follows:

cd /collector-matching/data
# make tabular data from csv
../bin/csv2tsv.py wikidata_persons_botanists_20231030_1539.csv

# get awk field names
head wikidata_persons_botanists_20231030_1539.csv.tsv -n 1 | sed 's@\t@\n@g' | awk '{print "# " $1 " ($" NR ")" }'
# Unnamed: ($1)
# item ($2)
# itemLabel ($3)
# surname ($4)
# initials ($5)
# canonical_string ($6)
# canonical_string_fullname ($7)
# orcid ($8)
# viaf ($9)
# …

cat wikidata_persons_botanists_20231030_1539.csv.tsv \
  | awk --field-separator=$'\t' '{print $3}' \
  | grep -i ")$" | sort --field-separator="("  -k2
  # sort --debug --field-separator="(" -k2 … will sort by the parentheses, second field:
  # Gustav Adolf Ferdinand Eichler (1835-1906)
  #                                 __________
  # __________________________________________
  # Søren Sørensen (1873-1926)
  #                 __________
  # __________________________
  # aso.

So we get names having parentheses like …

Gustav Adolf Ferdinand Eichler (1835-1906)
Søren Sørensen (1873-1926)
Georges André (1888–1973)
Johannes Johannessen (1904-1990)
Helge Buen (1918-2005)
Vlk Valenta (1925-2010)
Amalesh Choudhury (bot.)
Bror Pettersson (botaniker)
Robert W. Jones (botanist)
Thomas Cooper (botanist)
Yi Huang (botanist-1)
William Vernon (c. 1666-1711)
James Smith (diatomist)
Josep María Vidal(-Frigola)
István Balázs (instruisto)
Hildur von Rettig (Lindberg)
Inger Kaasa (Magistad)
Kai Zhang (mycologist)
Ting-Ting Zhang (mycologist)
Robert J. Ferry (Sr.)
Phraya Wanpruekphichan (Thongkham Savetsila)
Phraya Winitwanandon (To Komet)
Maria Pavlovna Nagibina (Tsybulskaya)
O. Heylen (-Walraevens)
Bill Kasongo (Wa Ngoy Kashiki)
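The listing shows roughly three kinds of parentheses content: life dates, occupational qualifiers, and everything else (alternative names, suffixes like “Sr.”). A rough sketch of how one might sort labels into these groups; the function name, the category names and the `QUALIFIERS` set are assumptions taken only from the examples above and are certainly incomplete.

```python
import re

PAREN = re.compile(r"\(([^()]*)\)\s*$")  # trailing "(...)"
YEARS = re.compile(r"^(c\.\s*)?\d{3,4}(\s*[-–—]\s*\d{3,4})?$")
# assumption: qualifiers collected by hand from the listing above
QUALIFIERS = {"bot.", "botaniker", "botanist", "diatomist", "instruisto", "mycologist"}


def classify_parentheses(label: str) -> str:
    """Return 'none', 'years', 'qualifier' or 'other' for a label's trailing parentheses."""
    m = PAREN.search(label)
    if not m:
        return "none"
    content = m.group(1).strip()
    if YEARS.match(content):
        return "years"
    # numbered qualifiers such as "botanist-1" count as qualifiers too
    if content.lower() in QUALIFIERS or re.match(r"^[a-z]+-\d+$", content):
        return "qualifier"
    return "other"


for label in ["William Vernon (c. 1666-1711)", "Yi Huang (botanist-1)",
              "Hildur von Rettig (Lindberg)"]:
    print(label, "->", classify_parentheses(label))
```

Labels in the “other” group probably need manual review, because a name in parentheses (e.g. “(Lindberg)”) may itself be worth keeping as an alternative name.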

Parse names with parentheses also with dwcagent

When parsing the itemLabel names with dwcagent as well …

IFS=$'\n'
for text in $(cat wikidata_persons_botanists_20231030_1539.csv.tsv \
  | awk --field-separator=$'\t' '{print $3}' \
  | grep -i ")$" | sort --field-separator="("  -k2); do
  echo -en "-----------\ninput: $text\n"
  results=$(dwcagent "${text}")
  if [[ "${results-}" == "[]" ]]; then
    echo "output: $results"
  else
    # print each parsed name compactly, without null-valued fields
    echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
  fi
done
unset IFS

… we get:

-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}

dwcagent removes the content of parentheses during parsing by design, see the answer at https://github.com/bionomia/dwc_agent/issues/18#issuecomment-1810976221

Names in order `family, given names`

# see which names have a comma in canonical_string_fullname ($7):
cat wikidata_persons_botanists_20231115_1643.csv.tsv \
  | awk --field-separator=$'\t' '{print $7}' \
  | grep -i "," | sort

… we get:

Aleksandr [Alexander] Fedorovitch [Fedorovic, Theodorovich, Theodorowitsch] Flerov
Alton McCaleb Harvill, Jr.
Amelia Egerton, Lady Hume
Antonio, Jr. Bertoloni
Arnold Schultze, later Schultze-Rhonhof
Arthur C. Risser, Jr.
Athey Graves Gillaspie, Jr.
Bai, Xin Xiang
Barbara Rawdon-Hastings, Marchioness of Hastings
Baudoin-Bodin, Jacqueline
Benjamin Silliman, Sr.
Brennecke, Dorothea
Cao, Hong Lin
Charles Montague Cooke, Jr.
Charles W. Hagen, Jr.
Chen, Hai Ling
Chen, Mou
Chi, Chün-tao
Claude Earle, Jr. Smith
Cormack, R. G. H.
Dai, Kai-Zhi
Dayal, Ram
DeKalb Russell, Jr.
Donald J. Barnett, Jr.
Eduard Ladislas Kaunitz, baron von Holmberg
Edward Gerrard, Jr.
Edward Smith Deevey, Jr.
Edwin Horace Bryan, Jr.
Elías, hermano
Eric Ashby, Baron Ashby
Francis Abbott, Jr.
François-Louis Laporte, comte de Castelnau
Frank Cooper Craighead, Sr.
Gajón Sánchez, Carlos
Garland R. Upchurch, Jr.
George Sherman Avery, Jr.
Georges-Louis Leclerc, Comte de Buffon
G. Robert Lunz, Jr.
Harold Jefferson Coolidge, Jr.
Harrison Gray Dyar, Jr.
Henrietta Clive, Countess of Powis
Ho, Fu-shun
Horton Holcombe Hobbs, Jr.
Hugh G. Gauch, Jr.
Imre Máthé, jr.
Ion(Ioan,Joan) C. Constantineanu
Irat, Pierre François Albert
, Ivar Mathias Wartiainen
James Payne, Jr Smith
James Raymond Bray, Jr.
James Veitch, Jr.
Jenö (Eugen,Eugène) Vadas
J., IV Riddell
J. Knox Jones, Jr.
J. Mincy, Jr. Moffett
John E., III Fairey
John Milton Fogg, Jr.
John Watson, jun. Angell
Lansing, Odelle Edward, Jr.
Len Lindstrand, III
Leonard Charles Ferrington, Jr.
Lin, Wen-Chih
Lin, Ying
Louisa, Countess of Aylesford
Lucian West Chaney, Jr.
Marcus Ward Lyon, Jr.
Margaret Bentinck, Duchess of Portland
Marguerite Augusta Marie Löwenhielm, duchesse de Fitz-James
Mary Somerset, Duchess of Beaufort
Melchor S. Sumalinog, Jr.
Octave-Henri Gabriel, comte de Ségur
Paul David Hurd, Jr.
Paul Wilhelm, Duke of Württemberg
Peng, Shi-Fang
P.M.J., van Hoeken-Klinkenberg
Qian, Yingqian
Raymond Andrew Paynter, Jr.
Reginald Heber Howe, Jr.
Reinier Cornelis Bakhuizen van den Brink, Jr.
Rev. Alcott, William P.
Richard E. Riefner, Jr.
Robert Edward Perdue, Jr.
Robert Etheridge, Junior
R., of Suhr Haller
Ross H. Arnett, Jr.
Ruchinger, Giuseppe d. J.
Ruska, W. F.
Samuel J. Ciurca, Jr.
Sezana, Sr Grom
Sieker, W. E.
Stanley Gordon Jewett, Jr.
Stefan, Jr. Gartner
Theodore Salisbury Woolsey, Jr.
Thomas F. McGuinness, Jr.
Tomas Reyes, Jr.
Welle, B. J. H. ter
Wilhelm, der Ältere Hartmann
William Coxe, Jr.
William E., III, Fox
William F. Rapp, Jr.
William Gray Gambill, Jr.
William Hartman Woodin, III
William H. Weston, Jr.
William Roxburgh, Junior
Withall, Elizabeth M.
Wu, Zong-Lian
Абрамова, Лариса Михайловна
Алексеев, Яков Яковлевич
Бахиев, Амин Бахиевич
Вениаминов, Пётр Дмитриевич
Вердеревский, Дмитрий Дмитриевич
Висковатов, Валериан Александрович
Водков, Аркадий Петрович
Гаганов, Павел Гаврилович
Гербановский, Христофор Исидорович
Головкин, Борис Николаевич
Гоманьков, Алексей Владимирович
Гроздов, Борис Владимирович
Гроссет, Гуго Эдгарович
Дедов, Андрей Алексеевич
Декапрелевич, Леонард Леонардович
Декенбах, Константин Николаевич
Дзевановский, Сергей Антонович
Заблуда, Григорий Васильевич
Керн, Эдуард Эдуардович
Коверга, Анатолий Сафронович
Козловская, Наталия Витальевна
Корнух-Троцкий, Пётр Яковлевич
Кота-Санчес, Уго
Красичков, Вячеслав Прокофьевич
Крафтс, Алден Спрингер
Крашенинников, Фёдор Николаевич
Крутицкий, Пётр Яковлевич
Кудеяров, Валерий Николаевич
Куплеваский, Николай Осипович
Кученёва, Галина Георгиевна
Леваковский, Николай Фёдорович
Лейсле, Виктор Филиппович
Менабде, Владимир Леванович
Мирахмедов, Садык Мирахмедович
Михайлов, Дмитрий Сергеевич
Мошков, Борис Сергеевич
Нелюбов, Дмитрий Николаевич
Норин, Борис Николаевич
Нюкша, Юлия Петровна
Паламарь-Мордвинцева, Галина Михайловна
Потапенко, Георгий Иосифович
Раздорский, Владимир Фёдорович
Ремнёва, Зоя Ивановна
Рещиков, Михаил Андреевич
Савич, Реля
Сказкин, Фёдор Данилович
Скробишевский, Владислав Яковлевич
Смирнов, Александр Иванович
Страхов, Тимофей Даниилович
Тимофеев, Пётр Алексеевич
Треспе, Георгий Германович
Файвуш, Георгий Маркович
Хисориев, Хикмат Хисориевич
Чекалинская, Наталья Ивановна
Чистяков, Иван Дорофеевич
Шель, Юлиан Карлович
Шигапов, Зиннур Хайдарович
Штейп, Владимир Владимирович
Шуранов, Пётр Григорьевич
Щепотьев, Фёдор Львович
Щерба, Виктор Васильевич
Янович, Алексей Онисимович

Challenges and to-dos

[x] clean date in parentheses
[x] remove general occupational titles
[ ] deal with names having a comma
[ ] find the last family name or parse itemLabel by dwcagent as well

infinite-dao commented 11 months ago

Perhaps also include skos:altLabel for the names, to get more alternative spellings from WikiData. One could try (356,708 results in 35,640 ms):

SELECT DISTINCT 
  ?item ?itemLabel ?altLabel ?altLabel_lang ?abbr 
  ?yob ?yod
  ?fly ?wyb ?wye
  ?orcid ?viaf ?isni ?harv ?ipni ?bionomia_id 
  WHERE {
    ?item wdt:P31 wd:Q5 ;
        p:P106 ?statement_occupation_botanist.
    # ?statement_occupation_botanist (ps:P106) wd:Q2374149.
    ?statement_occupation_botanist (ps:P106/(wdt:P279*)) wd:Q2374149.
    OPTIONAL { ?item rdfs:label ?itemLabel . FILTER (lang(?itemLabel) IN("en", "de") ) }
    OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) IN("en", "de") ) 
              BIND( lang(?altLabel)  as ?altLabel_lang )
    }
    OPTIONAL { ?item wdt:P496  ?orcid . }
    OPTIONAL { ?item wdt:P214  ?viaf . }
    OPTIONAL { ?item wdt:P213  ?isni . }
    OPTIONAL { ?item wdt:P6264 ?harv . }
    OPTIONAL { ?item wdt:P586  ?ipni . }
    OPTIONAL { ?item wdt:P428  ?abbr . }
    OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
    OPTIONAL { ?item wdt:P569  ?dob . BIND(YEAR(?dob) as ?yob) }
    OPTIONAL { ?item wdt:P570  ?dod . BIND(YEAR(?dod) as ?yod) }
    OPTIONAL { ?item wdt:P1317 ?fl .  BIND(YEAR(?fl)  as ?fly) } # floruit year
    OPTIONAL { ?item wdt:P2031 ?wpb . BIND(YEAR(?wpb) as ?wyb) } # work period start
    OPTIONAL { ?item wdt:P2032 ?wpe . BIND(YEAR(?wpe) as ?wye) } # work period end
  }    
  LIMIT 400000 # 356,708 results in 35,640 ms – it seems faster to limit it just above the real number of total results

… or removing fields with sparse data, like ?fly, ?wyb and ?wye, which can be faster (356,594 results in 12,644 ms):

SELECT DISTINCT 
  ?item ?itemLabel ?altLabel ?altLabel_lang ?abbr 
  ?yob ?yod
  ?orcid ?viaf ?isni ?harv ?ipni ?bionomia_id 
  WHERE {
    ?item wdt:P31 wd:Q5 ;
        p:P106 ?statement_occupation_botanist.
    # ?statement_occupation_botanist (ps:P106) wd:Q2374149.
    ?statement_occupation_botanist (ps:P106/(wdt:P279*)) wd:Q2374149.
    OPTIONAL { ?item rdfs:label ?itemLabel . FILTER (lang(?itemLabel) IN("en", "de") ) }
    OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) IN("en", "de") ) 
              BIND( lang(?altLabel)  as ?altLabel_lang )
    }
    OPTIONAL { ?item wdt:P496  ?orcid . }
    OPTIONAL { ?item wdt:P214  ?viaf . }
    OPTIONAL { ?item wdt:P213  ?isni . }
    OPTIONAL { ?item wdt:P6264 ?harv . }
    OPTIONAL { ?item wdt:P586  ?ipni . }
    OPTIONAL { ?item wdt:P428  ?abbr . }
    OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
    OPTIONAL { ?item wdt:P569  ?dob . BIND(YEAR(?dob) as ?yob) }
    OPTIONAL { ?item wdt:P570  ?dod . BIND(YEAR(?dod) as ?yod) }
  }    
  LIMIT 400000 # it seems faster to limit it just above the real number of total results
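Because skos:altLabel is joined as OPTIONAL, the result repeats an item once per alternative label. Before matching, one would probably collect all name variants per item. A minimal sketch, assuming the result rows were already loaded as dictionaries keyed by the query variables; `collect_names` and the sample rows are hypothetical and not real WikiData entries.

```python
from collections import defaultdict


def collect_names(rows):
    """Collect itemLabel and all altLabel values per WikiData item."""
    names = defaultdict(set)
    for row in rows:
        for key in ("itemLabel", "altLabel"):
            value = row.get(key)
            if value:  # OPTIONAL fields may be missing or empty
                names[row["item"]].add(value)
    return {item: sorted(labels) for item, labels in names.items()}


# made-up rows in the shape of the query result
rows = [
    {"item": "wd:Q_EXAMPLE", "itemLabel": "Jane Example", "altLabel": "J. Example"},
    {"item": "wd:Q_EXAMPLE", "itemLabel": "Jane Example", "altLabel": "Example, Jane"},
    {"item": "wd:Q_OTHER", "itemLabel": "John Sample", "altLabel": None},
]
print(collect_names(rows))
```

Each item then carries one list of candidate name strings, which can be cleaned and compared against the collector names.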

infinite-dao / collector-matching