Open infinite-dao opened 1 year ago
Note that the particle is parsed differently with and without a comma:
-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}
See also https://github.com/bionomia/dwc_agent/issues/18#issuecomment-1790992882
One could also run both name lists through dwcagent with the same method: the collector names (source) are run through dwcagent anyway, and the WikiData names could be processed the same way, as a uniform standardisation of both name lists before they are compared.
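The idea of standardising both lists with one method before comparing can be sketched as follows. This is only a minimal sketch: `normalise_simple` is a hypothetical stand-in for the real dwcagent call (it merely lower-cases and collapses whitespace), and `match_names` is an illustrative helper name, not part of any existing tool.

```python
def normalise_simple(name):
    """Hypothetical placeholder for a dwcagent-based normalisation:
    lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def match_names(collectors, wikidata, normalise=normalise_simple):
    """Return (collector, wikidata) pairs whose normalised forms are equal.

    Both lists go through the same `normalise` function, so the
    comparison happens on uniformly standardised strings.
    """
    index = {}
    for label in wikidata:
        index.setdefault(normalise(label), []).append(label)
    return [(c, w) for c in collectors for w in index.get(normalise(c), [])]


pairs = match_names(
    ["M.L.  Reyna de Aguilar"],
    ["m.l. reyna de aguilar", "Thomas Cooper"],
)
print(pairs)
```

Passing a different `normalise` callable (for example one that wraps dwcagent) would keep the comparison logic unchanged.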
To prepare itemLabel as a name field and to compile real name strings, it makes sense to check whether the content in parentheses contains years or similar and to remove it, for example with Python code:
import re

# df: pandas DataFrame holding the WikiData query results
for i, item in df.iterrows():
    thisItemLabel = item['itemLabel']
    # remove life time e.g. "… (c. 1534)" or "… (1748-1801)"
    thisItemLabel = re.sub(r" +\([c. ]*\d+[-–—]*\d+\) *", r"", thisItemLabel)
    # remove noble designations, e.g. “Sir James Nasmyth, 2nd Baronet” → “Sir James Nasmyth”
    thisItemLabel = re.sub(r" *, +[2]*(1st|2nd|3rd|[4-9]th|[1][0-9]th)[^,]+$", r"", thisItemLabel)
    # split the cleaned label into name words at spaces and dots
    namewords = re.split('[ .]', thisItemLabel)
    # … do further things with "namewords"
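The same clean-up can be wrapped in small functions so that it can be tried on single labels without a DataFrame. This is a sketch only: the function names `clean_item_label` and `name_words` are made up for illustration, and unlike the loop above, `name_words` drops the empty strings that `re.split` produces after a dot.

```python
import re

# Life dates in parentheses, e.g. " (c. 1534)" or " (1748-1801)";
# \u2013 and \u2014 are the en dash and em dash
LIFETIME = re.compile(r" +\([c. ]*\d+[-\u2013\u2014]*\d+\) *")
# Trailing ordinal titles, e.g. ", 2nd Baronet"
ORDINAL_TITLE = re.compile(r" *, +[2]*(1st|2nd|3rd|[4-9]th|[1][0-9]th)[^,]+$")


def clean_item_label(label):
    """Strip life dates and trailing ordinal titles from an itemLabel."""
    label = LIFETIME.sub("", label)
    label = ORDINAL_TITLE.sub("", label)
    return label


def name_words(label):
    """Split the cleaned label at spaces and dots, skipping empty pieces."""
    return [w for w in re.split(r"[ .]", clean_item_label(label)) if w]


print(clean_item_label("William Vernon (c. 1666-1711)"))
print(clean_item_label("Sir James Nasmyth, 2nd Baronet"))
print(name_words("Robert W. Jones (botanist)"))
```

Parentheses without years, such as "(botanist)", are left untouched by these two patterns.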
To clean up person names as well as possible, we can also check which WikiData names contain parentheses in itemLabel and what the parentheses contain. The query data can be searched as follows:
cd /collector-matching/data
# make tabular data from csv
../bin/csv2tsv.py wikidata_persons_botanists_20231030_1539.csv
# get awk field names
head wikidata_persons_botanists_20231030_1539.csv.tsv -n 1 | sed 's@\t@\n@g' | awk '{print "# " $1 " ($" NR ")" }'
# Unnamed: ($1)
# item ($2)
# itemLabel ($3)
# surname ($4)
# initials ($5)
# canonical_string ($6)
# canonical_string_fullname ($7)
# orcid ($8)
# viaf ($9)
# …
cat wikidata_persons_botanists_20231030_1539.csv.tsv \
| awk --field-separator=$"\t" '{print $3}' \
| grep -i ")$" | sort --field-separator="(" -k2
# sort --debug --field-separator="(" -k2 … will sort by the parentheses, second field:
# Gustav Adolf Ferdinand Eichler (1835-1906)
# __________
# __________________________________________
# Søren Sørensen (1873-1926)
# __________
# __________________________
# and so on
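The grep and sort step can also be sketched in Python. This is a rough equivalent under assumptions: `labels_with_trailing_parentheses` is an illustrative name, the labels are assumed to be already loaded into a list, and Python sorts by code point rather than by locale, so the order may differ slightly from `sort`.

```python
def labels_with_trailing_parentheses(labels):
    """Keep labels that end with ")" and sort them by the text
    following the first "(", similar to grep ")$" | sort -t "(" -k2."""
    hits = [label for label in labels if label.endswith(")") and "(" in label]
    return sorted(hits, key=lambda label: label.split("(", 1)[1])


labels = [
    "Søren Sørensen (1873-1926)",
    "Thomas Cooper (botanist)",
    "Reinaldo Aguilar",
    "Gustav Adolf Ferdinand Eichler (1835-1906)",
]
for label in labels_with_trailing_parentheses(labels):
    print(label)
```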
So we get names having parentheses like …
When parsing the itemLabel names with dwcagent as well …
IFS=$'\n'
for text in $(cat wikidata_persons_botanists_20231030_1539.csv.tsv \
  | awk --field-separator=$"\t" '{print $3}' \
  | grep -i ")$" | sort --field-separator="(" -k2); do
  echo -en "-----------\ninput: $text\n"
  results=$(dwcagent "${text}")
  if [[ "${results-}" == "[]" ]]; then
    echo "output: $results"
  else
    # print each parsed name as compact JSON, dropping null-valued fields
    echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
  fi
done
unset IFS  # restore default word splitting
… we get:
-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}
As the output shows, dwcagent drops the content in parentheses; this is intended, see the answer at https://github.com/bionomia/dwc_agent/issues/18#issuecomment-1810976221
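The jq filter used in the loop above (dropping null-valued fields from each parsed name) can be sketched in Python as well. The sample JSON string is an assumption for illustration: it presumes that the raw dwcagent output is a JSON array whose objects carry a null `particle` when no particle was found, which is what the jq filter implies. `drop_null_fields` is a made-up helper name.

```python
import json


def drop_null_fields(dwcagent_json):
    """Parse a JSON array of name objects and remove null-valued keys,
    like jq's with_entries(select(.value != null))."""
    records = json.loads(dwcagent_json)
    return [{k: v for k, v in rec.items() if v is not None} for rec in records]


# Assumed shape of the raw output, for illustration only
raw = '[{"family":"Eichler","given":"Gustav Adolf Ferdinand","particle":null}]'
for record in drop_null_fields(raw):
    print(json.dumps(record, ensure_ascii=False, separators=(",", ":")))
```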
family, given names
# see which names have a comma in canonical_string_fullname ($7):
cat wikidata_persons_botanists_20231115_1643.csv.tsv \
| awk --field-separator=$"\t" '{print $7}' \
| grep -i "," | sort
… we get:
Aleksandr [Alexander] Fedorovitch [Fedorovic, Theodorovich, Theodorowitsch] Flerov
Alton McCaleb Harvill, Jr.
Amelia Egerton, Lady Hume
Antonio, Jr. Bertoloni
Arnold Schultze, later Schultze-Rhonhof
Arthur C. Risser, Jr.
Athey Graves Gillaspie, Jr.
Bai, Xin Xiang
Barbara Rawdon-Hastings, Marchioness of Hastings
Baudoin-Bodin, Jacqueline
Benjamin Silliman, Sr.
Brennecke, Dorothea
Cao, Hong Lin
Charles Montague Cooke, Jr.
Charles W. Hagen, Jr.
Chen, Hai Ling
Chen, Mou
Chi, Chün-tao
Claude Earle, Jr. Smith
Cormack, R. G. H.
Dai, Kai-Zhi
Dayal, Ram
DeKalb Russell, Jr.
Donald J. Barnett, Jr.
Eduard Ladislas Kaunitz, baron von Holmberg
Edward Gerrard, Jr.
Edward Smith Deevey, Jr.
Edwin Horace Bryan, Jr.
Elías, hermano
Eric Ashby, Baron Ashby
Francis Abbott, Jr.
François-Louis Laporte, comte de Castelnau
Frank Cooper Craighead, Sr.
Gajón Sánchez, Carlos
Garland R. Upchurch, Jr.
George Sherman Avery, Jr.
Georges-Louis Leclerc, Comte de Buffon
G. Robert Lunz, Jr.
Harold Jefferson Coolidge, Jr.
Harrison Gray Dyar, Jr.
Henrietta Clive, Countess of Powis
Ho, Fu-shun
Horton Holcombe Hobbs, Jr.
Hugh G. Gauch, Jr.
Imre Máthé, jr.
Ion(Ioan,Joan) C. Constantineanu
Irat, Pierre François Albert
, Ivar Mathias Wartiainen
James Payne, Jr Smith
James Raymond Bray, Jr.
James Veitch, Jr.
Jenö (Eugen,Eugène) Vadas
J., IV Riddell
J. Knox Jones, Jr.
J. Mincy, Jr. Moffett
John E., III Fairey
John Milton Fogg, Jr.
John Watson, jun. Angell
Lansing, Odelle Edward, Jr.
Len Lindstrand, III
Leonard Charles Ferrington, Jr.
Lin, Wen-Chih
Lin, Ying
Louisa, Countess of Aylesford
Lucian West Chaney, Jr.
Marcus Ward Lyon, Jr.
Margaret Bentinck, Duchess of Portland
Marguerite Augusta Marie Löwenhielm, duchesse de Fitz-James
Mary Somerset, Duchess of Beaufort
Melchor S. Sumalinog, Jr.
Octave-Henri Gabriel, comte de Ségur
Paul David Hurd, Jr.
Paul Wilhelm, Duke of Württemberg
Peng, Shi-Fang
P.M.J., van Hoeken-Klinkenberg
Qian, Yingqian
Raymond Andrew Paynter, Jr.
Reginald Heber Howe, Jr.
Reinier Cornelis Bakhuizen van den Brink, Jr.
Rev. Alcott, William P.
Richard E. Riefner, Jr.
Robert Edward Perdue, Jr.
Robert Etheridge, Junior
R., of Suhr Haller
Ross H. Arnett, Jr.
Ruchinger, Giuseppe d. J.
Ruska, W. F.
Samuel J. Ciurca, Jr.
Sezana, Sr Grom
Sieker, W. E.
Stanley Gordon Jewett, Jr.
Stefan, Jr. Gartner
Theodore Salisbury Woolsey, Jr.
Thomas F. McGuinness, Jr.
Tomas Reyes, Jr.
Welle, B. J. H. ter
Wilhelm, der Ältere Hartmann
William Coxe, Jr.
William E., III, Fox
William F. Rapp, Jr.
William Gray Gambill, Jr.
William Hartman Woodin, III
William H. Weston, Jr.
William Roxburgh, Junior
Withall, Elizabeth M.
Wu, Zong-Lian
Абрамова, Лариса Михайловна
Алексеев, Яков Яковлевич
Бахиев, Амин Бахиевич
Вениаминов, Пётр Дмитриевич
Вердеревский, Дмитрий Дмитриевич
Висковатов, Валериан Александрович
Водков, Аркадий Петрович
Гаганов, Павел Гаврилович
Гербановский, Христофор Исидорович
Головкин, Борис Николаевич
Гоманьков, Алексей Владимирович
Гроздов, Борис Владимирович
Гроссет, Гуго Эдгарович
Дедов, Андрей Алексеевич
Декапрелевич, Леонард Леонардович
Декенбах, Константин Николаевич
Дзевановский, Сергей Антонович
Заблуда, Григорий Васильевич
Керн, Эдуард Эдуардович
Коверга, Анатолий Сафронович
Козловская, Наталия Витальевна
Корнух-Троцкий, Пётр Яковлевич
Кота-Санчес, Уго
Красичков, Вячеслав Прокофьевич
Крафтс, Алден Спрингер
Крашенинников, Фёдор Николаевич
Крутицкий, Пётр Яковлевич
Кудеяров, Валерий Николаевич
Куплеваский, Николай Осипович
Кученёва, Галина Георгиевна
Леваковский, Николай Фёдорович
Лейсле, Виктор Филиппович
Менабде, Владимир Леванович
Мирахмедов, Садык Мирахмедович
Михайлов, Дмитрий Сергеевич
Мошков, Борис Сергеевич
Нелюбов, Дмитрий Николаевич
Норин, Борис Николаевич
Нюкша, Юлия Петровна
Паламарь-Мордвинцева, Галина Михайловна
Потапенко, Георгий Иосифович
Раздорский, Владимир Фёдорович
Ремнёва, Зоя Ивановна
Рещиков, Михаил Андреевич
Савич, Реля
Сказкин, Фёдор Данилович
Скробишевский, Владислав Яковлевич
Смирнов, Александр Иванович
Страхов, Тимофей Даниилович
Тимофеев, Пётр Алексеевич
Треспе, Георгий Германович
Файвуш, Георгий Маркович
Хисориев, Хикмат Хисориевич
Чекалинская, Наталья Ивановна
Чистяков, Иван Дорофеевич
Шель, Юлиан Карлович
Шигапов, Зиннур Хайдарович
Штейп, Владимир Владимирович
Шуранов, Пётр Григорьевич
Щепотьев, Фёдор Львович
Щерба, Виктор Васильевич
Янович, Алексей Онисимович
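The comma list mixes at least two cases: inverted names in "family, given" order and names with a generational suffix such as "Jr." or "III". A rough heuristic to tell them apart could look like the sketch below. The suffix list and the name `comma_kind` are assumptions chosen for illustration, and the pattern will misjudge some names (for example a given name "Jun" directly after the comma).

```python
import re

# Heuristic list of generational suffixes directly after a comma (an assumption)
SUFFIX = re.compile(r",\s*(?:Jr|Sr|jun|Junior|Senior|II|III|IV)\b\.?", re.IGNORECASE)


def comma_kind(name):
    """Classify a name string by the role its comma seems to play."""
    if "," not in name:
        return "no comma"
    if SUFFIX.search(name):
        return "generational suffix"
    return "other"


for name in ["Alton McCaleb Harvill, Jr.", "Bai, Xin Xiang",
             "William Hartman Woodin, III", "Henrietta Clive, Countess of Powis"]:
    print(name, "->", comma_kind(name))
```

Names in the "other" group still need a closer look, since they cover inverted names as well as titles like "Countess of Powis".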
itemLabel by dwcagent as well
Perhaps also include skos:altLabel for the names, to get more alternative name spellings from WikiData. One could try (356,708 results in 35,640 ms):
SELECT DISTINCT
  ?item ?itemLabel ?altLabel ?altLabel_lang ?abbr
  ?yob ?yod
  ?fly ?wyb ?wye
  ?orcid ?viaf ?isni ?harv ?ipni ?bionomia_id
WHERE {
  ?item wdt:P31 wd:Q5 ;
        p:P106 ?statement_occupation_botanist.
  # ?statement_occupation_botanist (ps:P106) wd:Q2374149.
  ?statement_occupation_botanist (ps:P106/(wdt:P279*)) wd:Q2374149.
  OPTIONAL { ?item rdfs:label ?itemLabel . FILTER (lang(?itemLabel) IN("en", "de") ) }
  OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) IN("en", "de") )
    BIND( lang(?altLabel) as ?altLabel_lang )
  }
  OPTIONAL { ?item wdt:P496 ?orcid . }
  OPTIONAL { ?item wdt:P214 ?viaf . }
  OPTIONAL { ?item wdt:P213 ?isni . }
  OPTIONAL { ?item wdt:P6264 ?harv . }
  OPTIONAL { ?item wdt:P586 ?ipni . }
  OPTIONAL { ?item wdt:P428 ?abbr . }
  OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
  OPTIONAL { ?item wdt:P569 ?dob . BIND(YEAR(?dob) as ?yob) }
  OPTIONAL { ?item wdt:P570 ?dod . BIND(YEAR(?dod) as ?yod) }
  OPTIONAL { ?item wdt:P1317 ?fl . BIND(YEAR(?fl) as ?fly) } # floruit year
  OPTIONAL { ?item wdt:P2031 ?wpb . BIND(YEAR(?wpb) as ?wyb) } # work period beginning
  OPTIONAL { ?item wdt:P2032 ?wpe . BIND(YEAR(?wpe) as ?wye) } # work period end
}
LIMIT 400000 # 356,708 results in 35,640 ms – it seems faster to limit it just above the real number of total results
… or removing fields with sparse data, like ?fly, ?wyb and ?wye, can be faster (356,594 results in 12,644 ms):
SELECT DISTINCT
  ?item ?itemLabel ?altLabel ?altLabel_lang ?abbr
  ?yob ?yod
  ?orcid ?viaf ?isni ?harv ?ipni ?bionomia_id
WHERE {
  ?item wdt:P31 wd:Q5 ;
        p:P106 ?statement_occupation_botanist.
  # ?statement_occupation_botanist (ps:P106) wd:Q2374149.
  ?statement_occupation_botanist (ps:P106/(wdt:P279*)) wd:Q2374149.
  OPTIONAL { ?item rdfs:label ?itemLabel . FILTER (lang(?itemLabel) IN("en", "de") ) }
  OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) IN("en", "de") )
    BIND( lang(?altLabel) as ?altLabel_lang )
  }
  OPTIONAL { ?item wdt:P496 ?orcid . }
  OPTIONAL { ?item wdt:P214 ?viaf . }
  OPTIONAL { ?item wdt:P213 ?isni . }
  OPTIONAL { ?item wdt:P6264 ?harv . }
  OPTIONAL { ?item wdt:P586 ?ipni . }
  OPTIONAL { ?item wdt:P428 ?abbr . }
  OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
  OPTIONAL { ?item wdt:P569 ?dob . BIND(YEAR(?dob) as ?yob) }
  OPTIONAL { ?item wdt:P570 ?dod . BIND(YEAR(?dod) as ?yod) }
}
LIMIT 400000 # it seems faster to limit it just above the real number of total results
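Because ?altLabel sits in an OPTIONAL block, such a query returns one row per combination of item and alternative label, so the same person can appear in many rows. Before matching, the spellings could be collected per item, roughly as in this sketch. The rows are assumed to be dictionaries keyed by the query's variable names (for example read from a CSV export); `collect_name_variants` is an illustrative name and the altLabel value in the example is invented.

```python
from collections import defaultdict


def collect_name_variants(rows):
    """Group query result rows by item and gather all distinct
    itemLabel and altLabel strings for each item."""
    variants = defaultdict(set)
    for row in rows:
        for key in ("itemLabel", "altLabel"):
            value = row.get(key)
            if value:  # skip missing or empty labels
                variants[row["item"]].add(value)
    return dict(variants)


rows = [
    {"item": "http://www.wikidata.org/entity/Q33661023",
     "itemLabel": "Reinaldo Aguilar", "altLabel": "R. Aguilar"},  # altLabel invented
    {"item": "http://www.wikidata.org/entity/Q33661023",
     "itemLabel": "Reinaldo Aguilar", "altLabel": ""},
]
print(collect_name_variants(rows))
```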
After name parsing, the composed name should ideally reflect the real-life names on Wikidata, so the question is how to concatenate the compare-name after dwc_agent parsing. For instance, the name particle can also contain a name part (in 3.0.16.0):
… in Wikidata, e.g. “Reinaldo Aguilar” (https://www.wikidata.org/wiki/Q33661023) could be considered a closer match to “M. L. Reyna de Aguilar” than to “M. L. Aguilar” without the particle. So the name particle should be concatenated as well in those cases where the particle contains multiple words, like “Reyna de” or perhaps also “van der”, and so on.
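One way to compose the compare-name from the parsed parts, keeping the particle, is sketched below. It assumes the parsed name is available as a dictionary with the keys `family`, `given` and `particle` as in the dwcagent output shown in this thread. The function name and the `multiword_particle_only` switch are illustrative assumptions, not an existing API.

```python
def compose_compare_name(parsed, multiword_particle_only=False):
    """Join given name, particle and family name into one string.

    With multiword_particle_only=True, single-word particles such as
    "de" are left out and only multi-word ones like "Reyna de" are kept.
    """
    particle = parsed.get("particle") or ""
    if multiword_particle_only and len(particle.split()) < 2:
        particle = ""
    parts = [parsed.get("given"), particle, parsed.get("family")]
    return " ".join(p for p in parts if p)


print(compose_compare_name({"family": "Aguilar", "given": "M.L.", "particle": "Reyna de"}))
print(compose_compare_name({"family": "Aguilar", "given": "M. L. Reyna", "particle": "de"}))
```

Whether single-word particles should also be kept is left open here; the switch only makes both variants easy to compare.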