qurator-spk / mods4pandas

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames for data analysis
Apache License 2.0

Group names given in the MODS-file according to given roles to reduce number of columns #30

Open joergleh opened 11 months ago

joergleh commented 11 months ago

Aim: Reduce the number of columns to make the data frame more manageable.

Proposal: Group the names given in the MODS file according to their roles.

Explanation: Each "name" entry in the MODS file consists of at least four parts:

- "nameXX_namePart.family"
- "nameXX_namePart.given"
- "nameXX_displayForm"
- "nameXX_role_roleTerm"

The number of columns could be reduced significantly if the names were first grouped by role and then concatenated into fewer columns.

Examples:

- PPN735425078 contains 76 names with the role "asn" (= associated name); this amounts to 304 columns, but could be reduced to three columns ("nameASN_namePart.family", "nameASN_namePart.given", "nameASN_displayForm"), each containing the 76 names in nested form (Mauschwitz; Baudis; Hoberg; ...)
- PPN858144891 contains 50 names with the role "oth" (= other); this amounts to 200 columns, but could be reduced to three columns ("nameOTH_namePart.family", "nameOTH_namePart.given", "nameOTH_displayForm")
- PPN1774254956 contains 42 names with the role "ctb" (= contributor); this amounts to 168 columns, but could be reduced to three columns ("nameCTB_namePart.family", "nameCTB_namePart.given", "nameCTB_displayForm")

The most frequently used roles are asn (associated name), oth (other), ctb (contributor), dte (dedicatee), fnd (funder), auth (author), isb (issuing body), egr (engraver), hnr (honoree), ill (illustrator), prt (printer).
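To illustrate the idea, here is a minimal sketch of the grouping as a post-processing step on an existing DataFrame. It is not how mods4pandas works internally. It assumes that "XX" in the column names is a numeric index (for example "name0_displayForm"), that names are joined with "; " as in the example above, and that the role codes are upper-cased for the new column names. The function name `group_names_by_role` and the constants are hypothetical.

```python
import re
from collections import defaultdict

import pandas as pd

# Assumption: per-name columns look like "name<index>_<field>".
NAME_COL = re.compile(r"^name(\d+)_(.+)$")
ROLE_FIELD = "role_roleTerm"
FIELDS = ["namePart.family", "namePart.given", "displayForm"]


def group_names_by_role(df: pd.DataFrame, sep: str = "; ") -> pd.DataFrame:
    """Collapse name<index>_* columns into one column per (role, field)."""
    name_cols = [c for c in df.columns if NAME_COL.match(c)]
    # Keep the index exactly as written in the column name (e.g. "07"),
    # but visit the name entries in numeric order.
    indices = sorted({NAME_COL.match(c).group(1) for c in name_cols}, key=int)

    rows = []
    for _, row in df.iterrows():
        grouped = defaultdict(list)
        for idx in indices:
            role = row.get(f"name{idx}_{ROLE_FIELD}")
            if pd.isna(role):
                continue  # this name entry has no role in this record
            for field in FIELDS:
                value = row.get(f"name{idx}_{field}")
                # An empty string keeps positions aligned across the fields.
                grouped[f"name{str(role).upper()}_{field}"].append(
                    "" if pd.isna(value) else str(value)
                )
        rows.append({col: sep.join(vals) for col, vals in grouped.items()})

    grouped_df = pd.DataFrame(rows, index=df.index)
    # Replace the original per-name columns with the grouped ones.
    return pd.concat([df.drop(columns=name_cols), grouped_df], axis=1)
```

One consequence of this design: name entries without a role term are dropped, so a real implementation would probably need a fallback column for them.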

mikegerber commented 11 months ago

Duplicate of #20.