Closed SArndt-TIB closed 4 months ago
@SArndt-TIB I will try to describe the steps I think are needed to create a CSV from the .xlsx files.
Subject Number and Subject for columns A, B. DE: Fachnummer, Fach
EN: Subject Number | Subject | Review Board | Subject Area | Scientific Discipline
DE: Fachnummer | Fach | Fachkollegium | Fachgebiet | Wissenschaftsbereich
11 Humanities
python scripts/create_ontology.py csv/2024-2028/Fachsystematik_2024-2028.csv
SECTION: 0 Scientific Discipline
INDEX: 0 COL:Scientific Discipline CELL: 1
Humanities and Social Sciences
CELL ID: <<<<1>>>
CURRENT: 1 - Humanities and Social Sciences
PARENT: <<<None>>>
Class: https://github.com/tibonto/dfgfo/1 labels: ['Humanities and Social Sciences', 'Geistes- und Sozialwissenschaften']
SECTION: 1 Subject Area
INDEX: 1 COL:Subject Area CELL: 11 Humanities
Traceback (most recent call last):
File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 96, in <module>
cell_id, cell_label = split_id_label(id_n_label=row[tree_hierarchy[index]])
File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 34, in split_id_label
id, label = id_n_label.split('\n')
ValueError: not enough values to unpack (expected 2, got 1)
Issue:
Unlike other "Subject Area" values, which separate number and label (NN Subject) with a line break, 11 Humanities uses only a space as the separator.
Fix:
Search and replace in the CSV: "11 Humanities" replaced with "11\nHumanities"
Commit: b857e8c8dfb980fb2407a8b3d92bd6cb64d67fc9
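The failure above comes from the strict two-value unpacking of id_n_label.split('\n'). As an alternative to editing the CSV by hand, the split could be made more tolerant. The following is only a sketch under assumptions: split_id_label_tolerant is a hypothetical name, not the function in scripts/create_ontology.py, and it assumes the ID is the first token of the cell and never contains whitespace.

```python
def split_id_label_tolerant(id_n_label: str) -> tuple[str, str]:
    """Split a cell into (id, label), accepting a space or a line break as separator.

    Hypothetical helper: assumes the ID is the first whitespace-free token
    and that everything after it belongs to the label.
    """
    # Split once, on the first run of whitespace (space, tab or line break).
    parts = id_n_label.split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"cannot split cell into ID and label: {id_n_label!r}")
    cell_id, raw_label = parts
    # Collapse line breaks and repeated spaces inside the label to single spaces.
    label = " ".join(raw_label.split())
    return cell_id, label
```

With this sketch, both "11 Humanities" and "11\nHumanities" would yield ("11", "Humanities"). A word hyphenated across a line break would still come out with a stray space after the hyphen, so such cells would still need a manual fix in the CSV.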
SECTION: 2 Review Board
INDEX: 2 COL:Review Board CELL: 2.31
Agriculture, Forestry and Veterinary Medicine
id_n_label: 2.31
Agriculture, Forestry and Veterinary Medicine
id_n_label: 2.31
Agrar-, Forstwissenschaften
und Tiermedizin
Traceback (most recent call last):
File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 99, in <module>
cell_id_de, cell_label_de = split_id_label(id_n_label=cell_de)
File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 35, in split_id_label
id, label = id_n_label.split('\n')
ValueError: too many values to unpack (expected 2)
Issue: the German cell for 2.31 has 2 line breaks when it should have only 1, between the number and the term:
2.31
Agrar-, Forstwissenschaften
und Tiermedizin
Fix:
Search and replace in the CSV: "Agrar-, Forstwissenschaften\nund Tiermedizin" replaced with "Agrar-, Forstwissenschaften und Tiermedizin"
commit: 3c4448653f8ed0a5570a53c66a7675b7194b6088
Issue:
SECTION: 1 Subject Area
INDEX: 1 COL:Subject Area CELL: 34
Geosciences
id_n_label: 34
Geosciences
id_n_label: 34
Geowissen-
schaften
Traceback (most recent call last):
File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 99, in <module>
cell_id_de, cell_label_de = split_id_label(id_n_label=cell_de)
File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 35, in split_id_label
id, label = id_n_label.split('\n')
ValueError: too many values to unpack (expected 2)
Same issue as the previous one: an extra line break inside the label.
Fix:
Remove the line break inside the label.
"34\nGeowissen-\nschaften "
replaced with "34\nGeowissen-schaften "
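Since each run of the script stops at the first bad cell, it may help to list all malformed cells in one pass before running it. The following is a minimal sketch, not part of the repository: find_malformed_cells is a hypothetical helper, and it assumes that every cell in the checked columns should have the form ID, one line break, label. Which columns follow that form (here Subject Area and Fachgebiet) is also an assumption.

```python
import csv
import io


def find_malformed_cells(csv_text: str, columns: list[str]) -> list[tuple[int, str, str]]:
    """Return (record_number, column, cell) for cells that are not 'ID<line break>label'.

    record_number counts data records from 1; it is not a physical line
    number, because quoted cells can span several lines.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for record_number, row in enumerate(reader, start=1):
        for column in columns:
            cell = row.get(column) or ""
            if not cell:
                continue  # empty cells are skipped, not reported
            if len(cell.split("\n")) != 2:
                problems.append((record_number, column, cell))
    return problems


# Sample built only from cell values quoted in this thread.
sample = (
    "Subject Area,Fachgebiet\n"
    '"11 Humanities",\n'
    '"34\nGeosciences","34\nGeowissen-\nschaften "\n'
)
for record_number, column, cell in find_malformed_cells(sample, ["Subject Area", "Fachgebiet"]):
    print(record_number, column, repr(cell))
```

For a real file, the text could be read with open(path, newline="", encoding="utf-8").read() and passed in the same way; the encoding is an assumption as well.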
It is likely that there will be some bugs in the CSV, mostly label_formatting, label_whitespace and annotation_whitespace errors, which will be detected by ROBOT in the ontology testing GitHub Actions.
The recommendation is to fix these errors in the CSV (the working document). I use a plain-text/code editor to do those edits, as it lets me see the text better and use search and replace. The following commits are examples of these edits:
https://github.com/tibonto/DFG-Fachsystematik-Ontology/pull/16/commits/6130123dfc9864203a9649b349cee86a4afe90ab
https://github.com/tibonto/DFG-Fachsystematik-Ontology/pull/16/commits/f03a411d1287464a16d7ce04ee1bd088c983d107
https://github.com/tibonto/DFG-Fachsystematik-Ontology/pull/16/commits/f7e0bb33fede3f1a5e2e6bb11f8c882027afcc55
Example ROBOT errors:
ERROR label_formatting https://github.com/tibonto/dfgfo/2.23-10 rdfs:label Clinical Psychiatry, Psychotherapy,
ERROR label_whitespace https://github.com/tibonto/dfgfo/1.13 rdfs:label Art History, Music, Theatre and Media Studies@en
ERROR label_whitespace https://github.com/tibonto/dfgfo/1.16-01 rdfs:label Social and Cultural Anthropology and Ethnology @en
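Labels could also be checked locally before pushing, so that fewer of these errors reach the GitHub Actions run. The sketch below is a rough approximation, not ROBOT's own logic: label_problems is a hypothetical helper, and it assumes that label_formatting refers to line breaks or tabs inside a label and label_whitespace to leading, trailing or doubled spaces. ROBOT's actual report queries are authoritative and may differ.

```python
import re


def label_problems(label: str) -> list[str]:
    """Return rough, ROBOT-like problem names for one label (hypothetical helper)."""
    problems = []
    # Assumed meaning of label_formatting: line break or tab inside the label.
    if re.search(r"[\n\r\t]", label):
        problems.append("label_formatting")
    # Assumed meaning of label_whitespace: leading, trailing or doubled spaces.
    if label != label.strip() or "  " in label:
        problems.append("label_whitespace")
    return problems
```

Under these assumptions, the label "Social and Cultural Anthropology and Ethnology " with its trailing space would be reported as label_whitespace, and a clean label returns an empty list.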
@andrecastro0o I started to add a readme for this on branch 13, I will continue this with your info from this issue. Probably not today, though.
@SArndt-TIB shall we close this issue? If anything more comes up related to it, we can re-open it or create a new one.
Issue addressed in #16
The repo currently does not explain how the CSV file is created, e.g. whether it results from an automated process.