tibonto / DFG-Fachsystematik-Ontology

DFG Fachsystematik Ontology - DFG Classification of Scientific Disciplines, Research Areas, Review Boards and Subject Areas

Explain origin and generation of CSV/ automate procedure #14

Closed SArndt-TIB closed 4 months ago

SArndt-TIB commented 5 months ago

The repo currently does not explain how the CSV file is created, e.g. whether it results from an automated process.

andrecastro0o commented 5 months ago

@SArndt-TIB I will try to describe the steps I think are needed to create a CSV from the .xlsx files

In each of the Excel (.xlsx) files

  1. remove rows 1,2 (title, empty)
  2. remove empty columns A and G
  3. save both DE and EN sheets into a CSV (to allow the following operations)
     3.1 CSV export: check "Quote all text cells" so that we avoid issues with commas within the cells
     3.2 from this point onward we shall only work on the CSVs and not the .xlsx
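To illustrate why "Quote all text cells" matters, here is a minimal sketch using only Python's standard `csv` module. It is not part of the repo; `rows_to_csv` and `csv_to_rows` are hypothetical helper names, and the sample cells are only illustrative. With every cell quoted, commas and line breaks inside a cell survive a round trip.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise rows to CSV text, quoting every cell (like "Quote all text cells")."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Illustrative cells: the first contains commas and a line break.
sample = [["2.31\nAgriculture, Forestry and Veterinary Medicine", "11\nHumanities"]]

exported = rows_to_csv(sample)
# The round trip keeps each cell intact despite the embedded commas and line breaks.
assert csv_to_rows(exported) == sample
```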

In the CSVs (easier to edit and see in plain sight):

  1. add headers for columns A, B. EN: Subject Number, Subject. DE: Fachnummer, Fach
  2. add to header (row 1) "Subject Area" and "Scientific Discipline" in columns D, E
  3. remove header rows (except row 1): 57, 137, 169
  4. remove empty rows (search in column A)
  5. fill in the missing values (in the Review Board, Subject Area and Scientific Discipline columns). This is tedious but important, as we cannot rely on merged cells in the CSV, and it is at the core of the tree structure (@SArndt-TIB let me know if this needs clarification)
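Steps 4 and 5 could probably be scripted. The following is only a sketch of the idea, not something that exists in the repo: `drop_empty_rows` and `fill_down` are hypothetical helpers, and it assumes the rows have already been read into dictionaries (e.g. with `csv.DictReader`) and that a blank cell always means "same value as the row above", which is what the merged cells in the .xlsx implied.

```python
def drop_empty_rows(rows, key):
    """Keep only rows whose `key` column (e.g. column A) is non-empty."""
    return [row for row in rows if (row.get(key) or "").strip()]


def fill_down(rows, columns):
    """Copy the last seen value into blank cells of the given columns."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)  # do not modify the caller's rows
        for column in columns:
            value = (new_row.get(column) or "").strip()
            if value:
                last_seen[column] = new_row[column]
            elif column in last_seen:
                new_row[column] = last_seen[column]
        filled.append(new_row)
    return filled


# Placeholder values, only to show the mechanics.
rows = [
    {"Subject Number": "A-01", "Review Board": "B1"},
    {"Subject Number": "", "Review Board": ""},
    {"Subject Number": "A-02", "Review Board": ""},
    {"Subject Number": "A-03", "Review Board": "B2"},
]
cleaned = fill_down(drop_empty_rows(rows, "Subject Number"), ["Review Board"])
```

The result should still be checked by eye: if a value from a merged cell happens to sit on a row that step 4 removes, it is lost before it can be filled down.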

join both CSVs
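One possible way to do the join, as a sketch only: it assumes the EN and DE rows can be matched on the subject number ("Subject Number" in EN, "Fachnummer" in DE) and that DE columns whose names clash with EN ones get a " (DE)" suffix. Both the helper `join_en_de` and that suffix are hypothetical; the actual layout of the joined CSV in the repo may differ.

```python
def join_en_de(en_rows, de_rows, en_key="Subject Number", de_key="Fachnummer"):
    """Merge EN and DE rows that share the same subject number."""
    de_by_number = {row[de_key]: row for row in de_rows}
    joined = []
    for en_row in en_rows:
        number = en_row[en_key]
        if number not in de_by_number:
            raise KeyError(f"no DE row for subject number {number!r}")
        merged = dict(en_row)
        for name, value in de_by_number[number].items():
            # Assumption: clashing DE column names get a " (DE)" suffix.
            target = name if name not in merged else f"{name} (DE)"
            merged[target] = value
        joined.append(merged)
    return joined
```

Raising on a missing DE row is deliberate: a silent mismatch between the two sheets would be harder to spot later in the ontology.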

andrecastro0o commented 5 months ago

Bugs in CSV - manual clean up needed

detected by (breaking) scripts/create_ontology.py

Error in 11 Humanities

python scripts/create_ontology.py csv/2024-2028/Fachsystematik_2024-2028.csv

SECTION: 0 Scientific Discipline
INDEX: 0 COL:Scientific Discipline CELL: 1 
Humanities and Social Sciences
CELL ID: <<<<1>>>
CURRENT: 1 - Humanities and Social Sciences
PARENT: <<<None>>>
Class: https://github.com/tibonto/dfgfo/1 labels: ['Humanities and Social Sciences', 'Geistes- und Sozialwissenschaften']

SECTION: 1 Subject Area
INDEX: 1 COL:Subject Area CELL: 11 Humanities
Traceback (most recent call last):
  File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 96, in <module>
    cell_id, cell_label = split_id_label(id_n_label=row[tree_hierarchy[index]])
  File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 34, in split_id_label
    id, label = id_n_label.split('\n')
ValueError: not enough values to unpack (expected 2, got 1)

Issue: Unlike other "Subject Area" values, which separate the number and the subject with a line break, 11 Humanities uses only a space as separator.

Fix: Search & Replace in the CSV: "11 Humanities" replaced with "11\nHumanities"

Commit: b857e8c8dfb980fb2407a8b3d92bd6cb64d67fc9
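Instead of finding these cells one traceback at a time, the CSV could be scanned up front. This is a rough sketch, not part of scripts/create_ontology.py: `find_bad_cells` is a hypothetical helper, and the column names are the ones that appear in the script output above (the names of the DE columns are not shown in this issue, so they would have to be passed in). It reports every cell that does not split into exactly two parts on a line break, which is the condition the script trips over.

```python
import csv
import io

# Column names as they appear in the script output; adjust as needed.
HIERARCHY_COLUMNS = ["Scientific Discipline", "Subject Area", "Review Board"]


def find_bad_cells(csv_text, columns=HIERARCHY_COLUMNS):
    """Return (record number, column, cell) for cells not shaped like 'ID\\nLabel'."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Records are counted after the header; a record may span several
    # physical lines when a cell contains line breaks.
    for record_no, row in enumerate(reader, start=1):
        for column in columns:
            cell = row.get(column) or ""
            if len(cell.split("\n")) != 2:
                problems.append((record_no, column, cell))
    return problems
```

For a file on disk, the text can be read with `open(path, newline="", encoding="utf-8").read()` and passed in.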

Error on 2.31

SECTION: 2 Review Board
INDEX: 2 COL:Review Board CELL: 2.31
Agriculture, Forestry and Veterinary Medicine
id_n_label: 2.31
Agriculture, Forestry and Veterinary Medicine
id_n_label: 2.31
Agrar-, Forstwissenschaften 
und Tiermedizin
Traceback (most recent call last):
  File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 99, in <module>
    cell_id_de, cell_label_de = split_id_label(id_n_label=cell_de)
  File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 35, in split_id_label
    id, label = id_n_label.split('\n')
ValueError: too many values to unpack (expected 2)

Issue: 2.31 has 2 line breaks when it should have only 1, between number and term:

2.31
Agrar-, Forstwissenschaften 
und Tiermedizin

Fix: Search & Replace in the CSV: "Agrar-, Forstwissenschaften\nund Tiermedizin" replaced with "Agrar-, Forstwissenschaften und Tiermedizin"

commit: 3c4448653f8ed0a5570a53c66a7675b7194b6088

Error on 34 Geowissen-schaften

Issue:

SECTION: 1 Subject Area
INDEX: 1 COL:Subject Area CELL: 34
Geosciences 
id_n_label: 34
Geosciences 
id_n_label: 34
Geowissen-
schaften 
Traceback (most recent call last):
  File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 99, in <module>
    cell_id_de, cell_label_de = split_id_label(id_n_label=cell_de)
  File "/home/acastro/Documents/external_projects/DFG-Fachsystematik-Ontology/scripts/create_ontology.py", line 35, in split_id_label
    id, label = id_n_label.split('\n')
ValueError: too many values to unpack (expected 2)

Same issue as the previous one: an extra line break between words.

Fix:

Remove line break between words.

"34\nGeowissen-\nschaften " replaced with "34\nGeowissen-schaften "
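An alternative to cleaning every such cell by hand would be a more forgiving split. This is only a sketch of what that could look like, not the `split_id_label` from scripts/create_ontology.py; `split_id_label_tolerant` is a hypothetical name. It splits the ID from the label on the first run of whitespace (space or line break) and then merges any remaining line breaks in the label, joining without a space when a line ends in a hyphen.

```python
def split_id_label_tolerant(id_n_label):
    """Split 'ID<whitespace>Label' and merge extra line breaks inside the label."""
    parts = id_n_label.strip().split(None, 1)  # split on the first whitespace run
    if len(parts) != 2:
        raise ValueError(f"cannot split id and label: {id_n_label!r}")
    cell_id, raw_label = parts
    lines = [line.strip() for line in raw_label.splitlines() if line.strip()]
    label = lines[0]
    for line in lines[1:]:
        # A trailing hyphen is treated as a word broken across lines.
        label += line if label.endswith("-") else " " + line
    return cell_id, label


# The three cells from this comment:
assert split_id_label_tolerant("11 Humanities") == ("11", "Humanities")
assert split_id_label_tolerant("2.31\nAgrar-, Forstwissenschaften \nund Tiermedizin") == (
    "2.31",
    "Agrar-, Forstwissenschaften und Tiermedizin",
)
assert split_id_label_tolerant("34\nGeowissen-\nschaften ") == ("34", "Geowissen-schaften")
```

The hyphen rule is a guess: a German label where a line legitimately ends in a hyphen followed by a separate word would be joined wrongly, so fixing the CSV itself remains the safer route.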

Errors detected by ROBOT: label and annotation formatting

It is likely that there will be some bugs in the CSV, mostly label_formatting, label_whitespace and annotation_whitespace errors, which will be detected by ROBOT in the Ontology testing GitHub Actions.

The recommendation is to fix these errors in the CSV (the working document). I use a plain-text/code editor to do those edits, as it lets me see the text better and use search and replace. The following commits are examples of these edits:

https://github.com/tibonto/DFG-Fachsystematik-Ontology/pull/16/commits/6130123dfc9864203a9649b349cee86a4afe90ab
https://github.com/tibonto/DFG-Fachsystematik-Ontology/pull/16/commits/f03a411d1287464a16d7ce04ee1bd088c983d107
https://github.com/tibonto/DFG-Fachsystematik-Ontology/pull/16/commits/f7e0bb33fede3f1a5e2e6bb11f8c882027afcc55

Example ROBOT errors:

ERROR   label_formatting    https://github.com/tibonto/dfgfo/2.23-10    rdfs:label  Clinical Psychiatry, Psychotherapy, 
ERROR   label_whitespace    https://github.com/tibonto/dfgfo/1.13   rdfs:label   Art History, Music, Theatre and Media Studies@en
ERROR   label_whitespace    https://github.com/tibonto/dfgfo/1.16-01    rdfs:label  Social and Cultural Anthropology and Ethnology @en
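
Most of these could be caught before ROBOT runs. A minimal sketch, assuming (my reading of the check names, not verified against the ROBOT documentation) that the checks complain about line breaks or tabs inside labels and about leading, trailing or repeated whitespace. `clean_label` and `needs_cleaning` are hypothetical helpers, not part of the repo.

```python
import re


def clean_label(label):
    """Collapse every whitespace run (spaces, tabs, line breaks) and trim the ends."""
    return re.sub(r"\s+", " ", label).strip()


def needs_cleaning(label):
    """True if the label would be changed by clean_label."""
    return label != clean_label(label)


# Labels taken from the ROBOT output above.
assert needs_cleaning(" Art History, Music, Theatre and Media Studies")
assert clean_label("Social and Cultural Anthropology and Ethnology ") == (
    "Social and Cultural Anthropology and Ethnology"
)
```

Running `needs_cleaning` over the label cells of the CSV would give a list of candidates to fix by hand in the editor.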
SArndt-TIB commented 5 months ago

@andrecastro0o I started to add a readme for this on branch 13, I will continue this with your info from this issue. Probably not today, though.

andrecastro0o commented 4 months ago

@SArndt-TIB shall we close this issue? If anything else related comes up, we can re-open it or create a new one.

andrecastro0o commented 4 months ago

Issue addressed in #16