huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

Updates on the 2001 Huridocs microthesauri (English and translations) #7431

Open mayeulk opened 3 weeks ago

mayeulk commented 3 weeks ago

Hi! I found two digital sources (ready to import into Uwazi or another database) of the 2001 Huridocs microthesauri.

1) The GoogleSheets listed at: https://huridocs.org/resource-library/monitoring-and-documenting-human-rights-violations/microthesauri/

(this is used by https://github.com/huridocs/uwazi/pull/2423)

2) https://github.com/huridocs/OpenEvSys/blob/master/schema/mysql-dbpopulate-mt.sql

(These are in addition to the microthesauri PDF files in various languages, such as https://huridocs.org/wp-content/uploads/2020/12/mictheenfinalpdf.pdf, which are not easy to import into a DB.)

Are there more recent versions of these microthesauri? (For English and/or translations). I think there are issues with these, some of the issues being:

Or: 10101050000 Murder (deliberate killing which ought to be seen as a common criminal act), in French "Meurtre (assassinat délibéré devant être regardé comme un acte criminel normal)". "normal" is strange here (in French, it can mean "an acceptable behaviour") and could be incorrect; possibly it should rather be "crime de droit commun"? (See e.g.: https://www.linguee.com/english-french/search?source=auto&query=common+crime.) Here "common" means not covered by specialized law (such as the specialized law for terrorist crimes).

In a single Russian cell, there are two languages and a note saying " (translation missing)" (actually, the translation is not missing): "Опубликование закона или политики, снижающих гарантии свободы веру/мнения Promulgation of law or policy which reduces guarantees for freedom of belief/opinion (translation missing)" (sic)

I saw some details on how to understand some terms in mictheenfinalpdf.pdf, but they are not always sufficient, and they are not organised in a way that helps put them into Uwazi or another database.

If there are no more recent versions (yet), is there a place/process for people to contribute? (I ask because the OpenEvSys GitHub repo is now read-only.)

Thank you. Mayeul Kauffmann

mayeulk commented 3 weeks ago

Related question: are these two identical (except for the number of translations)?

1) The GoogleSheets listed at: https://huridocs.org/resource-library/monitoring-and-documenting-human-rights-violations/microthesauri/

2) The code at: https://github.com/huridocs/OpenEvSys/blob/master/schema/mysql-dbpopulate-mt.sql

mayeulk commented 3 weeks ago

Another issue, seen in GoogleSheet "27-Types of Responses" https://docs.google.com/spreadsheets/d/1cETLZGG9gmdkvCfHlGQVzKBX8pFheLs8-MlzmHm5IQk/edit?gid=0#gid=0: it wrongly states "This is a short list with one level", while https://huridocs.org/wp-content/uploads/2020/12/mictheenfinalpdf.pdf correctly states "This is a hierarchical list with three levels". (The rest of the description on row 1 of the GoogleSheets seems identical to the PDF.)

From what I can see, some manual conversion and editing has been used to convert the PDF into the various GoogleSheets. E.g. in sheet "18-Degrees of Involvement": "THis is a hierarchical list", with an upper-case H in "THis", which indicates that the word was retyped (the case in the PDF is correct: "This").

In "25-Status as Victim" https://docs.google.com/spreadsheets/d/1QbJGS27Sme3DjLXvfEW0bQsBAkRCr7l7hsDioSXjhYU/edit?gid=0#gid=0: "abou the latest or present status" ("abou" without 't'), whereas mictheenfinalpdf.pdf reads "about".

Retyping text is inherently prone to errors (as compared to semi-automated conversion with strict quality control steps).
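A quality-control step of this kind can be quite small. The sketch below (Python, standard library only) compares a description taken from the PDF with the one in row 1 of a sheet and lists the words that differ. The function name `word_differences` is made up for illustration, and it assumes both texts have already been extracted as plain strings.

```python
import difflib


def word_differences(reference: str, candidate: str) -> list[tuple[str, str]]:
    """Return (reference_words, candidate_words) pairs where the texts differ."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing word runs from both sides for manual review.
            differences.append(
                (" ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2]))
            )
    return differences


# Strings quoted earlier in this thread (PDF first, GoogleSheet second).
print(word_differences("This is a hierarchical list",
                       "THis is a hierarchical list"))
print(word_differences("about the latest or present status",
                       "abou the latest or present status"))
```

Comparing word by word rather than character by character keeps the report short enough to review by hand.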

Sometimes, row 1 of a GoogleSheet has the full text of the PDF (description of type), sometimes only a small part of it (e.g. "4-Types of Acts" has one third of the PDF description).

mayeulk commented 3 weeks ago

In sheet "10-Occupations (ILO Categories)" https://docs.google.com/spreadsheets/d/1VgdQeQ_RtZkYvy71-9IOsnWZk74tk7RFR0CdBFErhaI/edit?gid=0#gid=0 many Spanish translations are in the wrong row: cell D206 "Trabajadores voluntarios en cooperativas" seems to be an alternate translation for cell B227 "Volunteer worker in co-operative". As a result, all Spanish translations between D206 and D227 are off by one row (e.g. B213 = Other, D214 = Otros).

mayeulk commented 3 weeks ago

Sheet "24-Types of Perpetrators" https://docs.google.com/spreadsheets/d/1l3gxmX-T2bpFs0fV1ZjVTvZzHMyLgzsiu9ZGsmsGsu0/edit?gid=0#gid=0 English "Spouses" is incorrectly translated into French as "Epouses" (which means: "wives", that is "female spouses"; a spouse can be wife or husband). Same in 3-Rights Typology: Equality of spouses (Right to) / Egalité des épouses (droit à l')

Wrong language in a translation field is frequent in several sheets. For instance, in sheet 6 (codes 080200000000, 160200000000 and 170700000000) and in sheet 14 (code 980000000000), the "Spanish" field is in French.

In sheet 14, code 980000000000, the "Indonesian" field is in Arabic.
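Some of these cases could be flagged automatically by checking which writing system a cell uses. The sketch below is only a rough heuristic under stated assumptions: the `EXPECTED_SCRIPT` table and both function names are hypothetical, and the script is guessed from the first word of each character's Unicode name. It would catch Arabic text in an "Indonesian" field, or the mixed Russian/English cell mentioned earlier, but it cannot tell French from Spanish, since both use the Latin script.

```python
import unicodedata

# Hypothetical mapping from a column's language to the script it should use.
EXPECTED_SCRIPT = {
    "English": "LATIN",
    "French": "LATIN",
    "Spanish": "LATIN",
    "Indonesian": "LATIN",
    "Russian": "CYRILLIC",
    "Arabic": "ARABIC",
}


def scripts_in(text: str) -> set[str]:
    """Return the scripts of the letters in text (e.g. {"LATIN", "CYRILLIC"})."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
                scripts.add(name.split()[0])
    return scripts


def unexpected_scripts(language: str, text: str) -> set[str]:
    """Return the scripts found in text that the language column should not contain."""
    return scripts_in(text) - {EXPECTED_SCRIPT[language]}


# Excerpt of the Russian cell quoted earlier in this thread.
cell = "свободы веру/мнения Promulgation of law or policy"
print(unexpected_scripts("Russian", cell))
```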

mayeulk commented 3 weeks ago

Importing microthesauri could be made easier by improving the page https://huridocs.org/resource-library/monitoring-and-documenting-human-rights-violations/microthesauri/. For example, "Types of Perpetrators" is made of three links, attached to the strings "T", "y" and "pes of Perpetrators".

HTML source code: <a href="https://docs.google.com/spreadsheets/d/1l3gxmX-T2bpFs0fV1ZjVTvZzHMyLgzsiu9ZGsmsGsu0/edit?usp=sharing">T</a><a href="https://docs.google.com/spreadsheets/d/1l3gxmX-T2bpFs0fV1ZjVTvZzHMyLgzsiu9ZGsmsGsu0/edit?usp=sharing" target="_blank" rel="noreferrer noopener">y</a><a href="https://docs.google.com/spreadsheets/d/1l3gxmX-T2bpFs0fV1ZjVTvZzHMyLgzsiu9ZGsmsGsu0/edit?usp=sharing">pes of Perpetrators</a>
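Until the page is fixed, anyone scraping it to collect the sheet URLs could work around the split links by merging consecutive anchors that point to the same address. This is a minimal sketch using Python's standard `html.parser`. The class and function names are made up for illustration, and it assumes that only non-whitespace text between two anchors should stop them from being merged.

```python
from html.parser import HTMLParser


class _LinkMerger(HTMLParser):
    """Collect links, merging consecutive <a> elements that share an href."""

    def __init__(self):
        super().__init__()
        self.links = []          # list of [href, text]
        self._in_link = False
        self._adjacent = False   # True while only whitespace followed the last </a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        same_target = bool(self.links) and self.links[-1][0] == href
        if not (self._adjacent and same_target):
            self.links.append([href, ""])
        self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
            self._adjacent = True

    def handle_data(self, data):
        if self._in_link:
            self.links[-1][1] += data
        elif data.strip():
            # Visible text between two anchors: treat them as separate links.
            self._adjacent = False


def merged_links(html: str) -> list[tuple[str, str]]:
    parser = _LinkMerger()
    parser.feed(html)
    parser.close()
    return [(href, text) for href, text in parser.links]


URL = ("https://docs.google.com/spreadsheets/d/"
       "1l3gxmX-T2bpFs0fV1ZjVTvZzHMyLgzsiu9ZGsmsGsu0/edit?usp=sharing")
snippet = (f'<a href="{URL}">T</a>'
           f'<a href="{URL}" target="_blank" rel="noreferrer noopener">y</a>'
           f'<a href="{URL}">pes of Perpetrators</a>')
print(merged_links(snippet))
```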

mayeulk commented 3 weeks ago

In the GoogleSheets, in most cases the first column is labelled "huri_code" (singular), except:

Microthesaurus 2 (2-Violations Typology). Header of first column is: huri_codes

Microthesaurus 10 (10-Occupations (ILO Categories)). Header of first column is: huri_codes

Microthesaurus 15 (15-Geographical Terms). Header of first column is: "huri_codes (add 0 in the beginning)"
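Until the headers are made consistent, an import script could tolerate the three variants listed above. A minimal sketch in Python follows. The function name `normalize_header` is hypothetical, and the pattern covers only the variants reported here.

```python
import re

# Matches "huri_code", "huri_codes" and "huri_codes (add 0 in the beginning)".
CODE_HEADER = re.compile(r"^\s*huri_codes?\b", re.IGNORECASE)


def normalize_header(header: str) -> str:
    """Map every variant of the code column header to "huri_code"."""
    if CODE_HEADER.match(header):
        return "huri_code"
    return header.strip()


for header in ["huri_code", "huri_codes", "huri_codes (add 0 in the beginning)"]:
    print(repr(header), "->", normalize_header(header))
```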

mayeulk commented 3 weeks ago

In most microthesauri, huri_code for "other" is: "900000000000"

For GoogleSheet 3-Rights Typology, huri_code for "other" is: "090000000000" (with a leading zero, and 10 zeroes on the right, not 11), while the PDF says (page 4): "Finally, HURIDOCS has assigned the code 90 and the term "Other"..."

In GoogleSheet "19-Source Connection to Information", "Other" is missing, although it could be helpful. (Note: "Other" is also missing in 7, 26, 41, 44, 45 and 48, but these seem fine to me, as in theory there can't be an 'other' in these cases.)
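Checks like these could be run over every sheet before import. The sketch below is only an illustration: the function name and the `(huri_code, English term)` row layout are assumptions, and it takes "900000000000" as the expected code for "Other" because that is what most of the sheets use.

```python
OTHER_CODE = "900000000000"  # code used for "Other" in most microthesauri


def other_code_problems(rows: list[tuple[str, str]]) -> list[str]:
    """Report a missing or unexpectedly coded "Other" entry.

    rows: (huri_code, English term) pairs, with codes kept as strings
    so that leading zeros are preserved.
    """
    codes = [code for code, term in rows if term.strip().lower() == "other"]
    if not codes:
        return ["no 'Other' entry"]
    return [f"unexpected code for 'Other': {code}"
            for code in codes if code != OTHER_CODE]


# The code reported above for sheet "3-Rights Typology".
print(other_code_problems([("090000000000", "Other")]))
```

A "no 'Other' entry" report still needs human review, since some lists are closed by design.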