ksylva / PDL_2020-2021_GR6

Wikipedia Matrix

Encoding issues #13

Open ksylva opened 3 years ago

ksylva commented 3 years ago

Some accented characters are not extracted correctly. For example, ĉ, û, ... are extracted as ?.

TheoEssoh commented 3 years ago

We do not have this character problem under Linux, but we noticed it on several Windows computers. Under Windows, I tried to solve the problem in three different ways. First, after extracting the data, I manually replaced the question marks with the correct characters. The tests pass, but it is a bad solution. Second, I replaced the content of the witness file with the content of the extracted file, question marks included. The tests pass, but it is also a very bad solution. Third, I tried setting an appropriate file encoding. I tried many encodings (UTF-8, UTF-16, Windows-1252, US-ASCII, ISO-8859-1, ...), but the character problem was not resolved.
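For anyone reproducing this, here is a minimal sketch of one way question marks can appear. It rests on an assumption not confirmed in this thread: that somewhere in the extraction the text is encoded with a single-byte charset such as Windows-1252, which cannot represent ĉ or ア. Python is used only for illustration, and `encode_lossy` and `round_trip_utf8` are hypothetical helper names, not functions from the project.

```python
import os
import tempfile

# Sample characters taken from this issue.
SAMPLE = "ĉ, ĝ, ĥ, ĵ, ŝ, ŭ, ア"


def encode_lossy(text, encoding):
    """Encode with a charset that may not cover the text, then decode again.

    errors="replace" substitutes "?" for every character the charset
    cannot represent, which is the symptom reported above.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


def round_trip_utf8(text):
    """Write and read a temporary file with UTF-8 set explicitly on both sides."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8", newline="") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(encode_lossy(SAMPLE, "cp1252"))      # characters lost
    print(round_trip_utf8(SAMPLE) == SAMPLE)   # characters preserved
```

If this assumption holds, changing the encoding of the output file alone would not help, because the characters would already be lost earlier in the pipeline. The encoding would have to be set explicitly at every read and write step rather than left to the platform default.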

TheoEssoh commented 3 years ago

Characters such as ĉ, ĝ, ĥ, ĵ, ŝ, ŭ, ア, ... are replaced by ?, ?, ?, ?, ?, ?, ?, .... Also, some expressions have been modified on the site. For example, the expression « International Phonetic Alphabet » was replaced by « IPA ».

TheoEssoh commented 3 years ago

The chosen solution is to update the witness file manually by looking at the page in question, because the pages have been modified.