kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Source files with wrong encoding #6324

Open stweil opened 1 week ago

stweil commented 1 week ago

The latest code contains two property files with ISO-8859-1 encoding:

They were found with find * -type f | xargs file --mime | grep iso-8859-1.

Here is a complete list of all encodings in the current directory:

% find * -type f | xargs file --mime | sed 's/.*charset/charset/' | sort | uniq -c
 115 charset=binary
   2 charset=iso-8859-1
1511 charset=us-ascii
 195 charset=utf-8

So there are already nearly 200 files with UTF-8 encoding (which is fine), and the two files mentioned above are the only ones with wrong encoding.

henning-gerhardt commented 1 week ago

You do know that file is not a good tool for determining the encoding? It only checks a small portion of a file's content, and after "detecting" the first non-ASCII encoding (or whatever the default is) it reports that one back, so other encodings in the same file may go undetected.

ISO-8859-1 may be correct for the German and Spanish resource files, as nothing more is needed to display the characters they use correctly. It can even show that the result of file is not always correct.
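
One way to sidestep the concern about file sampling only part of the content is to decode every file in full. The following is a minimal sketch in Python, not something taken from the Kitodo.Production code base: the names classify_bytes and census are hypothetical, and treating any file that contains a NUL byte as binary is an assumption made for illustration (it would misclassify UTF-16 text, for example).

```python
from collections import Counter
from pathlib import Path


def classify_bytes(data: bytes) -> str:
    """Classify content by strictly decoding ALL of it, not just a sample."""
    # Assumption for this sketch: a NUL byte means "binary".
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Invalid as UTF-8: may be ISO-8859-1 or any other 8-bit encoding.
        # Which one cannot be decided from the bytes alone.
        return "not-utf-8"


def census(root: str) -> Counter:
    """Count the classification of every regular file below root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[classify_bytes(path.read_bytes())] += 1
    return counts


if __name__ == "__main__":
    for label, count in sorted(census(".").items()):
        print(f"{count:5d} {label}")
```

Unlike file, this cannot name the 8-bit encoding of a file that is not valid UTF-8; it only reports that strict UTF-8 decoding failed somewhere in the file, even if the offending byte is the very last one.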

stweil commented 1 week ago

@matthias-ronge, @joergleh, you contributed the ISO-8859-1 encodings (#5214, #5903). Did you test the messages in Kitodo.Production? Did they look correct in the frontend? Would they look different with UTF-8 encoding?

henning-gerhardt commented 1 week ago

> @matthias-ronge, @joergleh, you contributed the ISO-8859-1 encodings (#5214). Did you test the messages in Kitodo.Production? Did they look correct in the frontend? Would they look different with UTF-8 encoding?

This is not an issue, as the files are read as UTF-8 files, and as ISO-8859-1 is part of UTF-8 there should be no display issues.

stweil commented 1 week ago

> This is not an issue, as the files are read as UTF-8 files, and as ISO-8859-1 is part of UTF-8 there should be no display issues.

I'd prefer a test to verify your claim. Only the first 128 characters (ASCII) are identical in both encodings, so there will be differences for umlauts and Spanish characters which are not part of ASCII.
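
The byte-level difference can be illustrated with a short Python sketch. The sample string is made up for illustration and is not taken from the property files, and the sketch says nothing about how Kitodo.Production actually loads its message files, so it does not settle whether the frontend is affected.

```python
# Hypothetical sample: one German and one Spanish non-ASCII character.
text = "ü ñ"

latin1 = text.encode("iso-8859-1")  # one byte per character
utf8 = text.encode("utf-8")         # two bytes for each of ü and ñ

# Pure ASCII text is byte-identical in both encodings.
ascii_identical = "abc".encode("iso-8859-1") == "abc".encode("utf-8")

# Reading ISO-8859-1 bytes as strict UTF-8 raises an error.
try:
    latin1.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# A lenient reader substitutes U+FFFD (the replacement character) instead.
lenient = latin1.decode("utf-8", errors="replace")

# The reverse mistake, UTF-8 bytes read as ISO-8859-1, yields mojibake.
mojibake = utf8.decode("iso-8859-1")

print(len(latin1), len(utf8))
print(ascii_identical, strict_utf8_ok)
print(repr(lenient), repr(mojibake))
```

A reader that falls back to ISO-8859-1 when UTF-8 decoding fails would still show such a file correctly; whether the application does that is exactly what a test in the frontend would establish.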

danilopenagos commented 3 days ago

> @matthias-ronge, @joergleh, you contributed the ISO-8859-1 encodings (#5214, #5903). Did you test the messages in Kitodo.Production? Did they look correct in the frontend? Would they look different with UTF-8 encoding?

Hi, everyone! The Spanish frontend displays all Spanish characters and messages correctly in the version we are currently working with!

stweil commented 3 days ago

That's surprising. I had expected this subset of Spanish messages not to be shown correctly.

danilopenagos commented 3 days ago

> That's surprising. I had expected this subset of Spanish messages not to be shown correctly.

These messages are shown in English. Our version is 3.5. I don't know which file encoding (ISO-8859-1 or UTF-8) this version uses. [image]

stweil commented 3 days ago

Strange again. Release 3.5.0 should contain the Spanish translation. Are you running a local build or an official release from GitHub?

solth commented 2 days ago

It may be that the installation @danilopenagos refers to uses a custom messages directory containing outdated property files that do not contain the Spanish translations in question and that are used instead of the message files distributed with the official 3.5.0 release. That should be checked.