the-siesta-group / edfio

Read and write EDF/EDF+ files.
Apache License 2.0

Replace invalid characters in string decoding #4

cbrnr closed this 8 months ago

cbrnr commented 8 months ago

According to the EDF standard, header fields can only contain (printable) ASCII characters. edfio relaxes this constraint to UTF-8 upon reading, which enables the package to read non-standard EDF files (which are very common in the wild).

However, people might still populate header fields using other encodings (such as Latin-1) whose bytes are not valid UTF-8. For example:

>>> "é".encode("latin1").decode()
UnicodeDecodeError

This PR adds the errors="replace" argument, so decoding no longer fails on bytes that are not valid UTF-8. Each invalid byte sequence is replaced with the Unicode replacement character � (U+FFFD):

>>> "é".encode("latin1").decode(errors="replace")
'�'
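
To make the difference concrete, here is a minimal sketch that compares strict and lenient decoding of a 16-byte header field. The field contents and the helper name `try_strict` are made up for illustration and are not part of edfio.

```python
from typing import Optional

# Illustrative 16-byte header fields: one written as Latin-1, one as UTF-8.
raw_latin1 = "é".ljust(16).encode("latin-1")
raw_utf8 = "é".encode("utf-8").ljust(16)  # bytes.ljust pads with ASCII spaces


def try_strict(raw: bytes) -> Optional[str]:
    """Hypothetical helper: strict UTF-8 decoding, None if the bytes are invalid."""
    try:
        return raw.decode("utf-8").rstrip()
    except UnicodeDecodeError:
        return None


# Strict decoding rejects the Latin-1 field entirely.
strict_latin1 = try_strict(raw_latin1)

# Lenient decoding keeps going and substitutes U+FFFD for the invalid byte.
lenient_latin1 = raw_latin1.decode("utf-8", errors="replace").rstrip()

# Valid UTF-8 input is unaffected by errors="replace".
lenient_utf8 = raw_utf8.decode("utf-8", errors="replace").rstrip()
```

Note that the original character cannot be recovered from the replacement character, so this trades fidelity for the ability to open the file at all.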

hofaflo commented 8 months ago

Nice idea, thanks @cbrnr!

To make sure we don't break this in the future, could you add a test case? I don't think we need to introduce a non-compliant test file just for this; something like this should be fine:

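import numpy as np
from edfio import EdfSignal
sig = EdfSignal(np.arange(2), 1)
# Overwrite the raw 16-byte label field with Latin-1 bytes that are invalid UTF-8.
sig._label = "è".ljust(16).encode("latin-1")
assert sig.label == "�"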

Also please add an entry to the changelog (and feel free to credit yourself in it)!

cbrnr commented 8 months ago

Done. I added a more direct test of decode_str(), which spares us from having to import EdfSignal and numpy.
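
The test that was actually added is not shown in this thread. A direct test along those lines might look roughly like the sketch below, which assumes that decode_str() takes the raw header bytes and returns the decoded text with trailing padding stripped. The function body here is a stand-in written for this example, not edfio's implementation, and the test name is hypothetical.

```python
def decode_str(field: bytes) -> str:
    # Stand-in for illustration only (assumed behavior, not edfio's code):
    # decode as UTF-8, replace invalid bytes with U+FFFD, strip space padding.
    return field.decode(errors="replace").rstrip()


def test_decode_str_replaces_invalid_bytes():
    # A Latin-1 encoded "è" is a single byte (0xE8) that is invalid in UTF-8.
    assert decode_str("è".ljust(16).encode("latin-1")) == "\ufffd"
    # Plain ASCII fields decode unchanged apart from the stripped padding.
    assert decode_str(b"EEG Fpz-Cz      ") == "EEG Fpz-Cz"


# pytest would collect the test automatically; call it directly when run as a script.
test_decode_str_replaces_invalid_bytes()
```

Testing the helper directly needs no signal object, which is why neither EdfSignal nor numpy has to be imported.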

hofaflo commented 8 months ago

Perfect, thanks @cbrnr!