inveniosoftware / invenio-previewer

Invenio module for previewing files.
https://invenio-previewer.readthedocs.io
MIT License

JSON: incorrect detection of UTF-8 #176

Closed: dfdan closed this issue 1 year ago

dfdan commented 1 year ago

Package version (if known): (current)

Describe the bug

JSON files that are UTF-8 encoded but contain only sparse non-ASCII characters are not reliably detected as UTF-8. This happens when the first non-ASCII character does not occur within the first PREVIEWER_CHARDET_BYTES bytes (only 1 kB by default), which is all the detection looks at:

https://github.com/inveniosoftware/invenio-previewer/blob/fc3e5d2656d7f503ee6d393567d6b1396fbf37db/invenio_previewer/utils.py#L27

Steps to Reproduce

  1. Attach a UTF-8 encoded JSON file larger than 1 kB in which only ASCII characters occur in the first 1 kB
  2. Attempt to preview
  3. Preview fails with a 500 error:
[2023-05-12 12:29:49,649] ERROR in app: Exception on /records/3c9xq-hsq89/preview/iiif_manifest.json [GET]
...
File "/srv/rdm/adc/invenio-previewer/invenio_previewer/extensions/json_prismjs.py", line 28, in render
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3244890: ordinal not in range(128)
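The failure mode can be illustrated without the previewer at all. The following is a minimal sketch using only the standard library; `CHARDET_BYTES` is a hypothetical stand-in for the `PREVIEWER_CHARDET_BYTES` setting, and the payload is made up for illustration. It does not call the detector itself, it only shows that the sampled prefix carries no evidence of UTF-8 while the full file cannot be decoded as ASCII.

```python
# Hypothetical stand-in for PREVIEWER_CHARDET_BYTES (1 kB per the report above).
CHARDET_BYTES = 1024

# A UTF-8 JSON payload whose first non-ASCII character lies past the sample window.
payload = ('{"padding": "' + "a" * 2000 + '", "name": "é"}').encode("utf-8")

# This is all a detector limited to the first CHARDET_BYTES bytes would see.
sample = payload[:CHARDET_BYTES]

# The sample is pure ASCII, so nothing in it hints at UTF-8.
sample_is_ascii = sample.isascii()

# Decoding the whole file with the encoding inferred from the sample fails.
error = None
try:
    payload.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(f"decode failed: {exc}")

# Decoding the same bytes as UTF-8 succeeds.
text = payload.decode("utf-8")
```

The first non-ASCII byte sits well beyond the sampled prefix, which is the situation in the traceback above, where the offending byte is at position 3244890.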

Expected behavior

Files containing only sparse non-ASCII characters should preview correctly.

I suggest that we fix utils.py to override an ASCII detection result with UTF-8. This should be safe?
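A rough sketch of what such an override might look like is shown here. The function name `normalize_detected_encoding` and its `default` parameter are hypothetical and not part of invenio-previewer; the real change would live wherever utils.py turns the detector's result into an encoding name. The reasoning behind it is that every valid ASCII byte sequence is also valid UTF-8 and decodes to the same text, so treating an "ascii" result as UTF-8 cannot break a file that really is pure ASCII.

```python
def normalize_detected_encoding(detected, default="utf-8"):
    """Map a detector result to the encoding used for decoding.

    Hypothetical helper: falls back to `default` when the detector
    reports nothing or reports plain ASCII, and otherwise keeps the
    detected encoding unchanged.
    """
    if not detected:
        # The detector could not decide; use the default.
        return default
    if detected.lower() in ("ascii", "us-ascii"):
        # ASCII is a subset of UTF-8, so upgrading is lossless.
        return default
    return detected


# Usage: only the ASCII and "unknown" cases are overridden.
print(normalize_detected_encoding("ascii"))
print(normalize_detected_encoding(None))
print(normalize_detected_encoding("ISO-8859-1"))
```

Other detected encodings are passed through untouched, so files that are detected as, say, ISO-8859-1 keep their current behaviour.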

tmorrell commented 1 year ago

We've experienced the same issue with txt files (https://github.com/inveniosoftware/invenio-app-rdm/issues/1864), and I second the suggestion to default to UTF-8 instead of ASCII. The previewers themselves try to default to UTF-8, so detect_encoding should respect that default.