sagemathinc / cocalc

CoCalc: Collaborative Calculation in the Cloud
https://CoCalc.com

latex: detect non utf8 encoding and convert file #7584

Open haraldschilly opened 1 month ago

haraldschilly commented 1 month ago

This problem happens for an "IEEE conference template", which is encoded as ISO-8859. But this is a more general issue that could happen with any tex file coming from outside of CoCalc.

  1. Get the template from https://www.ieeesmc2024.org/call-for-paper by scrolling down and clicking on the "LATEX TEMPLATE" button
  2. Extract it in CoCalc (I did this in a terminal: unzip ieeeconf.zip, then open root.tex)

Observe there are broken chars in the sources + errors in line 61 and onwards:

Screenshot from 2024-05-27 13-41-31

Switching the engine to "xelatex" in the build/select engine dropdown at least gets rid of the errors:

Screenshot from 2024-05-27 13-42-46


The exact aim of this ticket is still unclear. The expected behavior is certainly that no such broken characters appear. As a first step, we should figure out how the tex file can be converted so that these characters are cleaned up (a workaround is below). The actual fix is probably to run file ... when a new file shows up and, if it is not UTF-8, convert it to UTF-8 automatically. I think it's too hard to change the editor itself to switch the encoding on a per-file basis.
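To make the detection step concrete, here is a minimal sketch in Python of one way to check whether a file is valid UTF-8. It is not CoCalc's implementation, and it swaps the `file` call for a strict decode attempt. The names `is_valid_utf8` and `needs_conversion` are hypothetical.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid byte sequence
        return True
    except UnicodeDecodeError:
        return False


def needs_conversion(path: Path) -> bool:
    """Return True if the file at `path` is not valid UTF-8 (hypothetical helper)."""
    return not is_valid_utf8(path.read_bytes())
```

Pure ASCII files pass this check, since ASCII is a subset of UTF-8. A failed check only says the file is not UTF-8; it does not say which legacy encoding was used, so the source encoding still has to be guessed or configured.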

haraldschilly commented 1 month ago

Workaround

$ file root.tex 
root.tex: LaTeX 2e document, ISO-8859 text, with very long lines (902), with CRLF line terminators

This suggests the file was created on Windows, or something like that: the encoding is ISO-8859 and the lines end in CRLF. Converting it to UTF-8 fixes the broken characters:

$ iconv -f ISO-8859-1 -t UTF-8 root.tex -o root-utf8.tex
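For an automated version of this workaround, the same conversion can be sketched in Python. This is an illustration only: the function name `convert_to_utf8` is hypothetical, and the default source encoding of ISO-8859-1 is an assumption carried over from the `iconv` call above.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode `src` as UTF-8 and write the result to `dst` (hypothetical helper).

    Reading and writing in binary mode leaves the line terminators (e.g. CRLF)
    untouched; only the character encoding changes.
    """
    raw = src.read_bytes()
    text = raw.decode(source_encoding)  # assumption: the caller knows the source encoding
    dst.write_bytes(text.encode("utf-8"))
```

Calling `convert_to_utf8(Path("root.tex"), Path("root-utf8.tex"))` writes a separate output file and leaves the original in place. Note that decoding as ISO-8859-1 never fails, because every byte maps to a character, so a wrong guess about the source encoding produces wrong characters silently instead of an error.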