hy-tira / tirakirja

Course book for the University of Helsinki course Data Structures and Algorithms (Tietorakenteet ja algoritmit)

PDF Nordic letter handling #35

Open 3Rton opened 1 year ago

3Rton commented 1 year ago

The PDF downloadable from GitHub (and distributed by my teacher for our course) has issues with how the Nordic letters (Ä, Ö, Å) are represented. When the PDF is opened in Chrome, the letters render fine. However, highlighting the text and copy-pasting it out of the PDF is problematic. For example, the second sentence of the foreword "Alkusanat" comes out looking like this: "Ensimm¨ainen vaihe on oppia ohjelmoinnin perustaidot, kuten miten k¨aytet¨a¨an muuttujia, ehtoja, silmukoita ja taulukoita."
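To make the pattern concrete: in the pasted sentence each "ä" shows up as a standalone diaeresis followed by the plain base letter. As a workaround on already-extracted text (not a fix for the PDF itself), a minimal Python sketch could swap the two and recompose them. The function name `repair_spacing_accents` is made up for this example, and the assumption that "Å" would come out as a standalone ring (U+02DA) plus "A" is a guess that has not been checked against the book.

```python
import re
import unicodedata

# Standalone (spacing) accent -> matching combining accent.
# Assumption: only these two accents occur in the extracted text;
# the ring-above entry is a guess and was not verified on the PDF.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis, as in "¨a" for "ä"
    "\u02da": "\u030a",  # ring above, assumed for "å"
}

# A spacing accent immediately followed by an ASCII letter.
_ACCENT_THEN_LETTER = re.compile("([\u00a8\u02da])([A-Za-z])")


def repair_spacing_accents(text: str) -> str:
    """Move each standalone accent behind its base letter and recompose."""
    swapped = _ACCENT_THEN_LETTER.sub(
        lambda m: m.group(2) + SPACING_TO_COMBINING[m.group(1)],
        text,
    )
    # NFC merges base letter + combining accent into one precomposed character.
    return unicodedata.normalize("NFC", swapped)


if __name__ == "__main__":
    print(repair_spacing_accents("k\u00a8aytet\u00a8a\u00a8an"))  # käytetään
```

This only cleans up text after it has been copied out; a TTS app reading the PDF directly would still see the broken representation.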

This has some unfortunate consequences for TTS (text-to-speech): the apps typically fail to recognize the words because of this erroneous representation and fall back to spelling out the word letter by letter. This makes the PDF borderline unusable both for blind people and for people who just want to review chapters on public transit or the like. Modern deep-learning TTS solutions in particular (such as Google's WaveNet) are capable of very natural, audiobook-like narration, so it is quite a shame that the PDF doesn't work for this.

In my testing, importing the GitHub PDF into Microsoft Word also breaks all the Nordic letters. However, replacing them and then re-exporting as PDF produces a file that can be copy-pasted and works with TTS. Could this just be an export-setting issue? For the Word export, the "PDF/A compliant" and "Document structure tags for accessibility" options were enabled, but it is hard to say whether that alone is the reason. (Word also defaults to Cambria when importing the PDF, so the font might play a part too.)

It would be interesting to hear whether other people can confirm similar behaviour when copy-pasting from the PDF and using TTS.

akirataguchi115 commented 1 year ago

I can reproduce this issue. In my experience, it also affects the department's other books. Maybe it has something to do with the templates used to create them?