latex3 / latex2e

The LaTeX2e kernel
https://www.latex-project.org/
LaTeX Project Public License v1.3c

LaTeX does not hyphenate some words with Romanian accented characters #29

Closed rsandu2007 closed 6 years ago

rsandu2007 commented 6 years ago

Hello,

Brief outline of the bug

The following test file demonstrates that LaTeX refuses to hyphenate some words with Romanian accented characters. In this file, the word "oricărei" at the end of a line generates an "Overfull \hbox" warning instead of being hyphenated.

The word's correct hyphenation is "ori-că-rei".

The Romanian language needs the full UTF-8 set, not just "latin1" or "latin2". The accented character in the test file is ă (small a with breve, Unicode U+0103).

Putting \hyphenation{ori-că-rei} in the preamble gives a more serious error (the demonstration file will not compile).

When "ă" is replaced with "a" inside the word "oricărei", the problem goes away and the word is hyphenated correctly. The LaTeX environment I've used is pdfTeX 3.14159265-2.6-1.40.17 (TeX Live 2016).

Minimal example showing the bug

\documentclass[10pt,a4paper,twoside]{book}
\usepackage[romanian]{babel}
\usepackage[utf8]{inputenc}

\begin{document}
\paragraph{}Editorii și tehnoredactorii au depus eforturi susținute pentru înlăturarea oricărei erori, de fond sau de formă. Unele greșeli minore descoperite în manuscris au fost corectate tacit.
\end{document}

Log file (required) and possibly PDF file

test.log test.pdf

eg9 commented 6 years ago

That's an unfortunate situation, but not a bug. The available fonts don't provide the “comma below” accent, so it has to be emulated with macros that disallow hyphenation past them.

You can get slightly better results if you load

\usepackage[T1]{fontenc}

so at least ă, â and î are treated as single characters, making them cooperate in hyphenation. But without a dedicated Romanian font with glyphs for ș and ț these will not be considered for hyphenation.
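As a rough sketch of that suggestion (not a tested fix), the original test file with T1 added might look like the following. The package order and the shortened paragraph are my own choices for illustration; words containing ș or ț would, as noted above, still be affected.

```latex
% Sketch only: the reporter's test file with T1 font encoding added.
% Intended for pdfLaTeX.
\documentclass[10pt,a4paper,twoside]{book}
\usepackage[T1]{fontenc}      % 8-bit encoding in which ă, â and î are single glyphs
\usepackage[utf8]{inputenc}
\usepackage[romanian]{babel}

\begin{document}
Editorii și tehnoredactorii au depus eforturi susținute pentru înlăturarea
oricărei erori, de fond sau de formă.
\end{document}
```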

For full support of Romanian, it's better to switch to XeLaTeX or LuaLaTeX.

A better place to ask for help and advice is a LaTeX forum or https://tex.stackexchange.com

FrankMittelbach commented 6 years ago

As mentioned, there is nothing we can really do about this, as it depends on glyph availability in the fonts.

rsandu2007 commented 6 years ago

Thank you for your kind answer!

IMHO, no software package should claim "support for Romanian language" unless it supports the full range of the Romanian diacritics. These diacritics can now be entered directly from the keyboard, according to the official Romanian keyboard layout, which has been standardised since 2004.

Here's the complete list:

ă Ă â Â î Î ș Ș ț Ț „ ” - glyphs for the Romanian language itself (including down-up double quotes)

ß đ Đ ł Ł - glyphs for ethnic minorities living in Romania, not for the Romanian language itself (German, Polish)

€ § © « » - glyphs for special signs that are not present on a US keyboard

Romanian diacritics (not too many of them, compared to French, Czech or Italian...) are always comma-below (șȘțȚ): please note that the cedilla-below versions, even if visually similar, are not part of the Romanian language (Romanian has no cedilla).

One more question, please: where should I discuss the matter of missing fonts (it is not clear to me where these fonts come from)? Could you please point me to a specialised discussion list?

Comment: Presently, there is a HUGE quantity of Romanian written material (books, articles, OCRed books, etc.) waiting to benefit from LaTeX typesetting, for conversion to digital form and/or online publication. However, there is a significant (read: discouraging...) amount of manual work (hacking) to do in order to prepare the simplest template for a book, article or thesis:

Having a good-quality filter to export LaTeX files to HTML5 is also immensely important, since an individual who takes on the work of typesetting a large book will immediately want to export it to a webserver and/or to ePub.

Presently, this amount of manual hacking is a BIG deterrent to using LaTeX for the Romanian language, especially for non-technical individuals (such as literary authors or editors, non-technical students). In order to work in practice, more of these features should come prepackaged in distros like Fedora, Ubuntu, Debian, etc., just by declaring „Romanian” as the language of the document class. Until we achieve that, most authors will be tempted to prepare their works directly in HTML5 (even by writing in a simple text file with manual HTML5 markup).

Thanks - best regards,

Răzvan

FrankMittelbach commented 6 years ago

When I started with TeX (and later with LaTeX), the situation was exactly the one you are experiencing now, i.e., all diacritics in the German language (in fact all diacritics whatsoever) were only available via the \accent primitive, and that prevented hyphenation in that part of the word. Nevertheless, people developed hyphenation patterns and made language support files like german.sty and later babel, even though there were such restrictions.

Babel is a framework. User communities maintain and add language support, and I think it is quite proper to support that even if it may have restrictions. As the core team we have no chance to do that support ourselves; it has to come from the community and from people who understand the language. Are you expecting us to make a judgement call and say Romanian is not supported/available because, from a quality perspective, it is at the point where other languages were 20 years ago?

In the case of Romanian, it looks to me as if all that Johannes ever got was translations of the main strings in a document but nothing more, so that is what babel is able to do at this point. That is better than nothing but understandably not enough for "proper" language support.

On the other hand, some of the points you claim are needed for the Romanian language aren't really language matters. Even if they are customary for many documents in your country, they shouldn't be hardwired in a language support file but provided by document classes tailored for that type of design.

jbezos commented 6 years ago

I'm not sure, but I think Romanian is not currently maintained by an "external" contributor, so I could add direct support for the comma-below variants, much as in Latvian and as explained on:

https://tex.stackexchange.com/questions/118080/babel-romanian-or-c

FrankMittelbach commented 6 years ago

@jbezos It is not supported externally, or so it seems. But it is not that the comma-below accents aren't there and need support (they got added to LaTeX, and that pr you cite is 4 years old), but that those glyphs are made up when standard fonts are used, as they do not exist as single characters in those fonts. Other than providing new fonts and encodings, there is not much you can do about that, can you?

jbezos commented 6 years ago

@FrankMittelbach Right (I didn't check before if things had changed -- too many languages to remember ;-)), but anyway I wonder if \bbl@allowhyphens could help. I must investigate.

rsandu2007 commented 6 years ago

Thank you all for your kind interest and support! :)

The question raised by Mr. Mittelbach deserves a more detailed response (explanation):

As a longtime Romanian GNU/Linux user (day-by-day, exclusively GNU/Linux, at desktop level...), I do my best to support various free software projects, especially their smooth adaptation to the Romanian language (in order to lower the entry barrier for the general public). Of particular interest to me are the tools dealing with text and publication (OCR, converting books and various valuable articles and texts to digital form, library work, etc.).

Unfortunately, I am not a developer myself, but a network administrator (dipl. engineer), so I cannot directly contribute code. I regularly write articles for the Romanian press, do translations (articles, documentation and GUIs), do OCR, contribute to lexicons, write Wikipedia pages, and participate in various Romanian communities for education (online and offline), talking to students, etc.

For this kind of work, LaTeX is one of the main (and best) tools - along with web authoring tools, search/SEO tools like YaCy, automatic translation engines, etc.

For the Romanian community, the real problem is the total lack of interest in Free Software from the OFFICIAL authorities (ministries, national standardization bodies, public education system, public libraries etc.). For reasons I won't discuss here, this ministry "support" is offered in a twisted way, so non-technical users come to the conclusion that "it's better to stay with Windows and commercially available software". Coupled with the failing appetite for reading (worldwide) and a less-than-optimal education system, the overall effect is devastating.

That doesn't mean, by any means, that Romanians are less interested in Free Software, LaTeX and similar projects than other nations I know. It's just about numbers (national population), the use of the language worldwide (Romanian is not English or Spanish...) and, as I said, the official authorities IGNORING the national community and real, day-to-day matters. Being ignored by the national standardisation bodies, the community of Romanian users is often forced to rely on international support for improving the quality of the tools.

In practical terms, what we are trying to achieve is more traction for LaTeX in Romanian universities and schools, as well as among editors (more professors preparing works in LaTeX, more students writing in it on a daily basis, more editors raising standards for their typesetting). For real work, the ease of use of such a tool is KEY (especially for non-technical users). Please think of a literary author or a Romanian language professor (non-technical, otherwise accustomed to Microsoft Word) typesetting their daily works in LaTeX. It has to be STRAIGHTFORWARD, with no room for many technicalities...

So it's a kind of chicken-and-egg problem. To have more Romanians involved and actively helping the international LaTeX community (doing localisation, creating and adapting new fonts, writing relevant scripts, translating documents, providing help etc.) we need a larger mass of Romanian LaTeX users. To have this larger mass, we need better support for the Romanian language in the tools themselves (otherwise, regular academic and publishing people will be strongly discouraged from using them, from the very beginning).

Thanks again, Răzvan

P.S. Following your kind advice, I was able to achieve a MUCH better result by writing my documents in XeLaTeX mode (fontenc, etc.) and declaring FreeSerif as the default font. I still struggle with other aspects, such as (a) blank lines between paragraphs not going away even if I set parskip to 0, and (b) TOTAL failure to convert my work to HTML, even at a minimal, decent level (I expect author John Doe will want to easily/magically convert his long, elaborate thesis to HTML5, to publish it online). All in all, I still suspect that the work of making a LaTeX system run smoothly for a Romanian thesis is still FAR beyond the reach of the average, non-technical author (myself being a day-by-day GNU/Linux user since 1998 and a computer engineer).

rsandu2007 commented 6 years ago

FYI, I've accidentally stumbled upon this academic paper (PDF), written in English by a fellow Romanian, that explains various conventions of STANDARD Romanian typesetting, as well as the custom LaTeX settings he used to achieve them.

http://www.icvl.eu/2012/disc/icvl/documente/pdf/soft/ICVL_SoftwareSolutions_paper08.pdf

It basically boils down to:

By all means, Romanian is a Latin language, using a small number of special conventions and diacritic glyphs, very similar to French, Italian or even Czech - not some sort of Cyrillic or Asian language... :-)

If possible, it would be nice to have these kinds of settings included by default in babel and other packages that claim "Romanian language support", so that they become entirely automatic when one declares, say, \usepackage[romanian]{babel}

Thanks a lot, Răzvan


jbezos commented 6 years ago

Thank you for the link. I'll have a look at it to see which adjustments are related to babel.

Javier

rsandu2007 commented 6 years ago

Thank you for your interest and help! :)

I think that, in descending order of importance, the most essential adaptations for the Romanian language are:

Ideally, the documentation for Romanian should mention somewhere that, for this language, one must use the XeTeX engine in order to easily cope with the UTF-8 issues (otherwise, many users will repeat the trial-and-error game I did, ad infinitum :) ). BTW, for my document (a high-school course, with regular text, images and mathematical formulas) I've put:

\usepackage{fontenc}
\usepackage{fontspec}
\fontspec{FreeSerif}

which pretty much solved the issue of diacritics.
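For readers trying the same route, here is a minimal sketch of a complete XeLaTeX/LuaLaTeX file along these lines. It is an illustration, not the exact setup used above: it swaps \fontspec for \setmainfont (the fontspec command for setting the document-wide main font), drops fontenc on the assumption that it is not needed once fontspec is loaded, and assumes FreeSerif is installed as a system font.

```latex
% Sketch only: compile with xelatex or lualatex, not pdflatex.
% Assumes the FreeSerif font (GNU FreeFont) is installed on the system.
\documentclass[10pt,a4paper,twoside]{book}
\usepackage{fontspec}          % Unicode font loading for XeTeX/LuaTeX
\setmainfont{FreeSerif}        % main text font for the whole document
\usepackage[romanian]{babel}   % Romanian strings and hyphenation patterns

\begin{document}
Editorii și tehnoredactorii au depus eforturi susținute pentru înlăturarea
oricărei erori, de fond sau de formă.
\end{document}
```

With a Unicode engine and a font that contains ș and ț as real glyphs, these letters are ordinary characters, which is the reason given earlier in the thread for switching engines.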

Răzvan