tallforasmurf / PPQT

A post-processing tool for PGDP written in Python, PyQt4, and Qt
GNU General Public License v3.0
Adding a language tagging interface #131

Closed bibimbop closed 11 years ago

bibimbop commented 11 years ago

The current PGDP suggestion is to convert i tags to em / cite / abbr tags, and to add the language where necessary.

For instance:

<i>Gallia Christiana</i>

becomes

<cite lang="la">Gallia Christiana</cite>

I would like to have an interface to do that. Here's a suggestion:

To avoid useless input, the drop-down menus should offer only the tags / languages that have been selected. These could be given as a comma/space-separated list at the top of the tab. The user should be able to choose whether to add lang, xml:lang, or both.

It should be possible to return to the tab later to review and fix errors.

This interface could even be more generic, allowing a search for any tag or set of tags and sorting the results by tag / content / language. This is something I do during PP; I have a simple Python script that does that.
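As an illustration of the kind of tag inventory described above, here is a minimal sketch built on Python's standard html.parser module. It is not the script mentioned in the comment; the names TagCollector, collect_tags, sort_records and INLINE_TAGS are hypothetical, and the set of tags is an assumption.

```python
from html.parser import HTMLParser

INLINE_TAGS = {"i", "em", "cite", "abbr"}  # assumption: the tags of interest


class TagCollector(HTMLParser):
    """Collect (tag, lang, text, line) for each element of interest."""

    def __init__(self, tags=INLINE_TAGS):
        super().__init__()
        self.tags = set(tags)
        self.open = []    # open elements: [tag, lang, line, text pieces]
        self.found = []   # finished records

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            attrs = dict(attrs)
            lang = attrs.get("lang") or attrs.get("xml:lang") or ""
            line = self.getpos()[0]  # line on which the start tag begins
            self.open.append([tag, lang, line, []])

    def handle_data(self, data):
        # Text belongs to every element of interest that is currently open.
        for entry in self.open:
            entry[3].append(data)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i][0] == tag:
                name, lang, line, pieces = self.open.pop(i)
                text = " ".join("".join(pieces).split())
                self.found.append((name, lang, text, line))
                break


def collect_tags(html, tags=INLINE_TAGS):
    parser = TagCollector(tags)
    parser.feed(html)
    parser.close()
    return parser.found


def sort_records(records, key="content"):
    """Sort by 'tag', 'language' or 'content'; ties keep document order."""
    index = {"tag": 0, "language": 1, "content": 2}[key]
    return sorted(records, key=lambda r: (r[index].lower(), r[3]))
```

Calling sort_records(collect_tags(text), "content") puts identical or nearly identical phrases next to each other, each with the line it came from, so each phrase only has to be checked once.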

tallforasmurf commented 11 years ago

Gracious me. I've always been a fan of semantic tagging as opposed to format tagging, but never thought much beyond preferring em or cite to i. I do not see anything about this in the formatting guidelines.

Searching further, I find the wiki page on PGTEI esp. the part on italics, but here the coding is different from what you show, e.g. a foreign tag: This text is in <foreign lang="fr" rend="font-style: italic">une langue étrangère</foreign>.

I was not at all aware of PGTEI. I can appreciate its goals. I suspect I need to give it proper, or at least more, support in PPQT. I fear that will lead to quite a lot of rethinking and new interfaces. Your suggestion would only be a first cut at it. Unfortunately, the wiki page Moving from DP Formatting to PGTEI is only a stub!!!

Searching further, I find the wiki page Accessible HTML Books; regarding language, it says: "Just put the lang (and xml:lang in XHTML) onto any other tag that surrounds the passage, sentence or word that's in a different language, such as a <blockquote>, <q> or <span>." This looks more like what you describe.

Right away I see I should start looking for (regex) \blang[=:]['"]?(\w+) and use that in place of my ad-hoc alternate-spell-dict syntax. (Supposing that the XML abbreviations for languages match up to the dictionary names!)
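To see what that pattern actually captures, here is a small sketch using Python's re module. The pattern is copied unchanged from the comment above; the function name find_lang_codes and the sample strings are made up for illustration.

```python
import re

# The pattern quoted in the comment above, unchanged.
LANG_RE = re.compile(r'''\blang[=:]['"]?(\w+)''')


def find_lang_codes(text):
    """Return every language code the pattern captures, in document order."""
    return LANG_RE.findall(text)


samples = [
    '<span lang="fr">une langue étrangère</span>',
    "<em lang='de_DE' xml:lang='de_DE'>Zeitgeist</em>",
    '<cite lang="en-GB">The Times</cite>',
]
for s in samples:
    print(find_lang_codes(s))
# ['fr']
# ['de_DE', 'de_DE']
# ['en']
```

Two things show up in the output: a tag carrying both lang and xml:lang is reported twice, and because \w does not match a hyphen, a code written as en-GB is cut to en, while the underscore form en_GB is captured whole.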

Regarding i/em/cite/strong tags, I note that in both PGTEI and HTML you only need to mark the language of phrases that are not in the book's base language. If you have <html lang='fr'> in HTML and <language id='fr'> in your TEI header, all you need is a plain <em>; only a phrase in a language other than the base needs <em lang='xx'>.
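The inheritance rule for the HTML case can be sketched as follows: each text run takes the language of its nearest ancestor that declares one. This is an illustrative sketch on top of the standard html.parser module, not PPQT code; the class name LangResolver and the VOID tag list are assumptions.

```python
from html.parser import HTMLParser


class LangResolver(HTMLParser):
    """Record (effective_lang, text) for each text run in the document."""

    VOID = {"br", "hr", "img", "meta", "link", "input"}  # tags with no end tag

    def __init__(self, default_lang=""):
        super().__init__()
        self.stack = [("", default_lang)]  # (tag, effective language)
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attrs = dict(attrs)
        inherited = self.stack[-1][1]
        # An explicit lang or xml:lang wins; otherwise inherit from the parent.
        lang = attrs.get("lang") or attrs.get("xml:lang") or inherited
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.runs.append((self.stack[-1][1], text))


resolver = LangResolver()
resolver.feed("<html lang='fr'><body><p>Il a dit <em>non</em> puis "
              "<em lang='en'>never</em>.</p></body></html>")
resolver.close()
print(resolver.runs)
# [('fr', 'Il a dit'), ('fr', 'non'), ('fr', 'puis'), ('en', 'never'), ('fr', '.')]
```

The plain <em> picks up 'fr' from the html element, and only the phrase with its own lang attribute is reported as 'en'.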

Anyway -- I am going to make this a V2 goal and do some more research. I'd like to come in with a really good support structure for TEI before guiguts does, it might give PPQT a little more traction. Please comment further especially when you see any good links to forum threads etc. on this topic.

bibimbop commented 11 years ago

Take a look at http://www.pgdp.org/~jana/best-practices/ in the Inline Formatting section.

Regexes are not good enough. I just don't want to have to check the same sentence inside a tag several times. Once is enough. Plus it's less error-prone. Also, isolating elements allows visual checking by presenting the document in a different way. Here's an example of an error in a book that I caught with such a tool:

...
(i) (la) Gesta Caroli Magni ad Carcassonam et Narbonam
(i) (la) Gesta Caroli Magni ad Narbonam et Carcassonam
...

These are book titles mentioned in the footnotes, separated by 100 pages. One of them is wrong. I couldn't have found this error any other way.
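A check for this particular kind of slip (same words, different order) can also be automated. The sketch below is only an illustration, not the tool referred to above; the function name suspicious_variants is hypothetical and the line numbers in the usage example are invented.

```python
from collections import defaultdict


def suspicious_variants(phrases):
    """Group phrases that use the same words but are not worded identically.

    `phrases` is an iterable of (text, location) pairs; the location is
    whatever the caller uses to find the phrase again (a line number here).
    """
    groups = defaultdict(dict)
    for text, where in phrases:
        key = tuple(sorted(text.lower().split()))  # word order is ignored
        groups[key].setdefault(text, []).append(where)
    # Keep only word sets that occur with more than one distinct wording.
    return [variants for variants in groups.values() if len(variants) > 1]


titles = [
    ("Gesta Caroli Magni ad Carcassonam et Narbonam", 120),   # invented line numbers
    ("Gallia Christiana", 300),
    ("Gesta Caroli Magni ad Narbonam et Carcassonam", 4100),
    ("Gallia Christiana", 5000),
]
for variants in suspicious_variants(titles):
    for text, lines in variants.items():
        print(lines, text)
```

This only flags reordered words. Misspellings would need a fuzzy comparison, and a sorted listing for visual checking remains the more general approach.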

Concerning TEI, I don't think that's a priority since, AFAIK, no one uses it and no one cares. Displacing guiguts is going to take a while (PPQT is missing some functionality, and guiguts is the incumbent that almost everyone uses).

tallforasmurf commented 11 years ago

This is now implemented somewhat: you can put a lang="dict-tag" attribute in any HTML tag (e.g. span, div...) and that triggers use of an alternate dictionary.

The dict-tag values such as en_GB and de_DE may or may not conform to the language naming convention RFC that is in progress, but the markup is still valid X/HTML.