tallforasmurf / PPQT

A post-processing tool for PGDP written in Python, PyQt4, and Qt
GNU General Public License v3.0

Use LANG attribute for alt dict code #146

Closed by tallforasmurf 11 years ago

tallforasmurf commented 11 years ago

Right now an alt dictionary is specified with a nonstandard tag <sd tag>, which is a bad idea for several reasons. Notably, you have to remember to remove these tags before you finalize the etext or do html conversion on that fork, at which point you lose the work that went into placing them in the first place!

What PPQT should do is note the use of lang='code' and translate from the HTML language codes (www.w3.org/TR/html401/struct/dirlang.html#h-8.1) to dictionary tags. Then the user could enter, for example, <span lang='fr'>...</span> or <p lang='de'>, and these could be left as-is when converting the html document (they would still have to be removed from the ascii, of course).

Issues to consider:

  1. where do you put the translation from lang='xx' to a dict tag value? Probably in the makeSpellCheck class in pqSpell. It stores the list of available tags.
  2. what to do if there is no obvious translation? Probably query the user, with a popup list of available tags.
  3. support xml:lang='foo'? Sure, just look for <.+lang=['"](.+)['"] and it falls out without further effort.
  4. do we look for <html [xml:]lang='foo'> at the head of the document and use it to set the default dict?
  5. do we look for http-equiv='Content-Language' content='fr_CA' and use it for a default dict? Probably not because spellcheck applies much earlier in the work flow before there is an http header.
  6. alternatively, would we insert <html lang='foo'> based on the current dict during html autoconvert of the document?
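
A minimal sketch of items 1 to 3, to make the idea concrete. This is not PPQT code: the names `LANG_ATTR_RE`, `find_lang_codes`, `lang_to_dict_tag` and the `available_tags` list are hypothetical, and the matching rule (exact match first, then a primary-language match only when it is unique) is just one assumption about what counts as an "obvious" translation. The pattern is a tighter, non-greedy variant of the one in item 3, so that several tags on one line are found separately; it picks up xml:lang as well as lang.

```python
import re

# Hypothetical pattern: finds lang='xx' or xml:lang="xx" inside an opening tag.
# The closing quote must match the opening quote (backreference \1).
LANG_ATTR_RE = re.compile(
    r"""<\w+\b[^>]*?\blang\s*=\s*(['"])(.+?)\1""", re.IGNORECASE
)


def find_lang_codes(text):
    """Return every lang / xml:lang value found in opening tags, in order."""
    return [m.group(2) for m in LANG_ATTR_RE.finditer(text)]


def lang_to_dict_tag(code, available_tags):
    """Translate an HTML language code to one of the available dict tags.

    Returns None when there is no obvious translation; that is the point
    where the user would be queried with a list of available tags.
    """
    normalized = code.replace("-", "_").lower()
    # 1. Exact match, ignoring case and the hyphen/underscore difference.
    for tag in available_tags:
        if tag.lower() == normalized:
            return tag
    # 2. Primary-language match, accepted only if it is unambiguous.
    primary = normalized.split("_")[0]
    candidates = [t for t in available_tags
                  if t.split("_")[0].lower() == primary]
    if len(candidates) == 1:
        return candidates[0]
    return None
```

With available tags en_US, en_GB, fr_FR and de_DE, lang='fr' would resolve to fr_FR, while lang='en' is ambiguous and would fall through to the user query.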
tallforasmurf commented 11 years ago

Commit 608ebb76 implements this. Since the standard seems to allow lang= on any tag, I just changed the code for <sd dict>..</sd> to generalize to <anytag lang='xxxx' ... > ... </sametag>. As before, during the word census the dict tag gets appended to the word, e.g. amour/fr_FR, which is how it shows up in the word table.
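
For illustration only, here is a rough sketch of that tagging behavior. It is not the code from the commit; `LANG_SPAN_RE`, `WORD_RE`, `TAG_RE` and `census_words` are hypothetical names, and the sketch assumes the tagged element is not nested inside another element of the same name.

```python
import re

# Hypothetical pattern: <anytag ... lang='xxxx' ...> ... </sametag>
# group 1 = tag name, group 3 = the lang value, group 4 = the enclosed text.
LANG_SPAN_RE = re.compile(
    r"""<(\w+)\b[^>]*?\blang\s*=\s*(['"])(.+?)\2[^>]*>(.*?)</\1\s*>""",
    re.DOTALL,
)
# A word is a run of letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")
# Any remaining markup is blanked out before words are collected.
TAG_RE = re.compile(r"<[^>]+>")


def census_words(text):
    """List the words of text, appending /dict_tag to words in lang spans."""
    words = []
    pos = 0
    for m in LANG_SPAN_RE.finditer(text):
        # Words before the tagged span belong to the default dictionary.
        words.extend(WORD_RE.findall(TAG_RE.sub(" ", text[pos:m.start()])))
        dict_tag = m.group(3)
        inner = TAG_RE.sub(" ", m.group(4))
        words.extend(w + "/" + dict_tag for w in WORD_RE.findall(inner))
        pos = m.end()
    # Words after the last tagged span, again untagged.
    words.extend(WORD_RE.findall(TAG_RE.sub(" ", text[pos:])))
    return words
```

Given the text "She said <span lang='fr_FR'>mon amour</span> twice.", this yields She, said, mon/fr_FR, amour/fr_FR, twice.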

The "xxxx" has to be a dict tag, e.g. lang='en_GB', which is maybe not exactly what the standard says. The standard is unclear, and the relevant RFC is "to be replaced" by a newer one that is currently a 404 error! But it is much simpler for me and for the user to specify the language as a dict tag, a list of which is available anytime under View> dictionary. And it is unambiguous. You can leave the lang attribute in place in the HTML version. Should it turn out that you need some code other than the dict tag, a global replace fixes it.