snaekobbi / pipeline-mod-braille

ARCHIVED. Please don't make any new issues or pull requests in this repo.

Dutch hyphenation #46

Open bertfrees opened 9 years ago

bertfrees commented 9 years ago

See https://github.com/snaekobbi/issues/issues/2 for the various options for implementing a hyphenator.

bertfrees commented 9 years ago

Maybe a useful tip from CBB (Christelijke Bibliotheek voor Blinden en Slechtzienden): they use a version of hyph_nl_NL.dic from OpenTaal.

dkager commented 9 years ago

The OpenTaal data sounds promising. Will look at this next week and maybe you can fill me in on the best way to implement this in mod-braille (from what I read there is OpenOffice data available for this dict).

dkager commented 9 years ago

I'm guessing this is the hyphenation dictionary from OpenTaal.org that CBB is using. Maybe I can use the same approach as in snaekobbi/issues#2 for this? I don't have test data yet, so integrating the dictionary into mod-braille could be done first.

bertfrees commented 9 years ago

The dictionary you linked is the one that is already included in Pipeline. I think CBB was maybe referring to an updated version. We'd have to ask them.

We need test data before we can do anything else. Then, if you need to modify the dictionary, it's best you copy the file to a new project (like Jukka did with pipeline-mod-celia) because the dictionary from LibreOffice is downloaded and packaged automatically.

dkager commented 9 years ago

I believe the OpenTaal data dates from 2011, but I'll see if I can confirm this with someone from CBB. What sort of test data are you looking for?

bertfrees commented 9 years ago

Hyphenated words, I guess. I understand you may not have that kind of data just lying around. But if there's nothing to test, then our job is done and we just take what's currently available. I think at the minimum we should have a small test, if only so we can easily add more to it later. Jukka's test data is also very limited, but it's easy to add more. He did it in pipeline-mod-celia because that's where his dictionary lives, but we could have your tests in functional-testing.

dkager commented 9 years ago

So if I understand this correctly, we have:

And we need:

For Finnish the test data is in the JUnit test case. I could clone this into another module, but I think it would be a bit nicer to have something similar to liblouis' harness tests for this. I.e. experts only worry about JSON or some other format, and the JUnit tests pull these files in and run them.
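To make the harness idea concrete, here is a minimal sketch in Python (chosen for brevity; the real tests would be JUnit). Everything in it is an assumption for illustration: the JSON layout, the function names, and the stub hyphenator, which stands in for whichever library ends up being used. The sample words are illustrative, not vetted test data.

```python
import json

def run_harness(json_text, hyphenate):
    """Run every case in a JSON test file against a hyphenate(word) function.

    Each entry in "words" is the expected hyphenation; the input word is
    derived by removing the hyphens. Returns (word, expected, actual) tuples
    for the failing cases. Words containing a real hyphen are not handled.
    """
    failures = []
    for expected in json.loads(json_text)["words"]:
        word = expected.replace("-", "")
        actual = hyphenate(word)
        if actual != expected:
            failures.append((word, expected, actual))
    return failures

# Hypothetical stand-in for a real hyphenator: a plain lookup table.
STUB_TABLE = {"water": "wa-ter", "tafel": "ta-fel"}

def stub_hyphenate(word):
    # Unknown words are returned unhyphenated.
    return STUB_TABLE.get(word, word)

# Hypothetical test file content that an expert would maintain.
TEST_DATA = '{"language": "nl", "words": ["wa-ter", "ta-fel", "fiets"]}'

if __name__ == "__main__":
    for word, expected, actual in run_harness(TEST_DATA, stub_hyphenate):
        print(f"FAIL {word}: expected {expected}, got {actual}")
```

The point of the split is that the data file can grow without anyone touching the test code.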

Also, which of the three libs (Libhyphen, Hyphenator, TexHyphenator) should we use?

bertfrees commented 9 years ago

I suggest we use XML instead of JSON. Something like this. If everybody includes test data in that format in the functional-testing repo, then I can have one test (JUnit or XSpec) that runs them all. Of course, from the developer's point of view it is nice to have the tests closer to the implementation, but since you don't intend to modify the dictionary for the time being, that's not a problem. Later we can still copy or move the test to its own module.
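As a rough illustration of one test running everybody's data, here is a sketch in Python. The XML layout below (element and attribute names) is invented for this example and is not the format linked above; the per-language hyphenators are lookup-table stubs, and the sample words are illustrative only.

```python
import xml.etree.ElementTree as ET

def load_cases(xml_text):
    """Parse one test document into (lang, input, expected) tuples."""
    root = ET.fromstring(xml_text.strip())
    lang = root.get("lang")
    return [(lang, t.get("input"), t.get("expected")) for t in root.iter("test")]

def run_all(xml_documents, hyphenators):
    """Run all cases from all documents; return the failing ones."""
    failures = []
    for doc in xml_documents:
        for lang, word, expected in load_cases(doc):
            actual = hyphenators[lang](word)
            if actual != expected:
                failures.append((lang, word, expected, actual))
    return failures

def make_stub(table):
    # Hypothetical stand-in for a real hyphenator: unknown words stay unbroken.
    return lambda word: table.get(word, word)

# Hypothetical test documents, one per contributor or language.
NL_TESTS = """
<hyphenation-tests lang="nl">
  <test input="water" expected="wa-ter"/>
  <test input="fiets" expected="fiets"/>
</hyphenation-tests>
"""

FI_TESTS = """
<hyphenation-tests lang="fi">
  <test input="talo" expected="ta-lo"/>
</hyphenation-tests>
"""

HYPHENATORS = {
    "nl": make_stub({"water": "wa-ter"}),
    "fi": make_stub({"talo": "ta-lo"}),
}

if __name__ == "__main__":
    print(run_all([NL_TESTS, FI_TESTS], HYPHENATORS))
```

With this shape, adding a language means adding a data file and registering a hyphenator; the runner itself does not change.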

Which of the libraries we should use is not so important, I think. What I've done with Finnish is convert the patterns into several formats at build time, so that several implementations become available in DP2. As long as all implementations behave the same (which they should in theory, and we can easily test each of them with the same test data), we don't have to worry about which one is actually used.
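Checking that the implementations agree could look roughly like the Python sketch below. The three callables are stubs named after the libraries mentioned earlier in the thread; they are not real bindings, and the function name `find_disagreements` is made up for this example.

```python
def find_disagreements(implementations, words):
    """Run every implementation on every word.

    Returns {word: {implementation_name: result}} for the words on which
    the implementations do not all produce the same hyphenation.
    """
    disagreements = {}
    for word in words:
        results = {name: hyphenate(word) for name, hyphenate in implementations.items()}
        if len(set(results.values())) > 1:
            disagreements[word] = results
    return disagreements

def make_stub(table):
    # Hypothetical stand-in for a real hyphenator backed by converted patterns.
    return lambda word: table.get(word, word)

# Stubs only: in reality each entry would wrap a different library
# loaded with the same patterns in its own format.
IMPLEMENTATIONS = {
    "Libhyphen": make_stub({"water": "wa-ter", "tafel": "ta-fel"}),
    "Hyphenator": make_stub({"water": "wa-ter", "tafel": "ta-fel"}),
    "TexHyphenator": make_stub({"water": "wa-ter"}),  # deliberately missing "tafel"
}

if __name__ == "__main__":
    for word, results in find_disagreements(IMPLEMENTATIONS, ["water", "tafel"]).items():
        print(word, results)
```

Any word reported here points at a conversion problem or a behavioural difference between the libraries, which is exactly what would make the choice of implementation matter.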