nlbdev / pipeline

NLB branch of the super-project that aggregates all Pipeline related code. See https://github.com/daisy/pipeline for the main branch.
http://repo.nlb.no/pipeline

Hyphenation exception list #208

Open KariRudjord opened 6 years ago

KariRudjord commented 6 years ago

Make a solution for hyphenation exceptions, e.g. words that should never be hyphenated: de-tte, de-nne, di-sse, kj-eks

bertfrees commented 5 years ago

The method that we used before to enter exception words required that the base table did not contain any patterns with 9 in it. This is because 9 is the highest priority, so it's not possible to undo certain hyphenation points with exceptions. It might work fine but you must realize it's not guaranteed.
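To make the priority issue concrete, here is a minimal Python sketch of Liang-style pattern matching, the scheme used by TeX and Hyphen patterns: at each position the highest level wins, and an odd level allows a break while an even level forbids it. The patterns `e9t`, `e7t` and `.d8e8t8t8e.` are made up for illustration and are not taken from hyph_nb_NO.dic. The sketch assumes that an exception word is written as a whole-word pattern with level 8 at every position, and it ignores Libhyphen details such as minimum prefix and suffix lengths.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'e9t' into its letters and per-position levels."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)   # level that sits before letters[pos]
        else:
            pos += 1
    return letters, levels


def position_levels(word, patterns):
    """Return the winning (maximum) level before each character of '.word.'."""
    text = "." + word + "."
    result = [0] * (len(text) + 1)
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                result[start + offset] = max(result[start + offset], level)
            start = text.find(letters, start + 1)
    return result


def hyphenate(word, patterns):
    """Insert '-' wherever the winning level is odd."""
    levels = position_levels(word, patterns)
    out = []
    for i, ch in enumerate(word):
        # word[i] is text[i + 1], so the level before it is levels[i + 1]
        if i > 0 and levels[i + 1] % 2 == 1:
            out.append("-")
        out.append(ch)
    return "".join(out)


# Hypothetical base pattern at level 9 plus a level-8 exception for "dette"
with_nine = hyphenate("dette", ["e9t", ".d8e8t8t8e."])
# Same base pattern at level 7: now the level-8 exception can win
with_seven = hyphenate("dette", ["e7t", ".d8e8t8t8e."])
```

In this toy setup the level-8 exception cannot remove the break introduced by the level-9 pattern, but it can remove one introduced at level 7.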

There are of course other ways to implement an exception list. For example by checking for exception words before passing words to libhyphen. That sounds very straightforward but is not so great for a number of reasons so we have to think it through.
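A minimal sketch of that lookup-first idea, in Python. Everything here is hypothetical: `make_hyphenator`, the exception mapping and the `fallback` callable (which stands in for a call into libhyphen) are illustration only, not an existing API. It also assumes the input word contains no hyphens of its own.

```python
def transfer_hyphens(word, hyphenated):
    """Copy the hyphen positions of `hyphenated` onto `word`, keeping its case."""
    letters = iter(word)
    return "".join("-" if ch == "-" else next(letters) for ch in hyphenated)


def make_hyphenator(exceptions, fallback):
    """Build a function that consults `exceptions` before calling `fallback`.

    exceptions: dict mapping a lower-cased word to its hyphenated form
                (an entry without '-' means "never hyphenate").
    fallback:   callable taking a word and returning its hyphenated form;
                a stand-in for the real pattern-based hyphenator.
    """
    def hyphenate_word(word):
        known = exceptions.get(word.lower())
        if known is not None:
            return transfer_hyphens(word, known)
        return fallback(word)
    return hyphenate_word


# Toy fallback that always breaks after the second letter (not a real hyphenator)
toy_fallback = lambda w: w[:2] + "-" + w[2:]

hyphenate_word = make_hyphenator(
    {"dette": "dette", "barnehage": "barne-hage"},
    toy_fallback,
)
```

Here `hyphenate_word("Dette")` returns the word unbroken, while words not in the mapping go through the fallback unchanged.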

By the way, I noticed that you inserted your exception list of non-standard hyphenations in hyph_nb_NO.dic directly, whereas we used to concat the two files while building. Not really anything wrong with that, except that there is duplication now.

One thing that bothers me is that running substrings.pl on your new hyph_nb_NO.dic yields changes to the file. I expected it to add some new patterns because of the addition of the non-standard hyphenation patterns, but I didn't expect that it would change the whole file (and not only the ordering). As far as I understand that means that the file was not prepared correctly.

Finally, what is that last line: "Binærfil (standard inndata) samsvarer"?

josteinaj commented 5 years ago

Would it be possible to for instance replace all the standard hyphenation 8s with 7, replace all the non-standard hyphenation 9s with 8, and then use 9 for the exception words? Do you think that would break things?

We could go back to how they were concatenated before. I didn't think about that.

Yeah I have no idea how it was prepared, only that it works better than what we had before.

Oh, whoops. "Binærfil (standard inndata) samsvarer" is from a diff and means "Binary file (standard in) matches" (or however it's formulated in the English locale). I suppose I performed a diff at some point. That should be removed.

bertfrees commented 5 years ago

The fact that the non-standard patterns have 9 in them is not a problem because these already represent full words, so they can't and don't need to be overwritten by exceptions.

The problem is that the base table has 9 in it as well. You can't change any of the numbers because all levels are taken.
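One reason renumbering is delicate: in Liang-style patterns the parity of a level carries its meaning (odd allows a break, even forbids one), so shifting a level by one flips what it does. The Python sketch below only illustrates that general rule; `shift_levels` and the sample pattern are hypothetical and say nothing about the actual contents of the table.

```python
def shift_levels(pattern, mapping):
    """Rewrite the digits of a pattern in one pass, e.g. {'8': '7', '9': '8'}."""
    return "".join(mapping.get(ch, ch) for ch in pattern)


def allows_break(level):
    """Odd levels allow a hyphenation point, even levels inhibit it."""
    return level % 2 == 1


# Made-up pattern: level 8 inhibits after 'a', level 9 allows a break after 'b'
original = "a8b9c"
shifted = shift_levels(original, {"8": "7", "9": "8"})

# Meaning of each level before and after the shift
before = [allows_break(int(ch)) for ch in original if ch.isdigit()]
after = [allows_break(int(ch)) for ch in shifted if ch.isdigit()]
```

After the shift, the position that was inhibited becomes a break point and the former break point becomes inhibited, so both meanings are inverted.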

josteinaj commented 5 years ago

Yeah. I don't know how the implementation works but would it be a problem to move patterns with 9s to 8 even though there already are other patterns using 8? I would expect the behavior to change slightly but maybe not that much?

bertfrees commented 5 years ago

I think you need to read this.

bertfrees commented 5 years ago

As always I think it would be better to start from the source code if possible instead of patching up the resulting table. It was the case when we used the "spell-norwegian" project and it is the case now. Of course there needs to be source code in the first place, and you need to be able to get hold of it. As you know it is not always possible to get in touch with the author. There are a bunch of people mentioned in the header of the No.pm file of the Text-Hyphen-No project, but it is not immediately clear who you would contact.

By the way maybe you should include the header in our copy too. In case anyone else stumbles upon our copy it would be helpful for them.

> One thing that bothers me is that running substrings.pl on your new hyph_nb_NO.dic yields changes to the file.

Thinking about this again, this is of course because the patterns in the Text-Hyphen-No project are TeX patterns, not Hyphen patterns. The Text::Hyphen Perl module has nothing to do with the Hyphen library.

josteinaj commented 5 years ago

I grepped all the files in spell-norwegian-2.2.tar.gz and found e-mail addresses for three Norwegians. I have now sent a request to each of them. :crossed_fingers:

josteinaj commented 5 years ago

I got a response from Rune Kleveland (translated by me):

The hyphenation files are made with a program called patgen which is bundled with TeX. There are many rules that can be used to generate patterns in the Makefile in the patterns folder in the file

https://sourceforge.net/projects/spell-no/files/ispell-norsk/ispell-norsk%202.0/

To build the list from source you need a lot of things, among other things ispell. There are many complicated rules in the Makefile, and you'll probably have to use Linux for it to work.

The challenge with hyphenation is mainly compound words. To improve this one needs several compound words that are actually used. If you add it to the list you'll get better patterns with the build scripts in the distribution.

There are two levels for hyphenation. One file that divides words into compounds/parts (barnehage-tante, barne-hage), and one that divides into components (bar-ne-hage-tan-te). This can be used to collect compound words, because the rules that divide into compound words can be used to divide compound words that can be collected. And then you'll have to avoid sydame-rikaner, pils-piss and similar divisions.

There is actually quite a bit of documentation in the file. And there are newer versions as well, but I don't really think they have worked a lot with hyphenation.

I've downloaded the CVS repo for spell-no from sourceforge, converted it to git, and uploaded it here in case we want to make changes to it:

https://github.com/nlbdev/spell-no

So as I understand it, it's the norsk.words file that is the input (which we already tried), but we need to get the Makefile working so that it generates the word components.

bertfrees commented 5 years ago

Cool! Yes, this is kind of what I remember from reading the README some years ago. I never tried to build it because the Makefile seemed super complicated. But maybe now that we can ask Rune Kleveland for help we should give it a try.

Do you know what the connection is between this project and the TeX file that you found in the Perl project?

bertfrees commented 5 years ago

The things we should definitely try to do after we manage to build it are:

bertfrees commented 5 years ago

The Makefile in the spell-norwegian-2.1 project looks a bit newer than the one in spell-no that you got from CVS.

bertfrees commented 5 years ago

I've been looking at the code a bit. The part of the build that interests us the most is in the patterns subdir. It says:

The purpose of this script is to generate hyphenation patterns for use with TeX based on a dictionary hyphenated at compound points and a pattern file which handles non-compound words.

In other words, it looks like the basis of the whole thing is an existing patterns file. But the patterns themselves are not used in the final output, rather they are used to hyphenate the word list, and from this new patterns are generated. This means we could add our own exception words to the process. Also we can decide the hyphen levels in such a way that at least one level is available to add the non-standard patterns at the end. (We can't add them sooner in the process because this part is specific to Libhyphen.)
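The idea of injecting our own exception words into that process could look roughly like the Python sketch below. This is an assumption-laden illustration, not how the Makefile does it: `build_training_words` is a hypothetical helper, `hyphenate` stands in for "hyphenate with the existing patterns", and the output is assumed to be a list of hyphenated words that would then be handed to the pattern generator.

```python
def build_training_words(words, hyphenate, exceptions):
    """Hyphenate `words` with the existing patterns, then let exceptions override.

    words:      iterable of plain words (as in a word list file)
    hyphenate:  callable word -> hyphenated word (stand-in for the existing patterns)
    exceptions: iterable of hand-written entries such as 'barne-hage' or 'dette';
                an entry without '-' means the word must not be broken
    """
    merged = {word: hyphenate(word) for word in words}
    for entry in exceptions:
        # Key on the plain word so the exception replaces the generated form
        merged[entry.replace("-", "")] = entry
    return [merged[word] for word in sorted(merged)]


# Toy stand-in for the existing patterns (made-up results)
toy = {"barnehage": "barne-hage", "dette": "de-tte"}
training = build_training_words(
    ["barnehage", "dette"],
    lambda w: toy[w],
    ["dette", "kjeks"],
)
```

Because the exceptions are merged before pattern generation, the generated patterns would learn them along with the rest of the word list instead of relying on a separate runtime lookup.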

The README suggests two solutions for when hyphenation fails on words (not in the dictionary). The first one I don't fully understand. The second solution basically adds the word to a list of exceptions which are checked at runtime (TeX's \hyphenation{...} command). I'd much rather generate proper patterns though, and it looks like this is possible.

This project solves non-standard hyphenation too, but apparently it is done via some TeX configuration file. I don't think we have to reevaluate our approach, but it's something to keep in mind.

At the top of the Makefile it says that you need a patgen with enough capacity. Luckily we can build patgen from source in case we need to make adjustments to some parameters in the code.

josteinaj commented 5 years ago

> Do you know what the connection is between this project and the TeX file that you found in the Perl project?

No, I don't.

Maybe we could add a Dockerfile to https://github.com/nlbdev/spell-no with a build environment?

bertfrees commented 5 years ago

OK, sure. Although the build prerequisites are almost non-existent. You only need common Unix tools like awk, sed, gzip, etc., and we will probably build patgen ourselves, and the patgen build is also fully self-contained.

By the way I compared the "patterns" directories inside the spell-no repo and the newer spell-norwegian-2.1 and the conclusion is that the differences are negligible, so we can proceed with the spell-no repo.

bertfrees commented 5 years ago

The norsk.words file however has become much bigger in spell-norwegian-2.1, so it's a good idea to update it in spell-no.

bertfrees commented 5 years ago

I have a good understanding now of how the build works. I had to do a few modifications in order to get it working on my Mac OS machine and with my version of patgen. I also had to add a rule to create a patterns file for Libhyphen. Before I proceed with the other planned changes, like adding exception words, support for non-standard hyphenation, etc., we should discuss a few things (see Slack).

bertfrees commented 5 years ago

I did a lot of work for this issue last December, but there is still some work to do before we can use it in Pipeline, notably:

I also would like to:

Has anyone tried my tool to check for mistakes or missing words in the norsk.words and norsk.singlewords files?

josteinaj commented 5 years ago

I really like the tool you made. One thing is that we need to be sure that we keep it in sync with our latest build, in case we make changes to the norsk.words/norsk.singlewords/etc. files. Maybe it could be built together with the rest of the system (maybe it already is, I haven't looked at it since last year), and then I could expose the file on the docker container running Pipeline 2 here. Ideally, it would be almost interactive (edit file - check results - edit again - etc.), but I suspect it's not worth spending time on that. Non-standard hyphenation is something we certainly need yes.