tokeniser-gramcheck-gt-desc.pmhfst is 211M

unhammer commented 2 years ago

In April 2020 it was 106M
In September 2018 it was 39M

Where did we go wrong?

snomos commented 2 years ago

@flammie did look into memory consumption for pmhfst files a while ago. Maybe he has some ideas.

flammie commented 2 years ago

Mm, there are some things that legit multiplied the automaton size, eg. upcase in https://github.com/giellalt/lang-sme/commit/5e0bdaff32ae9292d73a580030a4627f789526fc. There aren't too many other commits in the history, but many are filling up alphabet and alphabet size can easily be a multiplier in tokeniser size, I was hoping list arcs fix it a bit but it wasn't too effective. I think there might be a way to automate this with git bisect especially if keeping the analyser_relabelled-blah size constant might reveal something more...

snomos commented 10 months ago

Is this issue something we want to keep open? @flammie 's use of list arcs didn't help much, and my understanding is that the only thing left to do is a rewrite of parts of the hfst-pmatch code: in Karttunen's paper on pmatch, one of the features of the Xerox implementation is that it should save (disc and memory) space by storing reused FST constructs as references instead of copying in them in every instance. And the same goes for some built-in text manipulation functions, such as uppercasing.

To me this indicates that although the Hfst implementation is true to the original in linguistic features, it is not when it comes to implementation stuff that impacts memory consumption. And I believe this is a rather big omission on the Hfst part.

At the same time it is a major effort to rewrite the code, so I suggest that we for the time being just accepts the situation as it is, and close this issue.

Any thoughts?

unhammer commented 10 months ago

Well, it would be interesting to try to bisect and find out what commits were responsible for the jumps in size – are they all necessary, or could there be some low-hanging fruit? OTOH if there aren't currently plans to run it locally on phones or combine with other fst's then it's probably not a problem in practice, just an annoyance, so closing makes sense.

giellalt / lang-sme

tokeniser-gramcheck-gt-desc.pmhfst is 211M #52