Open unhammer opened 2 years ago
@flammie did look into memory consumption for pmhfst files a while ago. Maybe he has some ideas.
Mm, there are some things that legit multiplied the automaton size, eg. upcase in https://github.com/giellalt/lang-sme/commit/5e0bdaff32ae9292d73a580030a4627f789526fc. There aren't too many other commits in the history, but many are filling up alphabet and alphabet size can easily be a multiplier in tokeniser size, I was hoping list arcs fix it a bit but it wasn't too effective. I think there might be a way to automate this with git bisect especially if keeping the analyser_relabelled-blah size constant might reveal something more...
Is this issue something we want to keep open? @flammie 's use of list arcs didn't help much, and my understanding is that the only thing left to do is a rewrite of parts of the hfst-pmatch
code: in Karttunen's paper on pmatch
, one of the features of the Xerox implementation is that it should save (disc and memory) space by storing reused FST constructs as references instead of copying in them in every instance. And the same goes for some built-in text manipulation functions, such as uppercasing.
To me this indicates that although the Hfst implementation is true to the original in linguistic features, it is not when it comes to implementation stuff that impacts memory consumption. And I believe this is a rather big omission on the Hfst part.
At the same time it is a major effort to rewrite the code, so I suggest that we for the time being just accepts the situation as it is, and close this issue.
Any thoughts?
Well, it would be interesting to try to bisect and find out what commits were responsible for the jumps in size – are they all necessary, or could there be some low-hanging fruit? OTOH if there aren't currently plans to run it locally on phones or combine with other fst's then it's probably not a problem in practice, just an annoyance, so closing makes sense.
Where did we go wrong?