roedoejet / g2p

Grapheme-to-Phoneme transductions that preserve input and output indices, and support cross-lingual g2p!
https://g2p-studio.herokuapp.com

running g2p convert on "large" files takes forever... #350

Closed marctessier closed 3 months ago

marctessier commented 3 months ago

I had a 4k-line file I wanted to g2p convert. After waiting a very long time, I decided to split my input file into 100-line chunks and try that while waiting for my original job to finish. Processing the 40 x 100-line chunks completed in no time compared to the original 4k-line file, which was still processing.

Something is wrong or working very inefficiently.

joanise commented 3 months ago

Oh boy, you're extending the intent of this functionality in a serious way! The fact that we accept a file at all, instead of only text on the command line, was a hidden hack, which only became documented because some users discovered it by accident. Now you want it to be efficient! LOL

The issue is probably that we load the entire file into memory and then process it. If large files are a real use case, we could convert it to line-oriented processing.
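A minimal sketch of what line-oriented processing could look like, assuming the conversion can be applied per line; the `convert` function here is a hypothetical stand-in for the real g2p transduction, not the library's API:

```python
import io

def convert(text: str) -> str:
    # hypothetical stand-in for the real g2p conversion
    return text.upper()

def convert_streaming(infile, outfile) -> None:
    # read and convert one line at a time instead of loading
    # the whole file into memory first
    for line in infile:
        outfile.write(convert(line))

src = io.StringIO("hello\nworld\n")
dst = io.StringIO()
convert_streaming(src, dst)
```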

dhdaines commented 3 months ago

I think it's more likely due to the way we apply rules, which is really quite "pessimal": each rule gets applied successively over the entire input, so it's O(MN), and if either M or N gets big, that can take a while.

If we didn't want to accept arbitrary Python regular expressions as inputs, we could compile them into a minimal deterministic FST ;-)

EDIT: there are probably also some side-effects of the way g2p works that mean that an FST would not necessarily be equivalent, either :-(

EDIT again: processing a line at a time, or, say, a token at a time, will also help a lot!
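To make the O(MN) point concrete, here's a toy sketch (not g2p's actual code) in which each of M rules rescans the entire N-character input; it also shows how rule order matters, which is part of why a naive FST compilation wouldn't automatically be equivalent:

```python
import re

# toy rule set; the real g2p rules are arbitrary Python regexes
RULES = [(re.compile(pat), repl) for pat, repl in [("a", "b"), ("b", "c")]]

def apply_rules(text: str) -> str:
    # each rule makes a full pass over the input:
    # M rules x N characters = O(MN)
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text
```

Note that `apply_rules("aa")` yields `"cc"`, not `"bb"`: the second rule sees the output of the first, a feeding interaction that an FST construction would have to model explicitly.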

joanise commented 3 months ago

@marctessier I get it now. I just did g2p convert filelist-text-only.txt eng eng-arpabet for testing Aidan's big EV PR, and boom, it's taking a long time!

@dhdaines I would be very surprised if g2p were convertible to an FST. I imagine your big-Oh analysis is in the right ballpark, and clearly line-oriented (or token-oriented, as you suggest) processing will make a big difference. I guess token would be best, since that's less likely to trip on long lines, but line is easiest to implement...
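A rough sketch of token-at-a-time processing, assuming tokens can be converted independently; the `\w+` tokenizer and the `convert` callback are illustrative stand-ins, not g2p's actual tokenization:

```python
import re

def convert_by_token(line: str, convert) -> str:
    # convert each word token independently; whitespace and
    # punctuation pass through unchanged, so even a very long
    # line feeds the converter only one short token at a time
    return re.sub(r"\w+", lambda m: convert(m.group(0)), line)
```

For example, `convert_by_token("ab cd!", str.upper)` converts `"ab"` and `"cd"` separately while leaving the space and `!` untouched.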

joanise commented 3 months ago

@dhdaines I think the way we build the transduction graph, by inserting and removing nodes and edges, is going to be part of the issue: each operation probably costs O(N), with N ~ the length of the input.
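The cost pattern described above can be illustrated with a plain Python list standing in for the transduction graph: each insertion near the front shifts everything after it, so N insertions cost O(N^2) overall:

```python
def build_by_front_insert(n: int) -> list:
    nodes = []
    for i in range(n):
        # list.insert(0, ...) shifts every existing element,
        # O(len(nodes)) per call, so the whole loop is O(n^2),
        # analogous to repeated node insertions in a growing graph
        nodes.insert(0, i)
    return nodes
```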

joanise commented 3 months ago

@marctessier You'll appreciate #358