One sentence per line splitting is slower than it should be

ugermann / ssplit-cpp

Approximate reimplementation of the sentence splitter from the Moses toolkit.

Other

4 stars 9 forks source link

One sentence per line splitting is slower than it should be #5

Closed jerinphilip closed 3 years ago

jerinphilip commented 3 years ago

For the use-case of sentence mode to scan input lines from WNGT20 data without having to replace ssplit. If it's already specified that input is one-sentence-per-line, it makes sense to integrate a faster newline/scan based implementation instead of the regex based one.

The current implementation is regex based and slow (1000 sentences a minute). For WNGT20 data (1M) this takes forever to preprocess. I dropped in my own LineSplitter and verified this is indeed ssplit.

https://github.com/browsermt/bergamot-translator/commit/b25b2276e35cf7f0079a254793a042b714efdbcf

Opening this issue so I can refer to it in future benchmark/bergamot discussions.

ugermann commented 3 years ago

I've completely re-written most of it, switching from libpcrecpp to libpcre2. I've just pushed my fixes to the master branch. Due to the use of std::string_view compilation now requires C++17.

You should be able to update the submodule as follows:

cd path/to/ssplit-submodule
git pull

To commit the update to your branch, use something along the lines of git commit -m "Updated ssplit." path/to/ssplit-submodule

It is important NOT to include the trailing slash for this.

abhi-agg commented 3 years ago

Just wanted to bring up the issue that I opened yesterday.

Should we consider eliminating the requirement of pre-installed libpcre2 on user's system to be able to use ssplit-cpp? We can add pcre2 sources in 3rd-party, compile it and link the generated pcre2 in this project. This will make ssplit-cpp self contained.

jerinphilip commented 3 years ago

I replaced my linesplitter with what ssplit provides now (https://github.com/browsermt/bergamot-translator/commit/38e8b3cd6d5a2db561ce201c3e69fb79c676389c)

Currently getting runtimes that are more or less same as using my reduced implementation referenced above. This issue is resolved for the bergamot-translator use-case. Closing.