Closed jerinphilip closed 3 years ago
I've completely re-written most of it, switching from libpcrecpp to libpcre2. I've just pushed my fixes to the master branch. Due to the use of std::string_view compilation now requires C++17.
You should be able to update the submodule as follows:
cd path/to/ssplit-submodule
git pull
To commit the update to your branch, use something along the lines of
git commit -m "Updated ssplit." path/to/ssplit-submodule
It is important NOT to include the trailing slash for this.
Just wanted to bring up the issue that I opened yesterday.
Should we consider eliminating the requirement of pre-installed libpcre2 on user's system to be able to use ssplit-cpp? We can add pcre2 sources in 3rd-party, compile it and link the generated pcre2 in this project. This will make ssplit-cpp self contained.
I replaced my linesplitter with what ssplit provides now (https://github.com/browsermt/bergamot-translator/commit/38e8b3cd6d5a2db561ce201c3e69fb79c676389c)
Currently getting runtimes that are more or less same as using my reduced implementation referenced above. This issue is resolved for the bergamot-translator use-case. Closing.
For the use-case of sentence mode to scan input lines from WNGT20 data without having to replace ssplit. If it's already specified that input is one-sentence-per-line, it makes sense to integrate a faster newline/scan based implementation instead of the regex based one.
The current implementation is regex based and slow (1000 sentences a minute). For WNGT20 data (1M) this takes forever to preprocess. I dropped in my own LineSplitter and verified this is indeed ssplit.
https://github.com/browsermt/bergamot-translator/commit/b25b2276e35cf7f0079a254793a042b714efdbcf
Opening this issue so I can refer to it in future benchmark/bergamot discussions.