This eliminates all the logic about rejoining boundary splits. It was dumb; we were doing an initial split into tokens based on a Unicode-naive notion of "word characters", and then bandaging some words back together with a Unicode-aware regex, when we could instead just use the Unicode-aware regex to start with.
I actually wrote this intending it to just be a refactor that would make it easier to make further fixes to `diffWords`, but I think I inadvertently fixed two bugs along the way! Namely, those bugs are:
- that when tokenizing text containing Windows-style newlines, the carriage return and line feed characters would each get their own token instead of being grouped into a single newline token
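The single-pass approach described above can be sketched roughly as follows. This is an illustrative pattern, not the actual regex used by `diffWords`; the `tokenize` helper and its character classes are hypothetical:

```javascript
// Hypothetical sketch: tokenize in one pass with a Unicode-aware regex,
// rather than splitting on a naive notion of "word characters" and then
// rejoining pieces afterwards.
// \p{L} (letters) and \p{N} (numbers) require the regex 'u' flag. Listing
// \r\n before \s keeps a Windows newline together as a single token.
function tokenize(text) {
  return text.match(/[\p{L}\p{N}]+|\r\n|\s+|[^\s]/gu) ?? [];
}

// A CRLF stays grouped instead of becoming two separate tokens:
tokenize('foo\r\nbar'); // ['foo', '\r\n', 'bar']
```

Because the word-character class is built from Unicode property escapes, accented or non-Latin letters land in the same token as the rest of their word without any rejoining step.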
The test failure below is what you get if you run the new test on master (with a `removeEmpty` call added, like there used to be), and it demonstrates both behaviour changes:
Resolves #311.