This eliminates all the logic about rejoining boundary splits. It was dumb; we were doing an initial split into tokens based on a Unicode-naive notion of "word characters", and then bandaging some words back together with a Unicode-aware regex, when we could instead just use the Unicode-aware regex to start with.
I actually wrote this intending it to just be a refactor that would make it easier to make further fixes to `diffWords`, but I think I inadvertently fixed two bugs along the way! Namely, those bugs are:
- that when tokenizing text containing Windows-style newlines, the carriage return and line feed characters would each get their own token instead of being grouped into a single newline token
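The single-pass approach described above can be sketched roughly as follows. This is an illustrative pattern, not the actual regex used by `diffWords`; the `tokenize` helper and its character classes are hypothetical:

```javascript
// Hypothetical sketch: tokenize in one pass with a Unicode-aware regex,
// rather than splitting on a naive notion of "word characters" and then
// rejoining pieces afterwards.
// \p{L} (letters) and \p{N} (numbers) require the regex 'u' flag. Listing
// \r\n before \s keeps a Windows newline together as a single token.
function tokenize(text) {
  return text.match(/[\p{L}\p{N}]+|\r\n|\s+|[^\s]/gu) ?? [];
}

// A CRLF stays grouped instead of becoming two separate tokens:
tokenize('foo\r\nbar'); // ['foo', '\r\n', 'bar']
```

Because the word-character class is built from Unicode property escapes, accented or non-Latin letters land in the same token as the rest of their word without any rejoining step.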
The test failure below is what you get if you run the new test on master (with a `removeEmpty` call added, like there used to be), and it demonstrates both behaviour changes:
Resolves #311.