kpdecker / jsdiff

A javascript text differencing implementation.
BSD 3-Clause "New" or "Revised" License

Simplify tokenization logic in diffWords #494

Closed ExplodingCabbage closed 4 months ago

ExplodingCabbage commented 4 months ago

This eliminates all the logic about rejoining boundary splits. It was dumb; we were doing an initial split into tokens based on a Unicode-naive notion of "word characters", and then bandaging some words back together with a Unicode-aware regex, when we could instead just use the Unicode-aware regex to start with.
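To make the difference between the two approaches concrete, here is a minimal sketch. It is not the regex jsdiff actually uses: the `TOKEN` pattern and the `tokenize` helper are hypothetical, and the definition of "word character" (Unicode letter, number, or underscore) is an assumption chosen for illustration.

```javascript
// Illustrative only: NOT jsdiff's real tokenizer.
// Assumption: a "word character" is any Unicode letter, number, or underscore.
// Alternatives, in order: a CRLF pair, a lone LF or CR, a run of other
// whitespace, a run of word characters, or a single remaining character.
const TOKEN = /\r\n|\n|\r|[^\S\r\n]+|[\p{L}\p{N}_]+|[^\s\p{L}\p{N}_]/gu;

// Hypothetical helper: split text into tokens in a single Unicode-aware pass.
function tokenize(text) {
  return text.match(TOKEN) || [];
}

// Unicode-naive split: \b only recognises ASCII word characters,
// so it sees a boundary between "m" and "á" and none between "á" and "-".
const naive = 'anim\u00e1-los'.split(/\b/);
// naive is ["anim", "á-", "los"]

// Unicode-aware tokenization keeps the word and the CRLF pairs intact.
const aware = tokenize('anim\u00e1-los\r\n\r\n(wibbly wobbly');
// aware is ["animá", "-", "los", "\r\n", "\r\n", "(", "wibbly", " ", "wobbly"]
```

Because the pattern's alternatives cover every possible character, a single pass yields the final tokens and there is no later step that has to glue mis-split words back together.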

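I actually wrote this intending it to just be a refactor that would make it easier to make further fixes to diffWords, but I think I inadvertently fixed two bugs along the way! Namely, those bugs are:

- A word containing an accented character could be split in the middle, with the accented letter glued onto the following punctuation: "animá-los" was tokenized as "anim", "á-", "los" instead of "animá", "-", "los".
- A "\r\n" line ending was split into separate "\r" and "\n" tokens instead of being kept as a single "\r\n" token.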

The test failure below is what you get if you run the new test on master (with a removeEmpty call added, like there used to be), and demonstrates both behaviour changes:

      AssertionError: expected [ Array(35) ] to deeply equal [ Array(33) ]
      + expected - actual

         "\n"
         "\n"
         "\n"
         "  "
      -  "anim"
      -  "á-"
      +  "animá"
      +  "-"
         "los"
      -  "\r"
      -  "\n"
      -  "\r"
      -  "\n"
      +  "\r\n"
      +  "\r\n"
         "("
         "wibbly"
         " "
         "wobbly"

Resolves #311.