mookid / diffr

Yet another diff highlighting tool
MIT License
574 stars 22 forks source link

Improve performance for diffs with very large hunks #14

Open mookid opened 5 years ago

mookid commented 5 years ago

See https://scripter.co/git-diff-minified-js-and-css/

On the repository https://github.com/gohugoio/hugoDocs, git log -p -- themes/gohugoioTheme/static/dist/app.bundle.js | diffr takes ~9min on my machine.

One problematic commit: 0960c5fb36e9abcf3b492a523943268e89066543

Explore possible workarounds:

mlncn commented 2 years ago

I'm hitting diffs where git add -p can take hours, and it took a while to realize diffr was the cause— i had gotten used to beautiful, useful diffs and forgotten i was not on git standard.

One thing i do not understand is why it has seemingly suddenly gotten much worse for me. As in, in just the last week, went from never thinking about it (effectively instant) to being completely unusable, across multiple projects of the kind i always work with. Some underlying tool that might have gotten an update?

Anyhow strong vote for some way to detect and bail on long lines. Maybe something configurable, and to not try to diff lines longer than 500 characters by default.

mookid commented 2 years ago

sorry about the bad experience. of course, if you have a reproduction case, I would be very happy to have a look.

An relatively easy way to proceed would be to periodically check for maximal processing time for a given hunk, and bail if necessary (and fallback to the default diff processing)

I will most likely no do the coding myself (my company is not very supportive of open source at all) but I can provide some guidance if anyone would want try to tackle it

Hi-Angel commented 1 year ago

Part of a reason for the slowness might be a massive amount of generating excess escape sequences. See, e.g. this two lines change looks like this with the escape sequences:

image

Clearly, most of them are unneeded.

mookid commented 1 year ago

I think that the slow cases are bad cases of the Myers algo, which is quadratic in number of tokens if the before/after are very different. I don't think that the added control chars are enough to make it slower by itself.

CervEdin commented 1 year ago

I experience the issue with VERY large diffs. 500 lines was fine but 1k+ it becomes a problem.

Here's a script to reproduce the issue

tr -dc a-z1-4 </dev/urandom |
  tr 1-2 ' \n' |
  awk 'length==0 || length>50' |
  tr 3-4 ' ' |
  sed 's/^ *//' |
  cat -s |
  sed 's/ / /g' |
  fmt |
  head -n 2000 > /tmp/random-1.txt &&
  tr -dc a-z1-4 </dev/urandom |
  tr 1-2 ' \n' |
  awk 'length==0 || length>50' |
  tr 3-4 ' ' |
  sed 's/^ *//' |
  cat -s |
  sed 's/ / /g' |
  fmt |
  head -n 2000 > /tmp/random-2.txt  &&
  git diff --no-index /tmp/random-1.txt /tmp/random-2.txt |
  diffr
mookid commented 1 year ago

hi @CervEdin

thanks for the report! I have a patch for that, which is very simple. unfortunately, my company does not really allow open source work, so this is stuck there for a while.

CervEdin commented 1 year ago

Hi @mookid

That's great! However, I'm a bit perplexed. Aren't you the maintainer?

mookid commented 1 year ago

I am. I can provide some help to someone motivated to write the code, but can't submit it myself.

The Myers algorithm is incremental: if find the shortest path of length D for each D increasing; my patch just stops after D large-ish (if D is large, the diff will have poor readability annyway) and just falls back to "remove everything, add everything".

CervEdin commented 1 year ago

Fair enough. Maybe I take a look at some point. For now I just switched back the pager of things like git log to diff-highlight

On Thu, Apr 20, 2023, 4:17 AM Nathan Moreau @.***> wrote:

I am. I can provide some help to someone motivated to write the code, but can't submit it myself.

The Myers algorithm is incremental: if find the shortest path of length D for each D increasing; my patch just stops after D large-ish (if D is large, the diff will have poor readability annyway) and just falls back to "remove everything, add everything".

— Reply to this email directly, view it on GitHub https://github.com/mookid/diffr/issues/14#issuecomment-1515616491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXL5ESXXHC5NYXPZZO7DE3XCCMCLANCNFSM4IFNDB6Q . You are receiving this because you were mentioned.Message ID: @.***>

mookid commented 1 year ago

@CervEdin #96 should help with your problem, probably (your script does not work on my system for some reason, I get tr: Illegal byte sequence when running it)

CervEdin commented 1 year ago

@mookid thanks! Maybe the test script I wrote is using something non POSIX. It works in WSL and git-bash.

I was able to test that 9bfc9f264739c2324d7fcd1f66038d06b67ff116 allows using --large-diff-threshold 200 , to skip. But I don't think I saw a mention of the flag in --help.

I'm wasn't able to cargo install --path . in master, geting issues with is_terminal.

error[E0658]: use of unstable library feature 'is_terminal'
 --> src/cli_args.rs:7:5
  |
7 | use std::io::IsTerminal;
  |     ^^^^^^^^^^^^^^^^^^^
  |
  = note: see issue #98070 <https://github.com/rust-lang/rust/issues/98070> for more information

Thank you for adding this feature, it's going to make the tool much more usable, at least for me ♥

mookid commented 1 year ago

I was able to test that 9bfc9f264739c2324d7fcd1f66038d06b67ff116 allows using --large-diff-threshold 200 , to skip. But I don't think I saw a mention of the flag in --help.

no, I made it a 'private' flag for now

I'm wasn't able to cargo install --path . in master, geting issues with is_terminal.

you need a recent-enough version of the rust toolchain, as is_terminal has been stabilised relatively recently; or use the unstable version: cargo +nightly install --path . probably works (not tested)

Thank you for adding this feature, it's going to make the tool much more usable, at least for me ♥

<3