jish / pre-commit

A slightly improved pre-commit hook for git
https://jish.github.io/pre-commit/
Other
796 stars 96 forks source link

filter_files spends a lot of time in the binary? check #255

Closed jfly closed 7 years ago

jfly commented 7 years ago

I changed filter_files to add some timestamping:

      def filter_files(files)
        start = Time.now #<<<
        new_files = files.reject do |file|
          !File.exists?(file) ||
          File.directory?(file) ||
          (
            size = File.size(file)
            size > 1_000_000 || (size > 20 && binary?(file))
          ) ||
          repo_ignores.any? { |ignore| File.fnmatch?(ignore, file) }
        end
        stop = Time.now #<<<
        puts "filter_files over #{files.length} files took #{stop - start}" #<<<
        new_files #<<<
      end

When running on my codebase, filter_files takes 42 seconds!

> time bundle exec pre-commit run git
filter_files over 4719 files took 42.205079945

Virtually all of this time is spent in binary?... presumably it just takes a while to open and read 4719 files. I tested this by changing binary? to just return false, and filter_files is now almost instant:

> time bundle exec pre-commit run git
filter_files over 4719 files took 0.05782050

Is it possible to skip the binary? check? It's a useful heuristic, but it just does not feel worth 40+ seconds of my time every time I want to run a full precommit check (admittedly not something I do all that often locally, but we do it in CI every time). I'd be okay just listing all my binary filetypes in .pre_commit.ignore instead.

jish commented 7 years ago

Yea, the binary? check has been a source of pain :( It is really tough to determine if a file is actually binary or not.

I would be open to improving performance here.

jfly commented 7 years ago

Yeah, I experienced some of that pain back in #251.

My thinking was that we could add an option for users of pre-commit to opt out of the binary file check. Anyone who opts out of the binary check would have to populate their .pre_commit.ignore file with all the extensions of files they don't want checked. I certainly wouldn't mind if the pre-commit gem started to include a list of extensions it knows not to bother checking (although this might be annoying if someone wants to write a plugin to do .... something with bmp files?)

What's the difference between "We could implement a blacklist like you suggest." and "We could treat files as binary based on extension."?

jish commented 7 years ago

"What's the difference between "We could implement a blacklist like you suggest." and "We could treat files as binary based on extension."?"

Oh, I was assuming you meant you wanted to list each individual binary file. I apologize if I misunderstood. :(

So I guess they both mean ignore files based on a list of extensions, rather than ignore files based on their full filenames.

jfly commented 7 years ago

Ahh, yeah listing individual files seems like just too much of a hassle. If you have a vision for what you want done here, I'd be happy to try to code it up and send in a PR. The simplest solution would be to just remove the binary? check =)