sds / scss-lint

Configurable tool for writing clean, consistent SCSS
MIT License

Allow multiple workers per file to run linters #576

Open srawlins opened 9 years ago

srawlins commented 9 years ago

I'm looking into ways to make scss-lint faster. If I were being rigorous, this would include gathering more data [1].

One idea that is working well with one sample file is to multithread SCSSLint::Runner#find_lints. I measured a speedup when splitting the list of @linters into 2, 3, or 4 sublists and giving each sublist to a separate thread. Four threads gave the best speedup, cutting time by about 25%. Six and eight threads were no better (probably because, with ~50 linters, sublists of 12 versus 6 linters are not that different).
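
The splitting approach above can be sketched roughly like this (a minimal illustration, not the actual scss-lint code; each "linter" here is just a callable returning an array of lints, standing in for a Linter#run call):

```ruby
require 'thread'

# Split the linters into `workers` sublists and run each sublist in its
# own thread, collecting all lints into one thread-safe queue.
def find_lints_threaded(linters, workers: 4)
  results = Queue.new # Queue is thread-safe, so threads can push concurrently

  slice_size = (linters.size / workers.to_f).ceil
  threads = linters.each_slice(slice_size).map do |sublist|
    Thread.new do
      sublist.each { |linter| linter.call.each { |lint| results << lint } }
    end
  end
  threads.each(&:join)

  # Drain the queue into a plain array for the caller
  Array.new(results.size) { results.pop }
end
```

Under MRI the GIL means this only helps where linters release the lock (e.g. during I/O), which is consistent with the modest ~25% gain observed.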

Would the project be open to a PR with a --workers option (I'm open to other names)? A second idea is to partition the SCSS files into smaller groups and multithread over those, since even MRI can profit from each thread waiting on file reads. I can open a second issue for that idea.

[1] Properly, I'd like to look at some large, real-world SCSS files (maybe Bootstrap 4 or some big public Rails apps...) and gather data from there.

But I don't have plans to do this yet.

sds commented 9 years ago

I've thought about this as well, and would be open to accepting a pull request. scss-lint should be relatively straightforward to parallelize.

I would like the flag to be called --concurrency/-C, accepting either a numeric value or auto to indicate that scss-lint should pick a number that "makes sense" based on the number of cores available. I haven't done any research on this, but if you find that, on average, half the number of cores is optimal, let's go with that.
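
Resolving such a flag could look like this (a sketch under the "half the cores" heuristic floated above, which is an assumption, not a benchmarked default; Etc.nprocessors requires Ruby 2.2+):

```ruby
require 'etc'

# Turn a --concurrency value into a worker count: a plain number is used
# as-is, and "auto" means half the available cores (never fewer than 1).
def resolve_concurrency(value)
  return Integer(value) unless value == 'auto'
  [Etc.nprocessors / 2, 1].max
end
```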

My other comment is that you can go further than splitting the linters into groups. If you create a queue of all linter/file pairs to be run, you can spawn concurrency workers that all pull from the same queue, so each "job" is handed one linter instance and one file to lint. I'm willing to bet that the cost of instantiating a linter per job would still be cheaper than trying to reuse the same linter instance across multiple jobs. In theory this should maximize parallelism.
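
The job-queue idea can be sketched as follows (an illustration only: linters and files are plain placeholders here, not scss-lint classes, and the worker/shutdown mechanics are one possible design):

```ruby
require 'thread'

# Treat every (linter, file) pair as one job and have `concurrency`
# worker threads pull jobs from a single shared queue.
def run_jobs(linters, files, concurrency: 4)
  jobs = Queue.new
  linters.product(files).each { |pair| jobs << pair }
  concurrency.times { jobs << :done } # poison pills so workers terminate

  results = Queue.new
  workers = concurrency.times.map do
    Thread.new do
      while (job = jobs.pop) != :done
        linter, file = job
        results << linter.call(file) # each job runs one linter on one file
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end
```

Because the pills are enqueued after all real jobs, every pair is processed before any worker shuts down.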

Thanks for your thoughts. Looking forward to seeing what you come up with!

sds commented 8 years ago

I had a quick poke at parallelizing using the parallel gem, which forks separate processes under the hood, sidestepping MRI's GIL (which limits thread-based parallelism). This also saves us from having to make any of the code thread-safe, since workers are isolated in separate processes.

The patch is pretty straightforward:

diff --git a/lib/scss_lint/runner.rb b/lib/scss_lint/runner.rb
index 16f4b5a..5cde35f 100644
--- a/lib/scss_lint/runner.rb
+++ b/lib/scss_lint/runner.rb
@@ -1,3 +1,5 @@
+require 'parallel'
+
 module SCSSLint
   # Finds and aggregates all lints found by running the registered linters
   # against a set of SCSS files.
@@ -15,9 +17,9 @@ module SCSSLint
     # @param files [Array<Hash>] list of file object/path hashes
     def run(files)
       @files = files
-      @files.each do |file|
+      @lints = ::Parallel.map(@files) do |file|
         find_lints(file)
-      end
+      end.flatten
     end

   private
@@ -26,29 +28,35 @@ module SCSSLint
     # @option file [String] File object
     # @option path [String] path to File (determines which Linter config to apply)
     def find_lints(file)
-      engine = Engine.new(file)
-
-      @linters.each do |linter|
-        begin
-          run_linter(linter, engine, file[:path])
-        rescue => error
-          raise SCSSLint::Exceptions::LinterError,
-                "#{linter.class} raised unexpected error linting file #{file[:path]}: " \
-                "'#{error.message}'",
-                error.backtrace
+      accumulated_lints = []
+
+      begin
+        engine = Engine.new(file)
+
+        @linters.each do |linter|
+          begin
+            run_linter(linter, engine, file[:path], accumulated_lints)
+          rescue => error
+            raise SCSSLint::Exceptions::LinterError,
+                  "#{linter.class} raised unexpected error linting file #{file[:path]}: " \
+                  "'#{error.message}'",
+                  error.backtrace
+          end
         end
+      rescue Sass::SyntaxError => ex
+        accumulated_lints << Lint.new(nil, ex.sass_filename, Location.new(ex.sass_line),
+                          "Syntax Error: #{ex}", :error)
+      rescue FileEncodingError => ex
+        accumulated_lints << Lint.new(nil, file[:path], Location.new, ex.to_s, :error)
       end
-    rescue Sass::SyntaxError => ex
-      @lints << Lint.new(nil, ex.sass_filename, Location.new(ex.sass_line),
-                         "Syntax Error: #{ex}", :error)
-    rescue FileEncodingError => ex
-      @lints << Lint.new(nil, file[:path], Location.new, ex.to_s, :error)
+
+      accumulated_lints
     end

     # For stubbing in tests.
-    def run_linter(linter, engine, file_path)
+    def run_linter(linter, engine, file_path, accumulated_lints)
       return if @config.excluded_file_for_linter?(file_path, linter)
-      @lints += linter.run(engine, @config.linter_options(linter))
+      accumulated_lints.concat(linter.run(engine, @config.linter_options(linter)))
     end
   end
 end
diff --git a/scss_lint.gemspec b/scss_lint.gemspec
index aa83e49..634136a 100644
--- a/scss_lint.gemspec
+++ b/scss_lint.gemspec
@@ -25,6 +25,7 @@ Gem::Specification.new do |s|

   s.required_ruby_version = '>= 1.9.3'

+  s.add_dependency 'parallel', '~> 1.6'
   s.add_dependency 'rainbow', '~> 2.0'
   s.add_dependency 'sass', '~> 3.4.15'

This worked surprisingly well for one of our large repositories: execution time dropped from 5 seconds to 2 seconds.

However, running this against another of our large legacy repositories results in a strange `marshal data too short` error. I think it has something to do with the number of lints returned by the find_lints method, since the error disappears if I change that method to return an empty array. This means we might need to serialize the lints to a file and then deserialize them in the master process to work around the issue.
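
The file-based workaround could look roughly like this (a sketch only, using plain hashes rather than scss-lint's Lint objects: each worker dumps its lints to a temp file with Marshal, and the master reads the files back instead of receiving marshaled data over a pipe):

```ruby
require 'tempfile'

# Worker side: serialize a batch of lints to a temp file and return its path.
def dump_lints(lints)
  file = Tempfile.new('lints')
  file.binmode
  file.write(Marshal.dump(lints))
  file.close
  file.path
end

# Master side: read a worker's temp file back into lint objects.
def load_lints(path)
  Marshal.load(File.binread(path))
end
```

Parallel.map would then return only the small path strings, keeping the inter-process payload tiny regardless of how many lints a file produces.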