phenomnomnominal / betterer

betterer makes it easier to make incremental improvements to your codebase
MIT License

Running on a large monorepo with 20 regexp tests results in "EMFILE: Too many files open" errors #1050

Open eohehir-coursera opened 2 years ago

eohehir-coursera commented 2 years ago

Hi there! We're trying to run Betterer on my company's repository and we're running into some issues. Some of these issues are because the repository we're running these on is a monorepo, so there's a lot of code in there to begin with.

We're only able to run 3-5 tests successfully at a time. All other tests fail with the error "EMFILE: Too many files open", and a warning at the top reads "MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 message listeners added to [Worker]. Use emitter.setMaxListeners() to increase limit."

Here's how it's currently running:

There are 20 total tests, all of which are regexp file tests and look almost exactly like this:

import { regexp } from '@betterer/regexp';

const starterFilePath = 'static/**/*';
const excludedFiles = [/__generated__/, /__stories__/, /__snapshots__/, /\/nls\//, /nls_json/, /vendor/];

...
export default {
  // There are 20 tests that look pretty much like this
  'remove ts-ignore': () =>
    regexp(/(ts-ignore)/i)
      .include(`${starterFilePath}.[jt]s{,x}`)
      .exclude(excludedFiles),
}

We're running betterer across the codebase by running yarn betterer --cache. That starterFilePath points to a folder that has 57,208 items inside, with several subfolders underneath. I know, there's a lot there!

I'm trying to see if I can get those 20 tests to properly run without running into those EMFILE issues. Here's what I've tried so far, and unfortunately nothing has worked just yet:

  1. Adjusting ulimit to increase the total number of open files for my computer (set to unlimited for every possible option now and still fails)
  2. Running betterer --cache --workers false to test sequentially (still get EMFILE errors)
  3. Running betterer --cache 'static/**/*' from the command line (runs much slower and still get EMFILE errors)
  4. Running betterer --cache 'static/**/*' --exclude /__generated__/ /__stories__/ /__snapshots__/ /\/nls\// /nls_json/ /vendor/ (to match the starter file path and exclusions in the tests, still runs slow and still fails)

It works just fine when we only run one test at a time using .only(), and if the cache detects that files are unchanged, it's able to get through the previously run tests okay. Is there anything else we can do to make sure all the right tests are running?

phenomnomnominal commented 2 years ago

Hey! Yeah, that doesn't particularly surprise me. Each test is going to create a worker thread and then have to open all those files and check them. My suggestion would be to create a custom test (you can probably copy the existing regexp test) and run all 20 regexps on the files in that one test.

This will be much faster, use much less memory, and only open each file once. You might still hit memory issues, so you would need to increase the node heap size.
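
To make the "run all the regexps in one test" idea concrete, here is a minimal sketch of the scanning step only, assuming the file's text has already been read once. The names NamedPattern, PatternMatch, scanText and countByRule are hypothetical helpers for illustration, not part of Betterer; wiring the results into a custom Betterer file test is left out because it depends on the Betterer version in use. Each match is tagged with the name of the rule that produced it, which is one possible way to keep the different kinds of issue distinguishable inside a single test.

```typescript
// Hypothetical helper types for illustration; these are not Betterer APIs.
interface NamedPattern {
  name: string; // label for the rule, e.g. 'remove ts-ignore'
  pattern: RegExp;
}

interface PatternMatch {
  rule: string; // name of the rule that matched
  start: number; // offset of the match in the file text
  end: number; // offset just past the match
  text: string; // the matched text
}

// Run every pattern over one file's text, so the file is only read once
// no matter how many patterns there are.
function scanText(fileText: string, rules: NamedPattern[]): PatternMatch[] {
  const matches: PatternMatch[] = [];
  for (const { name, pattern } of rules) {
    // Work on a global copy so exec() advances and the caller's RegExp
    // keeps its own lastIndex untouched.
    const flags = pattern.flags.indexOf('g') >= 0 ? pattern.flags : pattern.flags + 'g';
    const globalPattern = new RegExp(pattern.source, flags);
    let m: RegExpExecArray | null;
    while ((m = globalPattern.exec(fileText)) !== null) {
      if (m[0].length === 0) {
        // Skip empty matches so the loop always makes progress.
        globalPattern.lastIndex++;
        continue;
      }
      matches.push({ rule: name, start: m.index, end: m.index + m[0].length, text: m[0] });
    }
  }
  return matches;
}

// Tally matches per rule so each kind of issue can still be reported separately.
function countByRule(matches: PatternMatch[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const match of matches) {
    counts[match.rule] = (counts[match.rule] ?? 0) + 1;
  }
  return counts;
}
```

In a real custom test the rule name would most likely go into each issue's message, so the results file shows which pattern each issue came from. Whether Betterer then reports those as separate totals is not something this sketch settles.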

eohehir-coursera commented 2 years ago

Good suggestions, thank you! I'll take another look and see about running one larger test.

If I do that, will I still be able to log the "errors" as separate types of issues, or will those all be consolidated under one big number? I'd like to see about tracking these issues from the different regular expressions I have separately still, if possible.

kubaprzetakiewicz commented 2 months ago

@eohehir-coursera wondering, how did you get around this issue in the end?

i've just introduced concurrency limiting which works fine i guess? (ty Copilot btw)

however, i'd rather do what @phenomnomnominal suggested, but then would hit the same issue you were wondering about - where issues wouldn't be split into their own categories
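
The comment above doesn't say how the concurrency limiting was implemented. As a rough sketch of one common approach, the helper below runs an async task over a list while keeping at most `limit` tasks in flight, which is the usual way to avoid opening too many files at once. The name mapWithLimit is hypothetical and not part of Betterer or Node.

```typescript
// Hypothetical helper for illustration: run `task` over `items` with at most
// `limit` tasks in flight at any time, keeping results in input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each worker keeps pulling the next unclaimed item until none are left.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  // Start at most `limit` workers (and always at least one).
  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

For file scanning, the task could be something like `(filePath) => fs.readFile(filePath, 'utf8')` using Node's `fs/promises`, with a limit chosen well below the open-file limit; the right number is a matter of experimentation.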

eohehir-coursera commented 1 month ago

I was never able to get further than this point since I had to move onto other work, sorry to say!