oxsecurity / megalinter

🦙 MegaLinter analyzes 50 languages, 22 formats, 21 tooling formats, excessive copy-pastes, spelling mistakes and security issues in your repository sources with a GitHub Action, other CI tools or locally.
https://megalinter.io
GNU Affero General Public License v3.0

Concurrent parallel jobs in Gitlab CI (and others) #2303

Closed bstrdsmkr closed 1 year ago

bstrdsmkr commented 1 year ago

Is your feature request related to a problem? Please describe.
We can't currently make full use of available resources on other machines.

Describe the solution you'd like
An option to only run a subset of selected linters based on environment variables that indicate the total number of jobs in the array and the current job's index in the array, using CI features like GitLab's parallel: directive.

Describe alternatives you've considered
Hacked together the following solution myself, but this could be much more efficient if properly integrated:

import os
from itertools import cycle

from megalinter import Megalinter

megalinter = Megalinter({"cli": True})
# deactivates linters that don't match any files
megalinter.collect_files()

# sort the active linters by their rough speed rating
active_linters = sorted(
    (linter for linter in megalinter.linters if linter.is_active),
    key=lambda x: x.linter_speed,
)

node_index = int(os.environ.get("CI_NODE_INDEX")) - 1  # GitLab env var isn't zero-indexed
node_total = int(os.environ.get("CI_NODE_TOTAL"))

# create an empty list for each node in the job array
node_buckets = [[] for _ in range(node_total)]

# round-robin through the buckets, dropping linters into each bucket in order of linter_speed
queue = cycle(node_buckets)
for linter in active_linters:
    next(queue).append(linter)

with open("job_linters", "w") as f:
    f.write(",".join(linter.name for linter in node_buckets[node_index]))

Initializing the class outputs a bunch of stuff that there doesn't seem to be a way to disable, so I write the list of linters to a file and then export the content of that file as ENABLE_LINTERS.

I've also made use of the linter_speed attribute in a naive attempt to normalize the load across all jobs in the array.

Kurt-von-Laven commented 1 year ago

Can you elaborate on how greater efficiency could be achieved via integration of this proposal into MegaLinter? My gut reaction is that ENABLE_LINTERS offers a general mechanism by which parallelization across machines can be performed. I am less certain that accepting the total number of jobs and current job's index would generalize well, but, as always, open to being educated. I believe you already know this, but in case it helps others later on, MegaLinter already parallelizes work across multiple cores within a single machine.

Since you have recorded the linter speeds, you could use a well-established bin-packing algorithm to reduce the time spent waiting on the slowest runner. I haven't specifically investigated it, but binpacking is an example of a package that does this which you could use or emulate. That being said, I would expect linter runtimes to be roughly power-law distributed, meaning the runtime is dominated by a small number of linters (possibly as few as one or two). As a consequence, a simple-minded approach may be to dedicate a full job to each of the slowest linters and then run all (or half, a third, etc.) of the remaining linters in each of the remaining jobs. For those with an abundance of runners, I suppose a single job per linter is an option, although it certainly wouldn't be as efficient.

Monorepos may be able to boost performance via sparse checkouts, scanning different directories in different jobs, possibly with some redundancy of shared config files, for example. On a side note, it wouldn't surprise me if adjusting the log level significantly impacts performance, so that could be another win for some people. We use LOG_LEVEL=warning partly for this reason. Everyone concerned with performance should also make sure they are using the appropriate flavor if possible.
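For illustration, a minimal sketch of that bin-packing variant of the script from the original post; it assumes the third-party binpacking package from PyPI (its to_constant_bin_number function splits a dict of weights into a fixed number of bins) and otherwise reuses the same MegaLinter internals as above:

import os

import binpacking  # third-party: pip install binpacking

from megalinter import Megalinter

megalinter = Megalinter({"cli": True})
megalinter.collect_files()

# weight each active linter by its rough speed rating
weights = {
    linter.name: linter.linter_speed
    for linter in megalinter.linters
    if linter.is_active
}

node_index = int(os.environ["CI_NODE_INDEX"]) - 1  # GitLab index is 1-based
node_total = int(os.environ["CI_NODE_TOTAL"])

# split into exactly node_total bins with roughly equal total weight
bins = binpacking.to_constant_bin_number(weights, node_total)

with open("job_linters", "w") as f:
    f.write(",".join(bins[node_index].keys()))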

bstrdsmkr commented 1 year ago

I suppose I really meant ergonomic instead of efficient above. My approach still uses ENABLE_LINTERS as a general parallelism mechanism: since the script is run with the same arguments and environment as the actual linter runs, any linters the user has enabled/disabled are respected, just dynamically split.

As far as a native implementation being more ergonomic, I'd like to see a flag like "ENABLE_JOB_ARRAYS" that looked up, say, MEGALINTER_ARRAY_INDEX and MEGALINTER_ARRAY_TOTAL. This mode would process and decide linters to run as normal, but with one extra step -- only run X/Y of the selected linters (sorted by whatever mechanism turns out to be appropriate). Ideally, if enabled and the MegaLinter-specific variables aren't set, it would fall back to the common CI-platform-specific variables (GitLab, GitHub, CircleCI, and Azure support this type of parallelism that I know of; I'd be shocked if most others didn't).
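To make the idea concrete, a hypothetical sketch of that lookup; MEGALINTER_ARRAY_INDEX / MEGALINTER_ARRAY_TOTAL and the fallback order are only the proposal, not existing options, and of the platform variables I'm only sure of GitLab's (1-based) and CircleCI's (0-based):

import os


def resolve_array_position():
    """Return a zero-based (index, total) pair for this CI job."""
    if "MEGALINTER_ARRAY_INDEX" in os.environ:  # proposed, not existing
        return (
            int(os.environ["MEGALINTER_ARRAY_INDEX"]),
            int(os.environ["MEGALINTER_ARRAY_TOTAL"]),
        )
    if "CI_NODE_INDEX" in os.environ:  # GitLab CI: 1-based index
        return (
            int(os.environ["CI_NODE_INDEX"]) - 1,
            int(os.environ["CI_NODE_TOTAL"]),
        )
    if "CIRCLE_NODE_INDEX" in os.environ:  # CircleCI: 0-based index
        return (
            int(os.environ["CIRCLE_NODE_INDEX"]),
            int(os.environ["CIRCLE_NODE_TOTAL"]),
        )
    return (0, 1)  # not a job array: run everything in this single job


def select_for_this_job(active_linters):
    """Keep only the round-robin slice of linters assigned to this job."""
    index, total = resolve_array_position()
    return active_linters[index::total]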

Why is this useful? We dedicate most resources to the running applications and keep linting/testing resources on a fairly tight leash (128 millicores of CPU, 128 MB of RAM until proven otherwise), since 4 minutes vs. 5 minutes for a lint run isn't that big of a deal. We often have a lot of available resources spread throughout the cluster, but rarely in one contiguous chunk on one machine. If we have to reserve resources to be able to run all the linters we want in a single job, the job has to queue until a hole that size opens up for the appropriate amount of time. This queue time often ends up much longer than the actual time it takes to run the linters. This is pretty standard fare in the HPC world, where the machines are expensive, so the goal is to optimize the amount of time the cores are busy over the actual runtime of an individual job. This variable-based approach lets the end user move the dial between faster runs (a smaller number of larger runs) and faster total pipeline time (a larger number of smaller runs that can squeeze into the gaps, up to a point of diminishing returns of course) without having to hard-code which particular linters they want where.

In summary, I think MegaLinter is currently sanely optimized for compute speed efficiency. In our case, we'd like the option to optimize for resource availability in exchange for a loss in raw speed per lint.

nvuillam commented 1 year ago

Basically... you'd like to be able to build a cluster of MegaLinters, with some load balancer dispatching linter jobs between the members of the cluster and unifying results at the end? :)

bstrdsmkr commented 1 year ago

In a manner of speaking...

The code from my original post really does what I want, but relies on some internal stuff that could easily get broken, so it would be much better if MegaLinter supported this natively. From MegaLinter's point of view, it would look much like the script above: collect the active linters, split them across the job array, and only run the slice assigned to the current job's index.

For my particular use case, there's no need to unify the results at the end, as each copy of the job will produce its own report and GitLab will display them together at the end.

I could definitely see a use case for a separate generic function to join the reports together, but I'd leave gathering them up and calling that "join" function to the user. Something like python megalinter.reports.join --output-file=my-combined-report.xml my-report-*.xml

Kurt-von-Laven commented 1 year ago

One of the main challenges I see here is that there may be a lot of inefficiency unless we compute reasonably accurate linter weights, which will vary widely between repositories. This implies calculating linter weights on the fly, storing them somewhere, and hence making MegaLinter stateful, which makes testing, QA, and debugging more complex. That also implies MegaLinter knowing how to store those weights on every supported CI system. A simpler approach may be to add a hard-coded boolean property to some linters (e.g., SLOW: true). Then, when asked to divvy up jobs, the fast linters combined would be assumed to have the same speed as a single slow linter. My hypothesis is that this would allow a standard bin-packing algorithm to divvy up the jobs quite well without as much complexity.
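A toy sketch of that weighting, assuming a hypothetical slow boolean on each linter (the attribute name and the exact weights are made up for illustration):

def linter_weights(active_linters):
    # every slow linter counts as 1 unit; all fast linters together also count
    # as roughly 1 unit, so a bin-packing algorithm naturally spreads the slow
    # ones across jobs without needing per-repository timing data
    slow = [l for l in active_linters if getattr(l, "slow", False)]  # hypothetical flag
    fast = [l for l in active_linters if not getattr(l, "slow", False)]
    weights = {l.name: 1.0 for l in slow}
    if fast:
        weights.update({l.name: 1.0 / len(fast) for l in fast})
    return weights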

bstrdsmkr commented 1 year ago

For my two cents, the sorting doesn't matter much to me at all; I just saw that it was there and figured it was as good an option as any. Entirely unsorted works more or less the same for me, as I'm more concerned about being able to automatically divvy them up based on the CI-provided facilities (vs. manually tweaking and guessing about which linters to group into which job).

On that note, is the linter_speed attribute not already populated? I stole the idea from here: https://github.com/oxsecurity/megalinter/blob/main/megalinter/linter_factory.py#L145

nvuillam commented 1 year ago

The sorting is indeed already done using linter_speed: the slower linters are run first, in order to optimize the total time when running in parallel :)

Kurt-von-Laven commented 1 year ago

Oh, nice! Then we can just use a standard bin-packing algorithm with linter_speed for the weights, and I think this is a reasonable feature proposal, although, full disclosure, I can't see myself picking it up for quite a long time, so it may end up in the community self-serve bucket, particularly as it is not very valuable to small projects. We are always happy to support folks who decide to take on a PR though. @nvuillam, how do we populate linter_speed, out of curiosity?

nvuillam commented 1 year ago

@Kurt-von-Laven just from observation of using MegaLinter on my repos :p and when it is not populated, it's medium by default :)

bstrdsmkr commented 1 year ago

How would you feel about a PR with my initial naive approach? If you assume the weights are roughly accurate, the slowest should get sent to different instances first (assuming I got the ascending/descending directionality right).

Then a future pass might consider proper bin packing and optimization.

This gives me another idea though... (As a separate change of course) maybe we could record actual run times on a given repo and load them in for sorting at run time. The easiest on users would be loading a CSV or list and doing the math at runtime, but maybe the easiest to maintain would be an env var/config option per linter, something like MEGALINTER_$LINTERNAME_RUN_TIME that expects a number of seconds. Over time, users could measure run time and build up a pretty accurate estimate for their conditions.

Maybe in the future, there's a public collection of runtime telemetry users could opt into contributing to automatically?

Further bonus points if Megalinter has an option to output that data in a form it can load later.

Just spitballing ideas :)

My open source contribution time is unfortunately much more limited than I'd like, but if you think my initial approach is good for a first pass, I might be able to knock out a PR

echoix commented 1 year ago

This gives me another idea though... (As a separate change of course) maybe we could record actual run times on a given repo and load them in for sorting at run time. The easiest on users would be loading a CSV or list and doing the math at runtime, but maybe the easiest to maintain would be an env var/config option per linter, something like MEGALINTER_$LINTERNAME_RUN_TIME that expects a number of seconds. Over time, users could measure run time and build up a pretty accurate estimate for their conditions.

My two cents here would be maybe something more generic, like weights so we (or advanced users) could still have some flexibility later on. Here it's specialized for only runtime. With an implementation I expect these users to find something that works better for them.

nvuillam commented 1 year ago

How would you feel about a PR with my initial naive approach?

I'm not sure I understood everything, but if it can bring community-recognized improvements to MegaLinter, you can give it a try :)

Further bonus points if Megalinter has an option to output that data in a form it can load later.

Maybe JSON reporter? https://megalinter.io/beta/reporters/JsonReporter/

Maybe in the future, there's a public collection of runtime telemetry users could opt into contributing to automatically?

The choice has been made to not track users, and I don't plan to change that, sorry ^^

bstrdsmkr commented 1 year ago

My two cents here would be maybe something more generic, like weights so we (or advanced users) could still have some flexibility later on. Here it's specialized for only runtime. With an implementation I expect these users to find something that works better for them.

ooohh... I hadn't considered "not-time-based" weights but that makes a ton of sense for extensibility. I can't articulate it, but I feel like there's a case in there somewhere for sorting lexically? Like setting the "weight" to a machine name or something? I suppose if we took care to just not assume what the "weight" indicated, it wouldn't matter so long as it was something sorted() understood...

What's not clear to me at the moment is how end users can set the "weight" of a linter. Is there some existing facility for users to set the linter speed? If not, my first thought is to look for an env var like MEGALINTER_${LINTER_NAME}_WEIGHT and fall back to .linter_speed if not found.
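If it helps to picture it, a hypothetical lookup along those lines (MEGALINTER_${LINTER_NAME}_WEIGHT is the proposed override, not an existing option; linter_speed is the existing fallback):

import os


def linter_weight(linter):
    # proposed (hypothetical) per-linter override via environment variable,
    # falling back to the existing linter_speed rating
    value = os.environ.get(f"MEGALINTER_{linter.name}_WEIGHT")
    return float(value) if value is not None else linter.linter_speed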

Maybe JSON reporter? https://megalinter.io/beta/reporters/JsonReporter/

oh that's great! Somehow I'd glossed over that before. My first thought was MEGALINTER_WEIGHTS_FROM_JSON=/path/to/some.json that expects the listed file to have a json path like linters['name'==$LINTER_NAME].elapsed_time_s so that it would be compatible with the JsonReporter output as is, but then that locks in the format and feels a little weird if a user wants weight to be something that isn't "elapsed time in seconds." That also doesn't lend itself well to averaging over time in the numeric case (no way to tell how many runs went into the calculation of the average in order to add in the new data point).

It feels like a dedicated "run speed" report would be the best answer. To keep with what already exists, JSON is a natural format but isn't the most ergonomic for a user to stitch together for a set of runs (each with a unique set of linters run). YAML, being a superset of JSON, seems to lend itself here. How do you feel about a YAML report? When used as input for sorting, it could look something like:

linter1: 45
linter2: 10
linter3: 0

Unspecified linters would default to Python None so that we don't make any assumptions about the type of the weight and just let sorted() decide what to do with it. For the simple case of stitching non-overlapping (i.e., no linters in common) reports together, the reports can just be concatenated.
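As a sketch of consuming such a report (assuming PyYAML and the flat mapping above; here missing or null weights simply sort last rather than relying on sorted() to compare None with numbers):

import yaml  # PyYAML, assumed available


def load_weights(path):
    """Load the proposed flat 'linter name -> weight' YAML report."""
    with open(path) as f:
        return yaml.safe_load(f) or {}


def order_linters(linter_names, weights):
    """Slowest (highest weight) first; unknown or None weights go last."""
    return sorted(
        linter_names,
        key=lambda name: (weights.get(name) is None, -(weights.get(name) or 0)),
    )

With the example above, linter3 (weight 0) would still sort ahead of any linter missing from the report.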

The downside is that users interested in averaging the runtime over a number of runs would have to track the number of runs separately. By sacrificing the determinism of the report's schema, we could make it easy for users to track that by making the values maps (dictionaries) instead of scalars and passing all other fields through untouched to the output. So with input:

linter1:
  weight: 45
  foo: bar
  runs: 5
linter2:
  weight: 12
  baz: quux

We might get output like:

linter1:
  elapsed_time_s: 44
  weight: 45
  foo: bar
  runs: 5
linter2:
  elapsed_time_s: 15
  weight: 12
  baz: quux

Then the user can cat all the reports together easily and pipe them through something like yq to calculate a new average from the weight and runs fields they passed in and the elapsed_time_s field generated by the report.
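The arithmetic behind that averaging is just a running mean; as a worked example (the field names follow the sketch above, not an existing report schema):

def merge_run(previous, elapsed_time_s):
    # fold one new measurement into the user-tracked running average;
    # `previous` holds the user's 'weight' (current average) and 'runs' fields
    runs = previous.get("runs", 0)
    weight = previous.get("weight", 0)
    new_runs = runs + 1
    new_weight = (weight * runs + elapsed_time_s) / new_runs
    return {**previous, "weight": new_weight, "runs": new_runs}


# e.g. linter1 above: (45 * 5 + 44) / 6 ≈ 44.83 averaged over 6 runs
print(merge_run({"weight": 45, "foo": "bar", "runs": 5}, 44))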

Maybe in the future, there's a public collection of runtime telemetry users could opt into contributing to automatically?

The choice has been made to not track users, and I don't plan to change that, sorry ^^

I 100% understand (and appreciate!) that decision. I did want to clarify though that my thought was something akin to a public Prometheus instance that users could send metrics to (by opt-in of course). So what would be sent would be something like:

linter1: 300
linter2: 75

So nothing identifying, but I'm also not sure how useful that would really be if runtimes are going to vary a lot across individual code bases... (not to mention the potential for bad actors to send garbage data).

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

bstrdsmkr commented 1 year ago

Haven't forgotten this, draft PR in the works, just haven't had time to finish it up yet 😁

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.