hedronvision / bazel-compile-commands-extractor

Goal: Enable awesome tooling for Bazel users of the C language family.
Other
662 stars 108 forks source link

perf: use orjson for writing compile_commands #118

Closed aminya closed 8 months ago

aminya commented 1 year ago

orjson is much faster than the standard library's json module (1.9 seconds vs 6.6 seconds for a ~140 MB file).

Benchmarks:

orjson: 50415 function calls (50331 primitive calls) in 1.919 seconds

json: 21663390 function calls (18606459 primitive calls) in 6.588 seconds

json no-indent: 21660892 function calls (18603961 primitive calls) in 6.437 seconds

aminya commented 1 year ago

I could not replace the line where json.loads uses object_hook as it is not supported by orjson. https://github.com/ijl/orjson#will-it-deserialize-to-dataclasses-uuids-decimals-etc-or-support-object_hook

cpsauer commented 1 year ago

Hey, @aminya! Thanks for looking to contribute, leaving things even better than you found them. I'm definitely open to big, easy performance wins :)

The original json stuff had just been for simplicity, since it's built in, and I was thinking that the runtime was going to be dominated by header finding. Since then we've added the caching you're speeding up, so that's probably no longer true.

Just to double check: Those benchmark numbers are for the overall runtime of this tool, right? (Hitting cache on a subsequent run.) If so, wow!

What do you think about having bazel auto-install (with pip_parse, I think they call it now)? That way most people will actually get the benefits, and it'll be a little simpler to maintain things, since there'll be just one code path. If you think that's good, too, I'd love it if you'd switch us over to that.

Couple other things:

Thanks! Chris

aminya commented 1 year ago

Hey, @aminya! Thanks for looking to contribute, leaving things even better than you found them. I'm definitely open to big, easy performance wins :)

Thank you! I would be happy to contribute.

The original json stuff had just been for simplicity, since it's built in, and I was thinking that the runtime was going to be dominated by header finding. Since then we've added the caching you're speeding up, so that's probably no longer true.

Just to double check: Those benchmark numbers are for the overall runtime of this tool, right? (Hitting cache on a subsequent run.) If so, wow!

Yes, that was a refreshing benchmark after hitting the cache. Since I run this command as part of my build command every time, my whole build process is faster!

What do you think about having bazel auto-install (with pip_parse, I think they call it now)? That way most people will actually get the benefits, and it'll be a little simpler to maintain things, since there'll be just one code path. If you think that's good, too, I'd love it if you'd switch us over to that.

That would be great. I am not familiar with Bazel tooling for Python, but I am pretty sure it can be done if we want to. That could allow us to potentially make more optimizations as well.

Couple other things: Re object_hook: We could strip that out and go to string-key access--and probably should if we're switching over to orjson wholesale. What do you think?

Yeah, I did not know how I can replace this line of the code. If you know, it would be helpful.

Let's have orjson pretty-print if it doesn't already! Human readability has been a huge win for issue reporting and debugging for people.

I think pretty printing should be added behind an option as it has some performance penalties. I don't really look at compile_commands.json, and if I want to, I just use VsCode's autoformatting feature inside the IDE to un-minify the code. The formatting works for my huge files with no issues (~130 MB). But I can see when some people might need it.

cpsauer commented 1 year ago

Sweet! Love the wins. Thanks again for your willingness to contribute when you spotted an improvement opportunity and for benchmarking as you worked on perf.

I'll knock out weaving in the import of rules_python and merge if you'll knock out the other items (above, and expanded below).

(Configuring transitive deps in the WORKSPACE & setting up renovate auto update is a bit tricky if you haven't done it before, and I just did it for https://github.com/hedronvision/bazel-make-cc-https-easy, so I think I have comparative advantage there.)

In the meantime, just import rules_python from their WORKSPACE snippet to unblock yourself, and then I'd love it if on your end you'd:

^ These are just suggestions. If you notice better ways of doing things or more things that need doing, please just do 'em. And let me know if you need setup help or unblocking of any other kind.

Can't wait to get you on the contributors list :) Chris

cpsauer commented 1 year ago

Heads that I just landed some changes under you to fix a different issue. Shouldn't cause issues though, I don't think. Just heads up that there's no longer any need to undo any of the main() stuff you added.

cpsauer commented 1 year ago

Hey, Amin! Just checking in on how things are going. If stuck, that's ok--just lmk.

aminya commented 1 year ago

@cpsauer I tried using the Bazel stuff for pip and python. However, I was not successful. TBH, I haven't used Bazel for any Python projects, and I don't find it straightforward to implement. https://github.com/bazelbuild/rules_python/

garymm commented 1 year ago

I spent about an hour trying to use rules_python to install orjson but I don't see how it's possible do it entirely in the hedron_compile_commands_setup macro. It seems to requires load statements that have to be executed at the top level, but some of those load statements are loading things from external repos (i.e. rules_python). Is there some way around this?

i.e. where can we put the code that bottoms out in a call to whl_library? That requires having rules_python available first

load("@rules_python//python/pip_install:pip_repository.bzl", "whl_library")

Maybe this can be worked around with bzlmod?

Anyways here is my attempt, feel free to give me suggestions or use it: https://github.com/aminya/bazel-compile-commands-extractor/pull/1/files

cpsauer commented 1 year ago

For now just assume you've got rules_python--just put it above hedron_compile_commands_setup in your workspace. (As above, I'm happy to take care of that setup wrappings problem; I've done it before. I'll unfortunately require a second tier of workspace macros (as in the linked example above). But again, happy to take care of the setup wrappings since I've done them before if y'all do the inner implementation.

Thanks again! Chris

garymm commented 1 year ago

OK well it works if I can put arbitrary stuff into the WORKSPACE, see my PR: https://github.com/aminya/bazel-compile-commands-extractor/pull/1/files

aminya commented 1 year ago

@garymm I merged your PR, however, I get this error when testing:

ERROR: Failed to load Starlark extension '@rules_python//python/pip_install:pip_repository.bzl'.
Cycle in the workspace file detected. This indicates that a repository is used prior to being defined.
The following chain of repository dependencies lead to the missing definition.
 - @rules_python
This could either mean you have to add the '@rules_python' repository with a statement like `http_archive` in your WORKSPACE file (note that transitive dependencies are not added automatically), or move an existing definition earlier in your WORKSPACE file.
ERROR: Error computing the main repository mapping: cycles detected during computation of main repo mapping

To be honest, I find the Bazel setup overly complicated. Maybe we could just add a shell script that installs orjson globally for the user and be done with that.

aminya commented 1 year ago

I realized I have to add the following to my project's Workspace before I can use hedron_compile_commands_setup. Can this requirement be handled automatically?

BAZEL_SKYLIB_VERSION = "1.4.2"

http_archive(
    name = "bazel_skylib",
    sha256 = "66ffd9315665bfaafc96b52278f57c7e2dd09f5ede279ea6d39b2be471e7e3aa",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/bazel-skylib/releases/download/{0}/bazel-skylib-{0}.tar.gz".format(BAZEL_SKYLIB_VERSION),
        "https://github.com/bazelbuild/bazel-skylib/releases/download/{0}/bazel-skylib-{0}.tar.gz".format(BAZEL_SKYLIB_VERSION),
    ],
)

http_archive(
    name = "rules_python",
    sha256 = "84aec9e21cc56fbc7f1335035a71c850d1b9b5cc6ff497306f84cced9a769841",
    strip_prefix = "rules_python-0.23.1",
    url = "https://github.com/bazelbuild/rules_python/releases/download/0.23.1/rules_python-0.23.1.tar.gz",
)
aminya commented 1 year ago

@cpsauer @garymm Should I revert to the version where it optionally needed orjson or can we fix the installation?

aminya commented 1 year ago

Since it was not possible to easily install orjson, I reverted the code to the optional orjson usage with a note in the README. @cpsauer Could you please marge this? I believe automating the installation is out of the scope of this pull request and should be done separately later.

aminya commented 11 months ago

@cpsauer A friendly reminder in case you missed my comment https://github.com/hedronvision/bazel-compile-commands-extractor/pull/118#issuecomment-1707119002

aminya commented 8 months ago

I rebased this branch to be up to date with the recent changes.

There's a backup of the old branch here: https://github.com/aminya/bazel-compile-commands-extractor/tree/perf-backup

cpsauer commented 8 months ago

Hey guys! I've merged this in. (I know it says closed; had to to a bunch of side work, but I drew upon your work.) See the code in https://github.com/hedronvision/bazel-compile-commands-extractor/commit/0e5b1aa26d87a431d2a52676d0b9ce469448ee54

We're now automatically using hermetic python and installing orjson! I will say; I'm not actually seeing meaningful speedups on my machine, even with json files that are about the same size. Please give this a whirl and confirm that it does indeed speed things up for you.

Anyone else seeing an error like: Unable to find package for @hedron_compile_commands_pip//:requirements.bzl: The repository '@hedron_compile_commands_pip' could not be resolved: Repository '@hedron_compile_commands_pip' is not defined. Grab the new WORKSPACE snippet from the README--or switch to the bzlmod one. Sorry for the breaking change; I'd tried to design in adaptability, but Bazel's WORKSPACE system has it's limits.

aminya commented 8 months ago

Thanks @cpsauer!

I'll benchmark it again and will let you know. I added beautification back that could slow down the JSON dump. Didn't benchmark after that.

cpsauer commented 7 months ago

As a heads, rules_python is causing enough issues that I think it's likely we'll likely need to revert.

Was orjson indeed important to you, @aminya, showing up in benchmarking? By default would probably revert that too, since it didn't seem to speed things up in my tests.

cpsauer commented 7 months ago

Boo, okay, we had to revert hermetic rules_python and orjson in https://github.com/hedronvision/bazel-compile-commands-extractor/commit/0b821b7e4286aec887757461366f6eaaa0972cb9, because there were lots of issues :( Tracking restoration in https://github.com/hedronvision/bazel-compile-commands-extractor/issues/168.