Improve scan runtime measured in `identifiable_scanning_and_validation_perf_benchmark` by ~20%

jdraaijer-microsoft commented 3 months ago

I saw some opportunity to microbenchmark and somewhat improve this code.

Reported perf goes from ~150ns per iteration to ~120ns per iteration on my machine, in release mode.

I've tried to keep the API the same. It may be worthwhile to expose some additional parts of the code to make reuse of the ScanData and/or ScanState easier in downstream projects, but I've decided not to do any of that in this PR.

~~I seem to also have applied rustfmt to the file. Hope that that's OK.~~

jdraaijer-microsoft commented 3 months ago

@microsoft-github-policy-service agree company="Microsoft"

jdraaijer-microsoft commented 3 months ago

Oh, seems like there is a test failing. Let me fix that

michaelcfanning commented 3 months ago

So I believe Beau, after reviewing your PR, has extracted a fix for the benchmark issue which is now applied. Looks like we failed to reset the scanner object between iterations, with the result that we accumulated state in testing that resulted in a slowdown. It's good to fix this!

If you'd like to submit your formatting changes separately, please go ahead. @beaubelgrave , @suvamM , is reformatting something you'd like to see happen? Or do you prefer the style we currently conform to?

beaubelgrave commented 3 months ago

There's some assumptions here in these changes that need to be addressed.

The Scan struct can be re-used centrally via a call to reset(), however, yes it requires a mut instance and will therefore limit what you can do in parallel with the same object. This was missed in the original benchmark and is now fixed.
The benchmark you are running is likely a non-real world example of real data, as it always contains secrets. In the real world the vast amount of data we scan won't have secrets. This lead you to conclude removing a check I added explicitly for performance in https://github.com/microsoft/security-utilities/pull/65/commits/e2f679fc4f7478a0433f63d228fc60ec9ea46fe3. That check is very useful for performance in the real world dataset according to my benchmark runs via the C# interop.

I agree that likely we want a way to share the one-time settings and char lookups. However, I don't think the amount of refactoring is needed that's in here, and I would hope we can get better commit messages that describe exactly what you are changing and why.

And please, do not submit code mixed with rustfmt as it's very hard to review. With a refactor like this, you really don't want to miss a critical check that leads to missing signatures of secrets (like the internal reset() call, which would subtly hide things and I would have expected caused almost all the tests to fail though, those got changed as well).

beaubelgrave commented 3 months ago

I have an alternative pull request here which separates out state as well as makes the Scan struct have concurrent parsing ability with an immutable reference. This allows many threads to share the same single reference with just small individual state objects.

Concurrency Pull

jdraaijer-microsoft commented 3 months ago

I have an alternative pull request here which separates out state as well as makes the Scan struct have concurrent parsing ability with an immutable reference. This allows many threads to share the same single reference with just small individual state objects.

Concurrency Pull

This PR is definitely a smaller change achieving the same (main) goal of reuse, so merging this PR for that reason is not so useful anymore. If you think it makes sense, I can close this PR and retry the perf things I (thought) I found on top of main when it is merged :)

The perf improvement I observe (in the benchmark only, I'm not sure how to run the interop) still holds, even with the benchmark fix.

The benchmark you are running is likely a non-real world example of real data, as it always contains secrets. In the real world the vast amount of data we scan won't have secrets. This lead you to conclude removing a check I added explicitly for performance in e2f679f. That check is very useful for performance in the real world dataset according to my benchmark runs via the C# interop.

I would be very interested to see how to run this test! More insights would be cool, and Rust <-> C# interop sounds like something I would really like to explore. I agree, the benchmark is rather artificial. I will try to add a case "without secret" to see how much of a difference it makes. Edit: welp, added the test case and it is not very useful :) I guess emulating real life is easier said than done.

I agree that likely we want a way to share the one-time settings and char lookups. However, I don't think the amount of refactoring is needed that's in here, and I would hope we can get better commit messages that describe exactly what you are changing and why.

I will rebase my changes, stripping out the "adding concurrency" stuff and updating the commit messages once your PR is merged. I agree that better commit messages is a must.

And please, do not submit code mixed with rustfmt as it's very hard to review. With a refactor like this, you really don't want to miss a critical check that leads to missing signatures of secrets (like the internal reset() call, which would subtly hide things and I would have expected caused almost all the tests to fail though, those got changed as well).

I undid the rustfmt I had applied sometime on Wednesday evening, but may not be obvious due to the large amount of changes that I did add. Sorry about that.

beaubelgrave commented 3 months ago

I have an alternative pull request here which separates out state as well as makes the Scan struct have concurrent parsing ability with an immutable reference. This allows many threads to share the same single reference with just small individual state objects. Concurrency Pull

This PR is definitely a smaller change achieving the same (main) goal of reuse, so merging this PR for that reason is not so useful anymore. If you think it makes sense, I can close this PR and retry the perf things I (thought) I found on top of main when it is merged :)

If you can get good performance gains, that is always appreciated. This is especially true if it doesn't come with API/large refactors.

The perf improvement I observe (in the benchmark only, I'm not sure how to run the interop) still holds, even with the benchmark fix.

The benchmark you are running is likely a non-real world example of real data, as it always contains secrets. In the real world the vast amount of data we scan won't have secrets. This lead you to conclude removing a check I added explicitly for performance in e2f679f. That check is very useful for performance in the real world dataset according to my benchmark runs via the C# interop.

I would be very interested to see how to run this test! More insights would be cool, and Rust <-> C# interop sounds like something I would really like to explore. I agree, the benchmark is rather artificial. I will try to add a case "without secret" to see how much of a difference it makes. Edit: welp, added the test case and it is not very useful :) I guess emulating real life is easier said than done.

We should probably have a clear benchmark using [bench] instead of [test] that runs a good set of tests we can use for future discussions like this. It's hard to know if things improve or not unless you really look at the assembly output of LLVM and see it's performance in the common case (no secrets). We also have to balance if something comes in with massive false positive secrets, it doesn't cause the scanner to do silly stuff.

I agree that likely we want a way to share the one-time settings and char lookups. However, I don't think the amount of refactoring is needed that's in here, and I would hope we can get better commit messages that describe exactly what you are changing and why.

I will rebase my changes, stripping out the "adding concurrency" stuff and updating the commit messages once your PR is merged. I agree that better commit messages is a must.

And please, do not submit code mixed with rustfmt as it's very hard to review. With a refactor like this, you really don't want to miss a critical check that leads to missing signatures of secrets (like the internal reset() call, which would subtly hide things and I would have expected caused almost all the tests to fail though, those got changed as well).

I undid the rustfmt I had applied sometime on Wednesday evening, but may not be obvious due to the large amount of changes that I did add. Sorry about that.

Yeah, it's just hard to validate nothing actually changed functionally when rustfmt is invoked. I think we can validate this by having rust generate the assembly output before rustfmt and then the same afterwards and compare the assembly. That way we would know if something got tripped up during that process (that would be a cool hackathon project, especially in light of security).

suvamM commented 3 months ago

Closing this PR since @beaubelgrave made the corresponding changes.

microsoft / security-utilities

Improve scan runtime measured in `identifiable_scanning_and_validation_perf_benchmark` by ~20% #65