Rust/His: Allow for concurrent scans

beaubelgrave commented 3 months ago

Today the existing Scan struct tracks the options to be used as well as the current state of the stream. This is useful when doing many streams of data on a per-core basis. It is not as useful if you are using many threads.

When many threads are being used, it's more useful to have a single immutable reference that can be passed small mutable states that can individually be reset or thrown away.

Add ScanState struct that solely keeps track of the state of a given stream of data.

Add concurrent_parse_bytes() and concurrent_parse_reader() functions that take a ScanState mutable reference while operating on the immutable Scan struct. This allows for many threads to share the same Scan struct.

Move the internal scanning state within the Scan struct to a RefCell ScanState. This allows us to keep our API the same and not break our FFI layers while allowing for immutable changes to the byte scan.

Move the various byte scan logic methods to immutable self references that take in a mutable ScanState reference.

michaelcfanning commented 3 months ago

We probably should start investing in a change log, such as we have in ReleaseHistory.md for the C# code. I will look at the C# implications for your change tomorrow, but we can merge this if you like.

jdraaijer-microsoft commented 3 months ago

I have a stylistic/bikeshedding question, as I'm wondering (but feel free to disregard):

I think having both the concurrent and non-concurrent functions on Scan creates some API disparity, as one the non-concurrent ones update some internal state on Scan, while the concurrent ones do not. It's obvious from the function signatures, but not super necessary. Is there a specific reason why you've decided against creating an additional "scan configuration" struct (besides the fact that it is less code to change) that would be responsible for the concurrent functions?

Additionally, storing the existing PossibleMatches outside of the ScanState struct could let you avoid the RefCell alltogether, and give a caller more freedom w.r.t. what they want to do with the existing matches (without having to reset the entire state).

beaubelgrave commented 3 months ago

I have a stylistic/bikeshedding question, as I'm wondering (but feel free to disregard):

Sure.

I think having both the concurrent and non-concurrent functions on Scan creates some API disparity, as one the non-concurrent ones update some internal state on Scan, while the concurrent ones do not. It's obvious from the function signatures, but not super necessary. Is there a specific reason why you've decided against creating an additional "scan configuration" struct (besides the fact that it is less code to change) that would be responsible for the concurrent functions?

The main reason is we already have a release, and that includes an FFI that assumes certain things. It's possible to move everything to the concurrent_* versions. The safest route, IMHO, was to make two different versions of the API, keep the FFI using the existing pattern, and let newer code choose if they want that level of management or not.

This would be like how the Read trait in Rust kind of evolved. You have the basic read(), then you have read_exact(), read_to_end() etc, that leverage that single read() base call. And like the Seek trait, where the objects typically store their own offset, but sometimes you can read directly without the Seek part (IE: Memory based reader).

Additionally, storing the existing PossibleMatches outside of the ScanState struct could let you avoid the RefCell alltogether, and give a caller more freedom w.r.t. what they want to do with the existing matches (without having to reset the entire state).

RefCell overhead is minimal and only when using the stream processor mode, when concurrent is used there is no overhead of the RefCell (other than its mere existence in the Scan struct). Many scenarios I know about require to know all the possible matches for redactions or auditing. I've not come across anything where you'd really want to split these out. If we do come across that need, a simple take_possible_matches() can be added to the ScanState.

Hope this helps clarify at least where I'm coming from regarding this.

microsoft / security-utilities

Rust/His: Allow for concurrent scans #68