mandiant / macos-UnifiedLogs

Apache License 2.0

Lowering the memory footprint of parse results #14

Open mrguiman opened 6 months ago

mrguiman commented 6 months ago

Hello and thank you for the amazing work on the library.

We're working on an iOS forensics app and are currently prototyping code that will let our researchers pull specific, pinpointed events from UnifiedLogs. We based our code on one of your examples, so it currently revolves around calling the build_log method to assemble LogData and "send" the results to be consumed by other parts of our code. In that context, and because we want our program to be as portable as possible, we're trying to keep the memory footprint as low as possible, and we would therefore like to work with iterators rather than allocating an entire Vec of LogData, which can be quite heavy.

We were thinking of breaking down the build_logs method so that it builds its vectors from the results of a new iter_logs function, which itself would return an iterator.
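In rough Rust, the shape we have in mind is something like the sketch below. All type definitions here are placeholders standing in for the library's real LogData/UnifiedLogData, purely to illustrate the iterator-vs-Vec split:

```rust
// Placeholder stand-ins for the library's real types, for illustration only.
struct LogData {
    message: String,
}

struct UnifiedLogData {
    raw_entries: Vec<String>,
}

// Hypothetical iter_logs: yields LogData lazily, one entry at a time,
// instead of collecting everything into a Vec up front.
fn iter_logs(data: UnifiedLogData) -> impl Iterator<Item = LogData> {
    data.raw_entries
        .into_iter()
        .map(|raw| LogData { message: raw })
}

// build_log could then just collect the iterator for callers that still
// want the current Vec-based behavior.
fn build_log(data: UnifiedLogData) -> Vec<LogData> {
    iter_logs(data).collect()
}

fn main() {
    let data = UnifiedLogData {
        raw_entries: vec!["entry a".into(), "entry b".into()],
    };
    // Consumers can stream entries without holding the whole Vec in memory.
    let count = iter_logs(data).count();
    assert_eq!(count, 2);
}
```

That way the existing Vec-returning API stays intact as a thin wrapper over the iterator.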

Are there any footguns you can think of that we should worry about, and would you be open to us contributing such changes to the main codebase?

puffyCid commented 6 months ago

Hey @mrguiman, thanks for opening this issue. I think creating an iter_logs function is a great idea. I would definitely be interested in any memory improvements you or your team could make.

jrouaix commented 6 months ago

@puffyCid FYI, we tried to go a little further with granular iteration, but memory consumption didn't improve: https://github.com/shindan-io/macos-UnifiedLogs/commit/62dfb2203c44aae570a9c2c39f10ac5b5451ccb3

In my opinion, a really great improvement would be to iterate over the file itself directly, at the nom parser level.

We can try, but it could be a BIG rework of the existing code (even trying to be as conservative as possible).

Would you be open to such a thing? If yes, we could motivate ourselves toward this goal.

puffyCid commented 6 months ago

@jrouaix that does sound like quite a large rewrite.

I think the primary source of memory usage is that after reading the tracev3 file (~10.5 MB), the library iterates through all the compressed chunks in the file and then returns the data needed for the build_log function.

I think the decompressed data is probably the main reason for the memory usage.

What if, instead of decompressing all chunks in the parse_unified_log function, it decompressed and parsed one chunk at a time, returning that chunk's data plus a pointer/offset to the next chunk?

So instead of `parse_unified_log(data: &[u8]) -> nom::IResult<&[u8], UnifiedLogData>`

it would be something like `parse_unified_log(data: File, offset: usize) -> nom::IResult<&[u8], UnifiedLogData>`. The main thing to add would be something to track the next offset.
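A minimal sketch of that chunk-at-a-time shape. The framing here (a 4-byte little-endian length prefix per chunk) is invented purely for illustration and is not the real tracev3 layout; the point is just the "parse one chunk, hand back the next offset" contract:

```rust
use std::io::{Read, Seek, SeekFrom};

// Hypothetical chunk-at-a-time parse: seek to `offset`, read one chunk,
// and return (chunk bytes, offset of the next chunk). None means EOF.
// The 4-byte length header is a stand-in for the real chunk framing.
fn parse_one_chunk<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
) -> std::io::Result<(Vec<u8>, Option<u64>)> {
    reader.seek(SeekFrom::Start(offset))?;

    let mut len_buf = [0u8; 4];
    if reader.read_exact(&mut len_buf).is_err() {
        // No more header bytes: end of file, no more chunks.
        return Ok((Vec::new(), None));
    }
    let len = u32::from_le_bytes(len_buf) as usize;

    let mut chunk = vec![0u8; len];
    reader.read_exact(&mut chunk)?;

    // The next chunk starts right after this header + body.
    let next = offset + 4 + len as u64;
    Ok((chunk, Some(next)))
}

fn main() -> std::io::Result<()> {
    use std::io::Cursor;
    // Two fake chunks: [len=2]["hi"] followed by [len=1]["x"].
    let mut buf = Cursor::new(vec![2, 0, 0, 0, b'h', b'i', 1, 0, 0, 0, b'x']);

    let (chunk, next) = parse_one_chunk(&mut buf, 0)?;
    assert_eq!(chunk, b"hi");

    let (chunk2, _) = parse_one_chunk(&mut buf, next.unwrap())?;
    assert_eq!(chunk2, b"x");
    Ok(())
}
```

Only one decompressed chunk is ever resident at a time; the caller loops until the returned offset is None.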

Thoughts on perhaps that approach?

mrguiman commented 3 months ago

Sorry @puffyCid, it has been a while; we've been busy building our apps! There's a first pull request that was created here; do you want to hold off on it until we go further (as discussed above)?

puffyCid commented 3 months ago

no worries @mrguiman!

There are a few merge conflicts between the PR and the main branch. If you could update/sync the PR to the latest version, I would be fine with merging it.

For further reducing memory usage, I think there are two approaches:

  1. As @jrouaix mentioned, start the iterator at the nom parser level.
  2. Parse only one chunk of the log file at a time (instead of all chunks, as is done currently), and keep track of the offset to the next chunk.

Both would be pretty large overhauls (option 1 being the larger). I think option 1 makes the most sense and keeps the public API/functions easier to use.
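As a rough sketch of what option 1 could look like: an Iterator that pulls chunks straight from the reader, so only one chunk is in memory at a time. The fixed 8-byte record size is invented for illustration, not the real chunk framing:

```rust
use std::io::Read;

// Hypothetical parser-level iterator: wraps any reader and yields one raw
// chunk per call to next(). Real code would run the nom chunk parser here;
// the fixed 8-byte records are a stand-in for the actual framing.
struct ChunkIter<R: Read> {
    reader: R,
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        let mut buf = [0u8; 8];
        match self.reader.read_exact(&mut buf) {
            Ok(()) => Some(buf.to_vec()), // one chunk resident at a time
            Err(_) => None,               // EOF (or short read) ends iteration
        }
    }
}

fn main() {
    use std::io::Cursor;
    // Three fake 8-byte records.
    let data = Cursor::new(vec![0u8; 24]);
    let iter = ChunkIter { reader: data };
    assert_eq!(iter.count(), 3);
}
```

Because it is just an Iterator, the existing Vec-based public functions could stay as thin `.collect()` wrappers on top of it.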

If you guys want to take a stab at it, that would be great. Otherwise, if you're too busy with other projects, no worries.

Time permitting, I can try to implement it, with the goal of having it merged sometime in September (hopefully).

jrouaix commented 3 months ago

I agree that option 1 makes the most sense; it could really flatten the memory footprint (and we can build more iterators on top of this low-level one).

It's a big, difficult refactoring. We won't be able to dig into anything until late August, so if you want to take the lead, feel free. If not, we'll let you know when/if we deep-dive into this.