Open mrguiman opened 6 months ago
hey @mrguiman thanks for opening this issue. I think creating a iter_logs
function is a great idea.
I would definetly be interested in any memory improvemnts you or your team could do.
@puffyCid FYI, we try to go a little further on granular iteration, but memory consumption didn't improve : https://github.com/shindan-io/macos-UnifiedLogs/commit/62dfb2203c44aae570a9c2c39f10ac5b5451ccb3
I my opinion a really great improvement would be to iterate straight away on a file itself, at nom
parser level.
We can try, but it could be a BIG rework of the existing code (even trying to be as conservative as possible).
Would you be open to such a thing ? if yes, we could motivate ourself into this goal.
@jrouaix that does sound like quite a large rewrite.
I think the primary source of the memory usage is after reading the tracev3 file (~10.5 MBs) the library iterates through all the compressed chunks in the file and then returns the data needed for build_log
function.
I think the decompressed data is probably going to be the reason for memory usage?
What if instead of decompressing all chunks in the parse_unified_log
function it just decompresses and parses one chunk and returns the data and pointer/offset to next chunk?
so instead of parse_unified_log(data: &[u8]) -> nom::IResult<&[u8], UnifiedLogData>
it would be something like parse_unified_log(data: File, offset: usize) -> nom::IResult<&[u8], UnifiedLogData>
The main thing to add would be something to track the next offset.
Thoughts on perhaps that approach?
Sorry @puffyCid it has been a while, we've been busy building our apps ! There's a first pull requests that was created here, do you want to hold off on it until we go further (as discussed above)?
no worries @mrguiman!
There is a few merge conflicts between the PR and main branch. If you could update/sync the PR to the latest version. I would be fine with committing.
For further reducing memory usage i think there are two approaches:
Both would be pretty large overhauls (1. being the largest). I think option 1. makes the most sense and keeps the public API/functions easier to use.
If you guys want to take a stab at it that would be great. Otherwise, if your too busy with other projects, no worries.
Time permitting I can try to implement it, with the goal having it merged sometime in September (hopefully)
I agree the option 1 makes the more sense, could really flatten the memory footprint (and we can build more iterators over this low level one).
It's a big refactoring, and difficult. We won't be able to dig anything until late august, If you want to take the lead, feel free to. If not we'll let you know when/if we deep dive into this.
Hello and thank you for the amazing work on the library.
We're working on an iOS forensics app and currently prototyping some code that will allow our researchers to pull specific, pinpointed events from UnifiedLogs. We based our code on one of your exemples, so it currently revolves around calling the
build_log
method to assembleLogData
and "send" them to be consumed by other parts of our code. In that context, and because we want our program to be as portable as possible, we're trying to have the lowest possible memory footprint and as such would like to be able to work with iterators rather than allocating an entire Vector of LogData which can be quite heavy.We were thinking of breaking down the
build_logs
method so that it can build vectors from the results of a newiter_logs
function that itself would return an iterator.Is there any footgun you can think of that we should be worrying about, and would you be open to us contributing such changes to the main codebase ?