observation: avoiding loading entire documents into memory for processing usually implies a streaming approach during ingestion, though I have no clue how one does this in Rust with serde.
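For what it's worth, serde_json does support streaming a *sequence* of top-level JSON values via its `StreamDeserializer`. A minimal sketch, assuming newline-delimited records and a hypothetical `Record` type:

```rust
use serde::Deserialize;
use std::fs::File;
use std::io::BufReader;

// Hypothetical record type; a real SBOM entry would be much richer.
#[derive(Deserialize)]
struct Record {
    id: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = BufReader::new(File::open("records.ndjson")?);

    // `into_iter` yields one deserialized value at a time, so only a single
    // `Record` (plus the reader's buffer) is resident per iteration.
    for record in serde_json::Deserializer::from_reader(reader).into_iter::<Record>() {
        let record = record?;
        println!("{}", record.id);
        // `record` is dropped here, before the next value is read.
    }
    Ok(())
}
```

Streaming *within* a single large document (one SBOM with one huge array, say) is harder and generally needs a custom `Deserialize` impl or visitor; serde doesn't give that for free.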
I fear a pattern we have for SBOMs is:
Holding on to results of every step, and only dropping all memory at the end.
> observation: avoiding loading entire documents into memory for processing usually implies a streaming approach during ingestion, though I have no clue how one does this in Rust with serde.
In v1, we went to great pains to avoid loading an entire doc into memory, only to later realize that it was the only way we could accurately validate the document. In light of that experience, we avoided all that pain in v2, at the risk of memory exhaustion. We can probably strike a better balance, but we need to ensure that we can sufficiently validate a doc that doesn't fully reside in RAM. We should solve the validation problem first.
> Holding on to results of every step, and only dropping all memory at the end.
Dropping the byte array after deserializing might be the lowest-hanging fruit?
> Dropping the byte array after deserializing might be the lowest-hanging fruit?
array -> `Vec`? Yes, or I'm not understanding Rust yet.
Yep. Let's add some scoping or explicit `drop`s.
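A minimal sketch of what that could look like, with a hypothetical `Document` type and `process` step standing in for the real pipeline:

```rust
use serde::Deserialize;

// Hypothetical stand-in for a parsed document type.
#[derive(Deserialize)]
struct Document {
    name: String,
}

// Hypothetical downstream pipeline step.
fn process(doc: Document) {
    println!("processing {}", doc.name);
}

fn ingest(bytes: Vec<u8>) -> Result<(), serde_json::Error> {
    let doc: Document = serde_json::from_slice(&bytes)?;

    // Explicitly release the raw byte buffer as soon as parsing is done,
    // instead of keeping it alive through the rest of the pipeline.
    drop(bytes);

    process(doc);
    Ok(())
}

fn main() -> Result<(), serde_json::Error> {
    ingest(br#"{"name":"example"}"#.to_vec())
}
```

`serde_json::from_slice` only borrows the buffer, so the bytes can be dropped the moment parsing returns.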
FWIW, if the validators also do not stream, then that might be a problem to address upstream (if we are validating with a third-party parser).
Any thoughts on the best way to measure memory usage in Rust programs? What's the simplest way I can determine what effect my changes make?
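Externally, tools like `heaptrack` or valgrind's massif can profile the heap of any native binary. In-process, one low-tech option is a counting wrapper around the system allocator; a sketch (my own, not anything trustify ships):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Running total of live heap bytes, updated on every alloc/dealloc.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

fn main() {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let buf = vec![0u8; 1 << 20]; // allocate 1 MiB
    println!("live heap: {} -> {} bytes", before, ALLOCATED.load(Ordering::Relaxed));
    drop(buf);
    println!("after drop: {} bytes", ALLOCATED.load(Ordering::Relaxed));
}
```

Reading `ALLOCATED` before and after a code path gives a rough delta for that path, which is usually enough to tell whether a change helped.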
> Holding on to results of every step, and only dropping all memory at the end.
I need help finding an example of this. When I look at the loaders, they each deserialize a byte buffer, e.g.
```rust
let csaf: Csaf =
    info_span!("parse document").in_scope(|| serde_json::from_reader(document))?;
```
But because `document` is moved into `from_reader`, I believe it's dropped once the `Csaf` struct is returned, right? Where is an example of us holding onto the bytes after deserialization?
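A quick way to convince yourself of that move-then-drop behavior, with a toy type standing in for the reader (nothing trustify-specific here):

```rust
struct Noisy(&'static str);

impl Drop for Noisy {
    fn drop(&mut self) {
        println!("dropped: {}", self.0);
    }
}

// Takes ownership by value, like `serde_json::from_reader` takes its reader.
fn consume(_n: Noisy) {
    // `_n` goes out of scope, and is dropped, when this function returns.
}

fn main() {
    let reader = Noisy("reader");
    consume(reader);
    println!("back in main"); // "dropped: reader" prints before this line
}
```

Since the value moved into the function is dropped when the function returns, the reader shouldn't outlive the `from_reader` call.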
Comparing today with yesterday, I see a significant improvement in memory usage. Maybe removing the format detection did that.
Another component I identified as a memory hog during load testing was the GraphQL endpoint. It bumped the memory usage from about 2 GiB to 16 GiB, just by using that endpoint with a single user.
I created a tracking issue for that: https://github.com/trustification/trustify/issues/750
@ctron @jcrossley3 can we declare victory on this?
Strictly speaking, we probably have accomplished the goal as expressed in the title, so I'm ok closing this. We can create more specific issues as future perf challenges are identified.
> Strictly speaking, we probably have accomplished the goal as expressed in the title, so I'm ok closing this. We can create more specific issues as future perf challenges are identified.
You're right!
I noticed that, at least when importing, the server might consume huge amounts of memory (~3 GiB). I know that some documents are just big. But we should keep an eye on this and try to improve the situation where possible.