Parsing performance opportunities

marxin commented 3 months ago

The following great optimization hints were originally posted as part of the discussion in https://github.com/mstange/samply/issues/290:

I couldn't resist looking at your profile and had some ideas for optimizations.

These are all about RPM parsing inside the rpm crate:

rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice. For example like this: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes);

rpm::rpm::package::PackageMetadata::get_file_entries collects a lot of metadata that you don't need; you only need the paths and the links. There's already a method called get_file_paths to get just the paths; maybe the rpm crate can add another one to get just the links, e.g. get_file_links.

IndexData::StringArray uses nom's complete::take_till to find the nul byte. It could instead use the memchr crate for SIMD-accelerated nul byte finding, and then it can do the slicing without nom.
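A minimal sketch of what nom-free slicing for a string array could look like. The function name split_nul_terminated is hypothetical and not part of the rpm crate. To stay free of dependencies the sketch searches with the standard library; memchr::memchr(0, data) from the memchr crate returns the same Option<usize> and could be swapped in for the SIMD-accelerated search.

```rust
/// Splits `data` into `count` nul-terminated byte strings without copying.
/// Returns `None` if fewer than `count` nul terminators are present.
/// (Hypothetical helper for illustration, not an rpm crate API.)
fn split_nul_terminated(mut data: &[u8], count: usize) -> Option<Vec<&[u8]>> {
    // Cap the reservation so a bogus `count` cannot force a huge allocation.
    let mut items = Vec::with_capacity(count.min(data.len()));
    for _ in 0..count {
        // Stand-in for `memchr::memchr(0, data)`: index of the next nul byte.
        let nul = data.iter().position(|&b| b == 0)?;
        // Borrow the string bytes directly from the input slice.
        items.push(&data[..nul]);
        // Continue after the terminator.
        data = &data[nul + 1..];
    }
    Some(items)
}
```

For example, split_nul_terminated(b"usr\0bin\0", 2) yields the two slices b"usr" and b"bin", and an input with a missing terminator yields None instead of a partial result.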

This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);
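As a rough illustration of the two variants side by side (the function names are made up for this sketch, and the real get_file_paths works on header data rather than &str arguments):

```rust
use std::path::PathBuf;

/// Variant using `join`: `PathBuf::from` allocates one buffer, and `join`
/// allocates a second buffer for the combined path.
fn full_path_join(dir: &str, basename: &str) -> PathBuf {
    PathBuf::from(dir).join(basename)
}

/// Variant using `push`: the basename is appended to the buffer created by
/// `PathBuf::from`. The buffer may still have to grow to fit the basename.
fn full_path_push(dir: &str, basename: &str) -> PathBuf {
    let mut path = PathBuf::from(dir);
    path.push(basename);
    path
}

/// Optional further step (an assumption, not from the discussion above):
/// reserve the final size up front so that no regrowth is needed.
fn full_path_reserved(dir: &str, basename: &str) -> PathBuf {
    let mut path = PathBuf::with_capacity(dir.len() + 1 + basename.len());
    path.push(dir);
    path.push(basename);
    path
}
```

All three produce the same path; they differ only in how many buffers are allocated along the way.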

String::from_utf8_lossy is called for all strings in all header entries, even for entry types that you don't look at. And in the end all these strings get converted into unix paths anyway, which don't need to be utf-8. Furthermore, the string bytes are copied twice: first from the file into a temporary Vec, and then from that Vec into the header entries. I think one could rework header parsing as follows, with some breaking changes for the API of the rpm crate:

During Header::parse, only read the bytes from the input and store them in a Vec that becomes a permanent part of the header. Also parse the IndexHeader and the list of IndexEntrys, and sanity-check the size of the remaining bytes against the sizes expected by the entries. But don't do any entry-type specific parsing here.

Only once somebody calls Header::get_entry_data_as_string_array or one of its friends, get the slice for the entry data from the Vec that's stored in the header. At this point the bytes can be converted into the expected format. For the string array cases, I would just compute byte slices into the header Vec and not convert those slices to strings (don't copy, don't utf-8 validate). Then, when get_file_paths makes the paths, it can just call Path::new(OsStr::from_bytes(byte_slice)) (at least if the target is unix), which would let it skip the utf-8 parsing.
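The proposed rework could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the rpm crate's actual types: LazyHeader, its fields, the reduced IndexEntry, get_entry_data_as_byte_strings and bytes_as_path are all hypothetical names, and the sketch only compiles on unix targets because it relies on std::os::unix::ffi::OsStrExt.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // provides OsStr::from_bytes (unix only)
use std::path::Path;

/// Hypothetical, reduced index entry: where an entry's data sits in the store.
struct IndexEntry {
    tag: u32,
    offset: usize,
    num_items: usize,
}

/// Hypothetical header that keeps the raw bytes and defers all decoding.
struct LazyHeader {
    entries: Vec<IndexEntry>,
    store: Vec<u8>,
}

impl LazyHeader {
    /// Decodes a string-array entry on demand. The returned slices borrow
    /// from `store`: no copy and no UTF-8 validation takes place.
    fn get_entry_data_as_byte_strings(&self, tag: u32) -> Option<Vec<&[u8]>> {
        let entry = self.entries.iter().find(|e| e.tag == tag)?;
        let mut data = self.store.get(entry.offset..)?;
        let mut items = Vec::new();
        for _ in 0..entry.num_items {
            // Find the nul terminator of the next string.
            let nul = data.iter().position(|&b| b == 0)?;
            items.push(&data[..nul]);
            data = &data[nul + 1..];
        }
        Some(items)
    }
}

/// Views raw bytes as a path without validating them as UTF-8.
fn bytes_as_path(bytes: &[u8]) -> &Path {
    Path::new(OsStr::from_bytes(bytes))
}
```

With a store of b"ls\0cat\0" and a single entry at offset 0 with two items, the lookup returns the slices b"ls" and b"cat", and bytes_as_path turns each into a &Path that still points into the header's buffer. The tag value used for such an entry is arbitrary in this sketch.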

marxin commented 3 months ago

rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice. For example like this: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes);

Good catch! I've just addressed that in #235.

marxin commented 3 months ago

This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);

Yes, that really helps and it's addressed in #237.

rpm-rs / rpm

Parsing performance opportunities #234