Closed Maximkaaa closed 3 years ago
Interesting
I'd like to find a 500Mb file to test out those improvements.
Removes the `Vec::resize()` call in the `read_string_of_len` function. Using `vec![0; len]` instead gives significant improvement.
Seems strange, I would have expected the two to be somewhat equivalent, but that's cool.
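For reference, the two allocation styles being compared can be sketched as below. This is a minimal illustration, not the code from the PR: the function names `zeroed_with_macro` and `zeroed_with_resize` are hypothetical. Both produce the same zero-filled buffer; the difference reported in this thread is only in how long it takes, possibly because `vec![0; len]` can ask the allocator for memory that is already zeroed.

```rust
/// Hypothetical helper: allocate a zero-filled buffer in one step.
fn zeroed_with_macro(len: usize) -> Vec<u8> {
    vec![0u8; len]
}

/// Hypothetical helper: create an empty vector, then grow it to `len`,
/// filling the new slots with zeros.
fn zeroed_with_resize(len: usize) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.resize(len, 0u8);
    buf
}
```

Since the results are identical, swapping one for the other does not change what the caller reads into the buffer.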
I did some more performance testing just to be sure. I used two files for it: 500 MB and 39 MB. The test code is just:
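```rust
#[test]
fn large() {
    let mut reader = Reader::from_path("./tests/data/gis_osm_water_a_free_1.dbf").unwrap();
    // Instant is monotonic, so it is better suited to timing than SystemTime.
    let start = std::time::Instant::now();
    reader.read().unwrap();
    let elapsed = start.elapsed().as_millis();
    // Fails on purpose so that the test output shows the elapsed milliseconds.
    assert_eq!(0, elapsed);
}
```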
The results are:
| Change | 39 MB | 500 MB |
|---|---|---|
| Without modification | 1067 ms | 69513 ms |
| `.resize()` -> `vec![0; len]` | 851 ms | 63046 ms |
| All improvements | 521 ms | 21756 ms |
Both release and debug builds show a similar performance improvement.
Unfortunately, I cannot provide the 500 MB file I use, as it's commercial data. The smaller one I downloaded from the Geofabrik OSM download page. There are larger ones there too, so you can probably find some files for testing (https://download.geofabrik.de/europe.html).
I also updated the PR to make the fmt check pass. Only after that did I notice that it fails on a file I didn't originally change.
This commit improves the way values are read from dbf files, speeding up file processing by almost 3 times.
The key points of improvement:
- Removes the `Vec::resize()` call in the `read_string_of_len` function. Using `vec![0; len]` instead gives significant improvement.
- Instead of using `String::trim()` on the read values (to remove spaces), we first skip the empty bytes and then convert the value into a string. Removing the `trim` calls improves performance almost twofold.
- Using unchecked indexing on the vector when trimming spaces provides a ~30% improvement over the safe version. Using unsafe here is not strictly necessary, but the logic in the unsafe block is quite straightforward, so it should be OK.
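The trim-before-convert idea from the list above can be sketched roughly as follows. This is a safe illustration only, not the code from the PR: `trim_padding` and `field_to_string` are hypothetical names, treating both spaces and NUL bytes as padding is an assumption, and the unchecked-indexing variant mentioned above is deliberately left out.

```rust
/// Hypothetical helper: return the part of `bytes` that remains after
/// skipping leading and trailing padding (assumed here to be b' ' or 0).
fn trim_padding(bytes: &[u8]) -> &[u8] {
    fn is_padding(b: &u8) -> bool {
        *b == b' ' || *b == 0
    }
    // Index of the first non-padding byte, or the end if there is none.
    let start = bytes
        .iter()
        .position(|b| !is_padding(b))
        .unwrap_or(bytes.len());
    // One past the last non-padding byte; falls back to `start` (empty slice).
    let end = bytes
        .iter()
        .rposition(|b| !is_padding(b))
        .map_or(start, |i| i + 1);
    &bytes[start..end]
}

/// Hypothetical helper: convert only the trimmed bytes into a `String`,
/// so no separate `String::trim()` pass over the text is needed.
fn field_to_string(bytes: &[u8]) -> String {
    String::from_utf8_lossy(trim_padding(bytes)).into_owned()
}
```

Working on the byte slice first means the string conversion only sees the bytes that are kept, which is the effect the second bullet describes.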
Overall, in my case, reading a 500 MiB dbf file takes:
- 1 min 20 sec before the change
- 34 sec after the change