tmontaigu / dbase-rs

Rust library to read & write dBase files.
MIT License

Improve performance of reading large files #22

Closed Maximkaaa closed 3 years ago

Maximkaaa commented 3 years ago

As a follow-up to #21, I've checked how long it takes other applications to deal with large shapefiles (and their dbf files).

It takes shp2pgsql less than 6 seconds to read a shapefile with 500 MB of dbf attribute data. This is around 4 times faster than just reading the dbf records with dbase-rs (after applying #21). It does not use multithreading or other dirty hacks, so there are clearly ways to improve performance.

@tmontaigu If you are interested in improving this crate, I'm willing to invest some time in investigating and trying different ideas.

Someone with some knowledge of C could probably help by extracting such ideas from the shp2pgsql source (https://postgis.net/docs/doxygen/3.2/d8/da3/shp2pgsql-core_8c_source.html).

tmontaigu commented 3 years ago

shp2pgsql seems to use shapelib, so most of the reading performance would come from there, I think. https://github.com/OSGeo/shapelib/blob/21ae8fc16afa15a1b723077b6cec3a9abc592f6a/dbfopen.c#L946

However, looking at the code I don't see anything extraordinary, which makes me think that dbase-rs is doing something bad.

The best way would be to profile the code. However, I'm on Windows, and the last time I tried to profile Rust code there it was painful. I'll see if I can get some results or boot into Linux, but that probably won't happen for a few weeks.

tmontaigu commented 3 years ago

One notable thing I see is that they use a buffer to store the record they are currently parsing: they read the whole record's bytes into an in-memory buffer, then the functions that read the record's fields look into that buffer.

In dbase-rs, fields are read one by one and the buffering is handled by the BufReader, but maybe that is not enough and we should have a buffer that holds the current record. A very quick profiling run seems to tell me that we spend a lot of time in the read_exact function of the BufReader<File>, so that may be worth trying.
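
To make the idea concrete, here is a rough sketch of the per-record buffer approach. It is not the actual dbase-rs code: `FieldInfo`, `RecordReader` and `next_record` are hypothetical names made up for illustration, every field is treated as plain text, and end of input is detected through `UnexpectedEof` rather than through the record count from the file header.

```rust
use std::io::{self, Read};

/// Hypothetical field description: a name and a byte width inside the record.
struct FieldInfo {
    name: &'static str,
    length: usize,
}

/// Hypothetical reader that issues one `read_exact` per record and then
/// slices the fields out of an in-memory buffer, instead of doing one
/// read per field.
struct RecordReader<R: Read> {
    source: R,
    fields: Vec<FieldInfo>,
    record_buf: Vec<u8>, // allocated once, reused for every record
}

impl<R: Read> RecordReader<R> {
    fn new(source: R, fields: Vec<FieldInfo>) -> Self {
        // Each dBase record starts with a 1-byte deletion flag,
        // followed by the fixed-width fields.
        let record_size = 1 + fields.iter().map(|f| f.length).sum::<usize>();
        Self { source, fields, record_buf: vec![0u8; record_size] }
    }

    /// Returns `Ok(None)` when the input is exhausted, otherwise the
    /// trimmed text of every field in the next record.
    fn next_record(&mut self) -> io::Result<Option<Vec<(&'static str, String)>>> {
        // A single read fills the whole record buffer.
        match self.source.read_exact(&mut self.record_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
            Err(e) => return Err(e),
        }

        let mut offset = 1; // skip the deletion flag
        let mut values = Vec::with_capacity(self.fields.len());
        for field in &self.fields {
            // Field parsing only touches memory, never the underlying reader.
            let raw = &self.record_buf[offset..offset + field.length];
            values.push((field.name, String::from_utf8_lossy(raw).trim().to_string()));
            offset += field.length;
        }
        Ok(Some(values))
    }
}
```

The source can still be a BufReader<File>, so the operating system is hit with large reads, while the per-field work becomes simple slicing of `record_buf`. Whether this actually removes the read_exact hot spot would need to be confirmed by profiling.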

Maximkaaa commented 3 years ago

I think we can close this issue. Even if there are further improvements possible, currently the main limiting factor for most practical applications is IO speed. @tmontaigu Thanks so much for your help! Do you plan to release a new version of the crate?

tmontaigu commented 3 years ago

Yep