symerio / pgeocode

Postal code geocoding and distance calculation

https://pgeocode.readthedocs.io/

BSD 3-Clause "New" or "Revised" License

Faster dataset loading #6

Closed by rth 1 year ago

rth commented 5 years ago

Currently we load datasets with pd.read_csv from gzipped CSV files. Loading should be considerably faster if we convert the data to Parquet and use pd.read_parquet (this might also reduce download size, e.g. with snappy compression).

The limitations of this approach, though, are that the converted datasets would need to be hosted somewhere and a new dependency (pyarrow) would need to be added. I'm not sure it would be worth it.
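For illustration, here is a minimal sketch of what the Parquet route could look like while keeping pyarrow optional. The function name load_dataset, the file paths, and the postal_code column are assumptions made for this example, not pgeocode's actual code. The Parquet copy is built locally from the CSV, so the hosting question above is sidestepped rather than solved.

```python
import os

import pandas as pd


def load_dataset(csv_path, parquet_path):
    """Load a dataset, preferring a local Parquet copy over the gzipped CSV.

    Hypothetical helper: falls back to the CSV when no Parquet engine
    (pyarrow or fastparquet) is installed.
    """
    if os.path.exists(parquet_path):
        try:
            return pd.read_parquet(parquet_path)
        except ImportError:
            pass  # no Parquet engine available, use the CSV instead

    # Read postal codes as strings so leading zeros are preserved.
    df = pd.read_csv(csv_path, dtype={"postal_code": str})

    try:
        # Write a Parquet copy so that the next load can skip CSV parsing.
        df.to_parquet(parquet_path, compression="snappy", index=False)
    except ImportError:
        pass  # no Parquet engine available, skip the conversion

    return df
```

With this shape, users without pyarrow see no change in behaviour, and only those who install it get the faster path on the second and later loads.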

rth commented 4 years ago

Also, for caching, it might make sense to use pickle instead of CSV, though the cache would then be less portable across Python versions.
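A rough sketch of the pickle idea, assuming the cache is treated as disposable: if it cannot be read (for example after a Python or pandas upgrade), it is deleted and rebuilt from the CSV. The names load_with_pickle_cache and PICKLE_PROTOCOL, the paths, and the postal_code column are hypothetical, chosen only for this example.

```python
import os

import pandas as pd

# Protocol 4 can be read by Python 3.4 and later. Pinning it limits, but does
# not remove, the portability problem: pickles also depend on the pandas version.
PICKLE_PROTOCOL = 4


def load_with_pickle_cache(csv_path, cache_path):
    """Load a dataset from a pickle cache, rebuilding it from CSV if needed."""
    if os.path.exists(cache_path):
        try:
            return pd.read_pickle(cache_path)
        except Exception:
            # Corrupt or incompatible cache: discard it and rebuild below.
            os.remove(cache_path)

    # Read postal codes as strings so leading zeros are preserved.
    df = pd.read_csv(csv_path, dtype={"postal_code": str})
    df.to_pickle(cache_path, protocol=PICKLE_PROTOCOL)
    return df
```

Because unpickling can execute arbitrary code, a cache like this should only ever be read from a location the user controls, never downloaded.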

rth commented 1 year ago

Closing as not critical, unless someone feels loading is currently too slow.