mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

I'd recommend using Parquet partitions #6

veonua closed 3 years ago

veonua commented 3 years ago

It gives a smaller file size and faster save/load times, and it is supported by the majority of data libraries.

veonua commented 3 years ago

wallstreetbets_posts.csv: > 920 MB
wallstreetbets_posts.parquet: ~ 120 MB

mattpodolak commented 3 years ago

Hi @veonua, thanks for pointing this out. However, this is not pmaw functionality; it is post-processing done after the responses have been retrieved with pmaw, so I will be closing this issue.

The documentation will continue to provide an example using .csv, as this benefits the largest number of users.

veonua commented 3 years ago

The proposal was for the cache as well.

mattpodolak commented 3 years ago

Ah okay, sorry, that wasn't clear. I know Parquet takes up less space than .pickle, which is currently being used, but I haven't done any benchmarks with it yet.

Added as a feature request - https://github.com/mattpodolak/pmaw/issues/7. Thanks!