It's 3x slower than pandas for larger files

juancarlospaco / faster-than-csv

Faster CSV for Python

https://juancarlospaco.github.io/faster-than-csv

MIT License

99 stars, 8 forks

It's 3x slower than pandas for larger files #9

Closed kootenpv closed 4 years ago

kootenpv commented 4 years ago

It would be a good idea to include a benchmark with larger, more heterogeneous data.

juancarlospaco commented 4 years ago

Provide full repro code and minimal sample data.

kootenpv commented 4 years ago

Please search GitHub for CSV files larger than 50 MB; there are plenty of examples (I'm on my phone right now).

juancarlospaco commented 4 years ago

No, I meant that you should provide full repro code and a tiny, minimal sample of data. I don't want the repo to become slow or harder to clone because of sample data, and I am more interested in the code that you are using to debug the bug.

Also, other libs may have improved by now. When this lib started, it was a lot faster than the others and required no dependencies, while the others had tons of dependencies. I think that's important nowadays, when everyone uses Docker and Alpine. If other libs have improved, good for them.

kootenpv commented 4 years ago

Ah my bad. I just compared:

csv2list (3 s) against pd.read_csv (1 s) on an 80 MB data file (which I cannot share). csv2dict crashed on it.
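Since the 80 MB file cannot be shared, a comparison like this can still be made reproducible by generating the data on the fly. The following is only a minimal sketch under stated assumptions: the helper names (write_sample_csv, read_with_stdlib, best_time, run_benchmark) and the column layout are hypothetical, not part of faster-than-csv, and the generated data is not the file measured above. The stdlib csv reader stands in as a baseline; pd.read_csv is added only if pandas happens to be installed.

```python
import csv
import os
import random
import tempfile
import time


def write_sample_csv(path, rows, seed=0):
    """Write a CSV with mixed column types: int, float, text, and empty cells."""
    rng = random.Random(seed)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "price", "name", "note"])
        for i in range(rows):
            # Every fifth row has an empty cell; the rest need quoting.
            note = "" if i % 5 == 0 else "text, with comma"
            writer.writerow([i, round(rng.uniform(0, 1000), 2), f"item{i}", note])


def read_with_stdlib(path):
    """Baseline reader: parse the whole file into a list of rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def best_time(reader, path, repeats=3):
    """Return the fastest wall-clock time of reader(path) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        timings.append(time.perf_counter() - start)
    return min(timings)


def run_benchmark(readers, rows=100_000):
    """Generate a temporary CSV and time each reader on the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        write_sample_csv(path, rows)
        return {name: best_time(fn, path) for name, fn in readers.items()}


if __name__ == "__main__":
    readers = {"stdlib csv": read_with_stdlib}
    try:
        import pandas as pd  # optional; skipped when not installed
        readers["pandas read_csv"] = pd.read_csv
    except ImportError:
        pass
    # A faster-than-csv reader could be registered the same way; its exact
    # call signature is not shown in this thread, so it is left out here.
    for name, seconds in run_benchmark(readers).items():
        print(f"{name}: {seconds:.3f}s")
```

Because the file is created in a temporary directory at run time, nothing large has to be committed to the repo, and the rows argument can be raised until the file reaches the size where the slowdown shows up.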

juancarlospaco commented 4 years ago

Try the new version, which should be a little better:

sudo pip3 install faster-than-csv==edca7e6 --no-binary :all:

You can pass the column count argument for slightly better performance, or leave it as 0 otherwise. csv2list() should work now.