pudo / dataset

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
https://dataset.readthedocs.org/
MIT License

Implement update_many, upsert_many and refactor for a 2x speed-up of insert_many #298

Closed abmyii closed 5 years ago

abmyii commented 5 years ago

This is my first PR - I hope you will accept it! I do understand, however, that it may require modification.

The project I am working on has dataset transactions which could be simplified and sped up by upsert_many, so I decided to attempt implementing it. Whilst doing so, I realised I also had to implement update_many. After that, I started tinkering with the insert_many code and saw that _sync_columns was run on every row, which slows it down to around half speed. By checking the columns only once, before inserting, all missing columns are created up front and no further checks are needed for the rest of the process.
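To illustrate the idea of syncing columns once rather than per row, here is a minimal standalone sketch. It does not use dataset's internals: `collect_columns` and `insert_many_sketch` are hypothetical helpers, and the plain list `table_columns` stands in for the table schema that `_sync_columns` would normally update.

```python
def collect_columns(rows):
    """Return the union of keys across all rows, in first-seen order."""
    columns = []
    seen = set()
    for row in rows:
        for key in row:
            if key not in seen:
                seen.add(key)
                columns.append(key)
    return columns


def insert_many_sketch(existing_columns, rows):
    """Hypothetical stand-in for a bulk insert with a single schema check."""
    rows = list(rows)  # materialise so the rows can be read twice
    # One pass over the rows finds every column that is not in the schema yet.
    missing = [c for c in collect_columns(rows) if c not in existing_columns]
    existing_columns.extend(missing)  # stands in for one schema sync
    # No per-row schema check is needed from here on: pad each row to the
    # full column set so every row has the same shape for a bulk insert.
    return [{c: row.get(c) for c in existing_columns} for row in rows]


table_columns = ["id"]
padded = insert_many_sketch(
    table_columns,
    [{"id": 1, "name": "a"}, {"id": 2, "age": 30}],
)
```

After the call, `table_columns` holds every column seen in the batch, and each padded row carries `None` for the fields it did not supply.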

The implementation of update_many is also very fast: in my test, updating 1,000,000 rows (each with a single integer field) with a random integer took 9.12s, roughly 31x faster than updates in a transaction (285.86s) and roughly 34x faster than updates without a transaction (307.95s).
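For readers unfamiliar with batched updates, the general technique can be sketched with Python's standard `sqlite3` module. This is not the PR's implementation, only an illustration of sending many parameter sets through one prepared UPDATE statement inside a single transaction. The table name `numbers` and the row count are assumptions chosen to keep the example small.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO numbers (id, value) VALUES (?, ?)",
    [(i, 0) for i in range(1, 1001)],
)

# One (new_value, id) pair per row to update; values are always >= 1.
updates = [(random.randint(1, 100), i) for i in range(1, 1001)]

# A single prepared statement is executed for all parameter sets, and the
# connection context manager commits once at the end.
with conn:
    conn.executemany("UPDATE numbers SET value = ? WHERE id = ?", updates)

changed = conn.execute(
    "SELECT COUNT(*) FROM numbers WHERE value > 0"
).fetchone()[0]
```

Issuing one statement per row, each with its own round trip, is the slow path that the timings above are compared against.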

I also wrote some crude and unimaginative tests to go with the additions.

Closes https://github.com/pudo/dataset/issues/249. Thanks for this great library!

pudo commented 5 years ago

Thanks for making your first PR here! This looks cleanly implemented and well thought out. My only concern is that people may pass iterators as rows, and those cannot be read multiple times. I'm not 100% sure that would still work. Going to merge.
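The concern about single-use iterators can be shown in plain Python, independent of dataset. In this sketch, `two_pass` is a hypothetical function that reads its input twice, once to gather column names and once to collect rows, which is the general pattern that breaks when the input is a generator. Whether dataset's code is affected in exactly this way is not confirmed here.

```python
def row_source():
    """A generator: it can only be consumed once."""
    yield {"id": 1, "name": "a"}
    yield {"id": 2, "age": 30}


def two_pass(rows):
    """Hypothetical two-pass reader: columns first, then the rows."""
    columns = sorted({key for row in rows for key in row})  # first pass
    inserted = [row for row in rows]  # second pass: empty for a generator
    return columns, inserted


# Passing the generator directly: the second pass finds nothing left.
cols_gen, inserted_gen = two_pass(row_source())

# Materialising into a list first makes both passes see every row.
cols_list, inserted_list = two_pass(list(row_source()))
```

With the generator, the columns are found but no rows remain for the second pass. Wrapping the input in `list(...)` avoids this, at the cost of holding all rows in memory.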

abmyii commented 5 years ago

Thanks for merging! Could you please elaborate on the potential problem with iterator rows?

abmyii commented 5 years ago

After looking into the problem a bit, I submitted PR #299 to implement the handling of iterators.