Closed abmyii closed 5 years ago
Thanks for making your first PR here! This looks cleanly implemented and well thought out. My only concern is people passing iterators as rows,
since those cannot be read multiple times. I'm not 100% sure that would still work. Going to merge.
Thanks for merging! Could you please elaborate on the potential problem with iterator rows?
After looking into the problem a bit, I submitted PR #299 to implement the handling of iterators.
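The iterator concern can be illustrated with a minimal, standalone sketch. It is not taken from dataset's code; the function names `collect_columns_then_rows` and `collect_columns_single_pass` are hypothetical. It assumes an approach that walks the rows once to gather column names and then a second time to insert them, which silently yields nothing on the second pass when the rows come from a one-shot generator.

```python
def collect_columns_then_rows(rows):
    """Two-pass sketch: gather column names, then walk the rows again."""
    columns = set()
    for row in rows:                  # first pass consumes a generator completely
        columns.update(row.keys())
    second_pass = [row for row in rows]  # empty if rows was a one-shot iterator
    return columns, second_pass


def collect_columns_single_pass(rows):
    """Single-pass sketch: buffer each row while gathering its column names."""
    columns = set()
    buffered = []
    for row in rows:
        columns.update(row.keys())
        buffered.append(row)
    return columns, buffered


data = [{"id": 1, "name": "a"}, {"id": 2, "age": 3}]

list_columns, list_rows = collect_columns_then_rows(data)
gen_columns, gen_rows = collect_columns_then_rows(row for row in data)
safe_columns, safe_rows = collect_columns_single_pass(row for row in data)
```

With a list, both passes see every row; with a generator, the two-pass version finds the columns but has no rows left to insert, while the buffering version keeps them.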
This is my first PR - I hope you will accept it! I do understand, however, that it may require modification.
The project I am working on has dataset transactions which could be simplified and sped up by upsert_many, so I decided to attempt implementing it. Whilst doing so, I realised I also had to implement update_many. After that, I started tinkering with the insert_many code and saw that _sync_columns was run on every row, which slows it down to around half-speed. By checking the columns only once, before inserting, all of the missing columns are created up front and no further checks are required for the rest of the process.
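The idea of syncing columns once before inserting can be sketched roughly as follows. This is an illustration using the standard-library `sqlite3` module rather than dataset's actual `insert_many` or `_sync_columns` code; the function `insert_many_sketch` and the `people` table are hypothetical, and column names are assumed to be simple identifiers.

```python
import sqlite3


def insert_many_sketch(conn, table, rows):
    """Create all missing columns up front, then bulk-insert without per-row checks."""
    rows = list(rows)  # buffer so the rows can be walked twice
    if not rows:
        return
    # Gather every column name used by any row, preserving first-seen order.
    wanted = []
    for row in rows:
        for name in row:
            if name not in wanted:
                wanted.append(name)
    # One schema check for the whole batch instead of one per row.
    existing = {info[1] for info in conn.execute(f"PRAGMA table_info({table})")}
    for name in wanted:
        if name not in existing:
            conn.execute(f'ALTER TABLE {table} ADD COLUMN "{name}"')
    cols = ", ".join(f'"{n}"' for n in wanted)
    params = ", ".join(f":{n}" for n in wanted)
    # Rows that lack a column get NULL for it.
    conn.executemany(
        f"INSERT INTO {table} ({cols}) VALUES ({params})",
        [{n: row.get(n) for n in wanted} for row in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY)")
insert_many_sketch(conn, "people", [{"name": "a"}, {"name": "b", "age": 3}])
result = conn.execute("SELECT name, age FROM people ORDER BY id").fetchall()
```

The schema is inspected and altered once per batch, so the insert loop itself no longer needs to look at column definitions.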
The implementation of update_many is also very fast. In my test, updating 1,000,000 rows (each with a single integer field) to a random integer took 9.12s, compared with 285.86s for updates in a transaction and 307.95s for updates without a transaction, which by those timings is roughly 31x and 34x faster respectively.
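For illustration, a batched update of this general shape can be sketched with the standard-library `sqlite3` module. This is not the PR's implementation and makes no claim about its timings; the function `update_many_sketch`, the table `t`, and the `chunk_size` default are assumptions chosen for the example.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO t (id, value) VALUES (?, ?)", [(i, 0) for i in range(1000)]
)

# One dict per row: the key column plus the new value.
updates = [{"id": i, "value": random.randint(1, 100)} for i in range(1000)]


def update_many_sketch(conn, rows, chunk_size=250):
    """Send updates in chunks with executemany instead of one statement per call."""
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        conn.executemany("UPDATE t SET value = :value WHERE id = :id", chunk)
    conn.commit()  # a single commit for the whole batch


update_many_sketch(conn, updates)
stored = dict(conn.execute("SELECT id, value FROM t"))
```

Grouping the statements and committing once avoids the per-row round trips and per-row commits that make individual updates slow.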
I also wrote some crude and unimaginative tests to go with the additions.
Closes https://github.com/pudo/dataset/issues/249. Thanks for this great library!