pudo / dataset

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
https://dataset.readthedocs.org/
MIT License

Implement handling of iterator rows #299

Closed by abmyii 4 years ago

abmyii commented 5 years ago

Updated the insert_many and update_many functions so that an iterator passed as rows is handled properly, as @pudo mentioned in my last pull request (#298). It may be a bit over-complicated, so it may need modification.

pudo commented 4 years ago

Wow, that's quite a bit of magic :) Can you help me understand what the benefit of this would be over doing the following (my sense is that this is what itertools.tee does internally):

values = list(iterator())
table.insert_many(values)

This reminds me that we need to document dataset.chunked.ChunkedInsert -- that's what we use internally for handling streaming inserts...
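Usage is roughly along these lines (a sketch; ChunkedInsert buffers rows and flushes them to the table in batches):

from dataset import chunked

# Rows are buffered and written out in batches, so an arbitrarily large
# iterator can be inserted without building one big list in memory.
with chunked.ChunkedInsert(table) as chunk_tbl:
    for row in iterator():
        chunk_tbl.insert(row)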

abmyii commented 4 years ago

Thanks! I like to be concise with my statements, so I spent quite a bit of time refining the code.

I believe (not 100% sure, and I cannot remember the research I did before) that itertools.tee creates multiple generators, which takes up less memory than keeping 3/4 copies of the same data. It also means that generators passed into the function don't have to be converted to a list first.
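i.e. the goal is that something like this just works, without the caller building a list first (rough sketch):

import dataset

db = dataset.connect('sqlite:///:memory:')
table = db['sometable']

def rows():
    for i in range(500000):
        yield dict(id=i, age=0)

# With iterator handling in place, a generator goes straight in --
# the caller never builds an intermediate list.
table.insert_many(rows())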

TBH, I can't remember why, and I doubt it would make a big difference to performance, but that would need testing to prove (IIRC (very vague) I may actually have done so).

abmyii commented 4 years ago

Perhaps it would only be worth it if we tested the performance and proved it is actually faster.

pudo commented 4 years ago

Just to be clear: my concern isn't so much about performance (as in CPU time), it's about memory usage. Offering an iterator interface, to me, is making a promise that you can insert any amount of data (ten records, a million records, half a billion records) and use a quasi constant amount of memory.
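For the sake of argument, the kind of shape I mean is roughly this -- consume the iterator in fixed-size chunks so only one chunk is ever held at a time (insert_in_chunks is just an illustrative helper, not part of dataset):

import itertools

def insert_in_chunks(table, rows, chunk_size=1000):
    # Pull at most chunk_size rows off the iterator at a time, so memory
    # use depends on the chunk size rather than the total number of rows.
    rows = iter(rows)
    while True:
        chunk = list(itertools.islice(rows, chunk_size))
        if not chunk:
            break
        table.insert_many(chunk)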

abmyii commented 4 years ago

Yup, by "perf" I mean memory performance. Shall I profile the two?

P.S. Nice to see you back after the interim!

pudo commented 4 years ago

I think it'd be great to see a comparison of the memory use, including ChunkedInsert, list() and this PR. My hunch is still that itertools.tee() has no magic, which would make this roughly equivalent to running list().
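A quick sketch of why I suspect that: if one tee copy is drained before the others (e.g. to count the rows), tee keeps every item it has seen in an internal buffer, so the whole dataset ends up in memory anyway:

import itertools

def rows():
    for i in range(5):
        yield dict(id=i)

a, b, c = itertools.tee(rows(), 3)

# Draining one copy first forces tee to buffer every item until the
# other two copies have consumed it as well.
num_rows = sum(1 for _ in a)
print(num_rows)   # 5
print(list(b))    # replayed from tee's internal buffer, not re-generated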

abmyii commented 4 years ago

Any idea of the best way to do that? I was thinking of using the memory_profiler module on code that inserts 100,000 rows, testing with and without itertools.

pudo commented 4 years ago

Yes, although I think it'd be fun to try something like a million or two, too.

abmyii commented 4 years ago

Here are the memory_profiler results with an insert of 500,000 rows and then an upsert of the same rows.

test.py:

import dataset
import random

db = dataset.connect('sqlite:///:memory:')
table = db['sometable']

@profile
def main():
    # insert
    data = []
    for i in range(0, 500000):
        data.append(dict(id=i, age=random.randint(0, 100)))
    table.insert_many(data)

    # upsert
    data2 = []
    for i in range(0, 500000):
        data2.append(dict(id=i, age=random.randint(0, 100)))
    table.upsert_many(data2, 'id')

if __name__ == '__main__':
    main()
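(The @profile decorator comes from memory_profiler, e.g. when the script is run as python -m memory_profiler test.py.)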

With itertools.tee:

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     8   60.477 MiB   60.477 MiB   @profile
     9                             def main():
    10                                 # insert
    11   60.477 MiB    0.000 MiB       data = []
    12  196.777 MiB    0.258 MiB       for i in range(0, 500000):
    13  196.777 MiB    1.031 MiB           data.append(dict(id=i, age=random.randint(0, 100)))
    14  202.566 MiB    5.789 MiB       table.insert_many(data)
    15                             
    16                                 # upsert
    17  202.566 MiB    0.000 MiB       data2 = []
    18  338.406 MiB    0.254 MiB       for i in range(0, 500000):
    19  338.406 MiB    0.512 MiB           data2.append(dict(id=i, age=random.randint(0, 100)))
    20  369.918 MiB   31.512 MiB       table.upsert_many(data2, 'id')

Without:

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     8   60.695 MiB   60.695 MiB   @profile
     9                             def main():
    10                                 # insert
    11   60.695 MiB    0.000 MiB       data = []
    12  196.527 MiB    0.258 MiB       for i in range(0, 500000):
    13  196.527 MiB    0.770 MiB           data.append(dict(id=i, age=random.randint(0, 100)))
    14  202.266 MiB    5.738 MiB       table.insert_many(data)
    15                             
    16                                 # upsert
    17  202.266 MiB    0.000 MiB       data2 = []
    18  338.160 MiB    0.258 MiB       for i in range(0, 500000):
    19  338.160 MiB    0.773 MiB           data2.append(dict(id=i, age=random.randint(0, 100)))
    20  366.188 MiB   28.027 MiB       table.upsert_many(data2, 'id')

So, surprisingly, itertools.tee uses about 3 MB more memory than going without it - so it can safely be ditched. My assumption that itertools.tee would reuse a single generator, rather than keeping three copies of the data, was incorrect.

It took a darned long time to get those results - around 40 mins per run, so I think if I test this again I'd use a much smaller number of rows.

abmyii commented 4 years ago

Not sure how to change the file in the PR now (it shows "unknown repository" for whatever reason), so if you could do so I'd appreciate it. The "patch" is this:

diff --git a/dataset/table.py b/dataset/table.py
index 5993b07..d3ca447 100644
--- a/dataset/table.py
+++ b/dataset/table.py
@@ -129,15 +129,12 @@ class Table(object):
             rows = [dict(name='Dolly')] * 10000
             table.insert_many(rows)
         """
-        # Create multiple instances of rows, needed if it is an iterator.
-        instances = itertools.tee(rows, 3)
-
         # Count the number of rows.
-        num_rows = universal_len(instances[0])
+        num_rows = universal_len(rows)

         # Sync table before inputting rows.
         sync_row = {}
-        for row in instances[1]:
+        for row in rows:
             # Only get non-existing columns.
             for key in set(row.keys()).difference(set(sync_row.keys())):
                 # Get a sample of the new column(s) from the row.
@@ -148,7 +145,7 @@ class Table(object):
         columns = sync_row.keys()

         chunk = []
-        for index, row in enumerate(instances[2]):
+        for index, row in enumerate(rows):
             chunk.append(row)

             # Insert when chunk_size is fulfilled or this is the last row
@@ -195,18 +192,15 @@ class Table(object):
         See :py:meth:`update() <dataset.Table.update>` for details on
         the other parameters.
         """
-        # Create multiple instances of rows, needed if it is an iterator.
-        instances = itertools.tee(rows, 2)
-
         # Count the number of rows.
-        num_rows = universal_len(instances[0])
+        num_rows = universal_len(rows)

         # Convert keys to a list if not a list or tuple.
         keys = keys if type(keys) in (list, tuple) else [keys]
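(universal_len here is the row-counting helper added earlier; the idea is roughly the following, give or take the exact implementation:)

def universal_len(obj):
    # Lists and tuples report their length directly; for anything else
    # (e.g. a generator) count by iterating -- which consumes the iterator.
    try:
        return len(obj)
    except TypeError:
        return sum(1 for _ in obj)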