Closed abmyii closed 4 years ago
Wow that's quite a bit of magic :) Can you help me understand what the benefit of this over doing the following would be (my sense is that that's what itertools.tee
does internally):
values = list(iterator())
table.insert_many(values)
This reminds me we need to document the dataset.chunked.ChunkInsert
-- that's what we use internally for handling streaming insert...
Thanks! I like to be concise with my statements, so I spent quite a bit of time refining the code.
I believe (not 100% sure and cannot remember the research I did before) thatitertools.tee
makes multiple generators which takes up less memory than 3/4 copies of the same data. It also means that generators passed into the function don't have to be converted to list.
TBH, I can't remember why and I doubt it would make a big difference on perf, but that would need testing to prove (IIRC (very vauge) I may have actually done so).
Perhaps it would only be worth it if we did test the performance and prove it is actually faster.
Just to be clear: my concern isn't so much about performance (as in CPU time), it's about memory usage. Offering an iterator interface, to me, is making a promise that you can insert any amount of data (ten records, a million records, half a billion records) and use a quasi constant amount of memory.
Yup, by "perf" I mean memory performance. Shall I profile the two?
P.S. Nice to see you back after the interim!
I think it'd be great to see a comparison of the memory use, including the ChunkedIterator
, list()
and this PR. My hunch is still that itertools.tee()
has no magic, therefore making this roughly equivalent to running list()
.
Any idea of the best way to do that? I was thinking of using the memory_profiler
module on code that inputs 100,000 rows and test with(out) itertools
.
Yes, although I think it'd be fun to try something like a million or two, too.
Here are the memory_profiler
results with an insert of 500,000 rows and then an upsert of the same rows.
test.py
:
import dataset
import random
db = dataset.connect('sqlite:///:memory:')
table = db['sometable']
@profile
def main():
# insert
data = []
for i in range(0, 500000):
data.append(dict(id=i, age=random.randint(0, 100)))
table.insert_many(data)
# upsert
data2 = []
for i in range(0, 500000):
data2.append(dict(id=i, age=random.randint(0, 100)))
table.upsert_many(data2, 'id')
if __name__ == '__main__':
main()
With itertools.tee
:
Filename: test.py
Line # Mem usage Increment Line Contents
================================================
8 60.477 MiB 60.477 MiB @profile
9 def main():
10 # insert
11 60.477 MiB 0.000 MiB data = []
12 196.777 MiB 0.258 MiB for i in range(0, 500000):
13 196.777 MiB 1.031 MiB data.append(dict(id=i, age=random.randint(0, 100)))
14 202.566 MiB 5.789 MiB table.insert_many(data)
15
16 # upsert
17 202.566 MiB 0.000 MiB data2 = []
18 338.406 MiB 0.254 MiB for i in range(0, 500000):
19 338.406 MiB 0.512 MiB data2.append(dict(id=i, age=random.randint(0, 100)))
20 369.918 MiB 31.512 MiB table.upsert_many(data2, 'id')
Without:
Filename: test.py
Line # Mem usage Increment Line Contents
================================================
8 60.695 MiB 60.695 MiB @profile
9 def main():
10 # insert
11 60.695 MiB 0.000 MiB data = []
12 196.527 MiB 0.258 MiB for i in range(0, 500000):
13 196.527 MiB 0.770 MiB data.append(dict(id=i, age=random.randint(0, 100)))
14 202.266 MiB 5.738 MiB table.insert_many(data)
15
16 # upsert
17 202.266 MiB 0.000 MiB data2 = []
18 338.160 MiB 0.258 MiB for i in range(0, 500000):
19 338.160 MiB 0.773 MiB data2.append(dict(id=i, age=random.randint(0, 100)))
20 366.188 MiB 28.027 MiB table.upsert_many(data2, 'id')
So, surprisingly, itertools.tee
uses 3 MB more memory than without - so can be safely ditched. My assumption that itertools reused a single generator, rather than using 3 copies of the data was incorrect.
It took a darned long time to get those results - around 40 mins per run, so I think if I test this again I'd use a much smaller number of rows.
Not sure how to change the file in the PR now (shows unknown repository
for whatever reason), so if you could do so I'd appreciate it. The "patch" is this:
diff --git a/dataset/table.py b/dataset/table.py
index 5993b07..d3ca447 100644
--- a/dataset/table.py
+++ b/dataset/table.py
@@ -129,15 +129,12 @@ class Table(object):
rows = [dict(name='Dolly')] * 10000
table.insert_many(rows)
"""
- # Create multiple instances of rows, needed if it is an iterator.
- instances = itertools.tee(rows, 3)
-
# Count the number of rows.
- num_rows = universal_len(instances[0])
+ num_rows = universal_len(rows)
# Sync table before inputting rows.
sync_row = {}
- for row in instances[1]:
+ for row in rows:
# Only get non-existing columns.
for key in set(row.keys()).difference(set(sync_row.keys())):
# Get a sample of the new column(s) from the row.
@@ -148,7 +145,7 @@ class Table(object):
columns = sync_row.keys()
chunk = []
- for index, row in enumerate(instances[2]):
+ for index, row in enumerate(rows):
chunk.append(row)
# Insert when chunk_size is fulfilled or this is the last row
@@ -195,18 +192,15 @@ class Table(object):
See :py:meth:`update() <dataset.Table.update>` for details on
the other parameters.
"""
- # Create multiple instances of rows, needed if it is an iterator.
- instances = itertools.tee(rows, 2)
-
# Count the number of rows.
- num_rows = universal_len(instances[0])
+ num_rows = universal_len(rows)
# Convert keys to a list if not a list or tuple.
keys = keys if type(keys) in (list, tuple) else [keys]
Updated the input_many and update_many functions so that an iterator rows input will be handled properly, as was mentioned by @pudo in my last pull request (#298). It may be a bit over-complicated, so may require modification.