petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

Generator support in fromdicts requires large amount of memory #618

Closed arturponinski closed 2 years ago

arturponinski commented 2 years ago

The PR https://github.com/petl-developers/petl/issues/569, which introduced generator support in fromdicts, has increased memory usage on our production instances.

Problem description

Per itertools.tee docs:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
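
A minimal sketch of the effect the docs describe, not petl code: it compares peak memory when a generator of dicts is consumed once against consuming it through `itertools.tee`, with the first copy exhausted before the second starts. The `rows` and `peak_bytes` helpers and the row count are hypothetical, chosen only for illustration.

```python
import itertools
import tracemalloc


def rows(n):
    """Hypothetical row source: yields n small dicts, one at a time."""
    for i in range(n):
        yield {"id": i, "value": str(i)}


def peak_bytes(func):
    """Run func and return the peak traced memory in bytes."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def consume_single(n=20_000):
    # Each row becomes garbage as soon as the loop moves on.
    for _ in rows(n):
        pass


def consume_teed(n=20_000):
    first, second = itertools.tee(rows(n))
    # Exhausting `first` forces tee to buffer every row for `second`.
    for _ in first:
        pass
    for _ in second:
        pass


peak_single = peak_bytes(consume_single)
peak_teed = peak_bytes(consume_teed)
print(f"single pass: {peak_single} bytes, teed: {peak_teed} bytes")
```

In the teed run the peak grows with the number of rows, because every row stays buffered until the second iterator catches up.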

This is most likely the cause. To address it, generator support should:

  1. Be moved to a separate method, e.g. fromdictsgenerator
  2. Use a temporary file, similar to how SortView does
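
The second point could be sketched roughly as follows. This is an assumption about the general shape of a temp-file approach, not petl's implementation and not how SortView works internally. The `SpooledRows` class and its `close` method are hypothetical names. Rows are pickled to a temporary file once, and each iteration reads them back from disk, so memory use does not grow with the number of rows.

```python
import os
import pickle
import tempfile


class SpooledRows:
    """Hypothetical helper: makes a one-shot generator re-iterable via a temp file."""

    def __init__(self, source):
        # Write every row to disk once; only one row is held in memory at a time.
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as spool:
            self._path = spool.name
            for row in source:
                pickle.dump(row, spool, protocol=pickle.HIGHEST_PROTOCOL)

    def __iter__(self):
        # Each pass opens its own handle, so independent iterations do not interfere.
        with open(self._path, "rb") as spool:
            while True:
                try:
                    yield pickle.load(spool)
                except EOFError:
                    return

    def close(self):
        # Remove the backing file once the rows are no longer needed.
        if os.path.exists(self._path):
            os.remove(self._path)


spooled = SpooledRows({"id": i} for i in range(3))
first_pass = list(spooled)
second_pass = list(spooled)
spooled.close()
```

Note that this sketch consumes the generator eagerly at construction time. A real implementation might defer spooling until the first iteration.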

bmaggard commented 2 years ago

The problem description does not describe a "memory leak". Perhaps something like "Generator support in fromdicts requires large amounts of memory" would be a more appropriate title?

arturponinski commented 2 years ago

Fair point, description updated