petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License
1.22k stars 190 forks source link

Fix fromdictsgenerator separate function #624

Closed arturponinski closed 1 year ago

arturponinski commented 1 year ago

This PR has the objective of improving the support of generators in fromdicts. THe current implementation uses itertools.tee which according to docs and production deployments uses large amounts of memory, leading to out of memory kills of processes. This PR aims to revert the functionality and introduce a dedicated fromdictsgenerator function, which uses filecache, similar to sorting.

Changes

  1. Added new fromdictsgenerator to petl.io.json
  2. Reverted the detection of generators in fromdicts introduced in 1.7.5 and added a warning about passing the generator to fromdicts
  3. Added info about the function to docs
  4. Moved _iterchunk from sorts to petl.util.base. Imported to sorts as _iterchunk for BC

Checklist

Use this checklist for assuring the quality of pull requests that include new code and or make changes to existing code.

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 2659590742


Changes Missing Coverage Covered Lines Changed/Added Lines %
petl/io/json.py 40 42 95.24%
<!-- Total: 87 89 97.75% -->
Totals Coverage Status
Change from base Build 2458237953: 0.03%
Covered Lines: 12693
Relevant Lines: 13936

💛 - Coveralls
bmaggard commented 1 year ago

As we are not bumping releases, should this be handled transparently in fromdicts() as it currently is, so as not to break API?

juarezr commented 1 year ago

As we are not bumping releases, should this be handled transparently in fromdicts() as it currently is, so as not to break API?

Feedback from production deployments always looks like an excellent way to improve.

@arturponinski,

As you have explored this issue closer:

arturponinski commented 1 year ago

@bmaggard @juarezr the more I think about it, the more I lean to your suggestion to keep the support in the fromdicts.

  1. Regarding the flag, adding it would make it not used by anyone in my opinion.
  2. It seems as a regression, yes. This issue made us revert the petl update to the previous version we used
  3. I think using temporary files should not be a major issue now, especially that sorting also uses files for cache. Perhaps updating the docs, stating that files are used for cache when generators are passed is worth considering.

Due to above, I've created another PR which keeps the BC and uses the file cache: https://github.com/petl-developers/petl/pull/625

arturponinski commented 1 year ago

Superseded by https://github.com/petl-developers/petl/pull/625