netzkolchose / django-computedfields

Provides autogenerated autoupdated database fields for model methods.
MIT License

better updatedata command #99

Closed jerch closed 2 years ago

jerch commented 2 years ago

WIP for a better updatedata command.

Planned:

Fixes #32.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 2180197497


Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|------|---------------|---------------------|----|
| computedfields/resolver.py | 36 | 40 | 90.0% |
| computedfields/helper.py | 15 | 20 | 75.0% |
| Total | 70 | 79 | 88.61% |

Totals Coverage Status:

- Change from base Build 2169488544: -0.9%
- Covered Lines: 1109
- Relevant Lines: 1153

💛 - Coveralls
jerch commented 2 years ago

The querysize issue is a rather ugly one and needs to be handled in a bigger scope: basically any `update_dependent` call might pull a really big queryset deep down in the deps tree, consuming GBs of RAM.

Example: This can already be provoked with the dummy models in exampleapp, using 1000 records of Foo linking to 100 records of Bar each (100k total), which again link to 10 Baz each (1M total). Running `update_dependent(Foo.objects.all())` with all fields out of sync needs ~2 GB of RAM in Python with SQLite. Note that the huge memory usage mainly comes from the `select_related` joins; without those joins, however, performance degrades badly.
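The record fanout in that example multiplies out as follows (numbers from the comment; Foo/Bar/Baz are the exampleapp dummy models):

```python
# Record counts for the exampleapp scenario described above.
foos = 1000
bars = foos * 100   # each Foo links to 100 Bar records -> 100_000
bazs = bars * 10    # each Bar links to 10 Baz records  -> 1_000_000
print(bars, bazs)   # 100000 1000000
```

So a single cascading update starting from Foo can touch over a million dependent records, which is why materializing the joined querysets at once is so expensive.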

Solution:

Downside: this sacrifices some performance in highly intersecting updates, as it now leads to repeated update attempts on depending records. Still, it is adjustable, so this is more an issue of picking sensible defaults.

jerch commented 2 years ago

Hmm, `iterator()` is tricky to use, as it bypasses the queryset caching, which performs worse in dependency chains, since we have to evaluate the queryset multiple times (at least once more for pk extraction). This either needs a rewrite of the tree-descent lookup in `_querysets_for_update`, or, as a cheaper goal, queryset slices could be used instead (which should also preserve prefetch rules; not tested yet...)

Edit: The solution is to use both `iterator()` and slice notation, the latter being used whenever a prefetch optimization was found.
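The slice-based half of that idea can be sketched roughly like this; `slice_iterator` here is a simplified stand-in for the PR's helper, shown on any sliceable source (for a Django queryset, each slice issues one bounded query and keeps prefetch rules intact):

```python
def slice_iterator(source, querysize=2000):
    """Yield records from a sliceable source in chunks of `querysize`,
    so at most one chunk is materialized in memory at a time.
    Sketch only; the real helper operates on Django querysets, where
    source[pos:pos + querysize] translates to a LIMIT/OFFSET query."""
    pos = 0
    while True:
        chunk = list(source[pos:pos + querysize])  # one bounded fetch
        if not chunk:
            break
        yield from chunk
        pos += querysize

# Usage with a queryset would look like (not executed here):
# for obj in slice_iterator(SomeModel.objects.all(), querysize=2000): ...
```

The trade-off matches the comment above: bounded memory per chunk, at the cost of extra queries when the dependency descent re-evaluates the same queryset.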

jerch commented 2 years ago

With the new `slice_iterator` and a reasonable `COMPUTEDFIELDS_QUERYSIZE` setting, RAM usage can now be controlled pretty nicely. There is still a chance of higher consumption in deeply nested dependencies (we have to hold the symbols during the descent).
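Presumably the setting is configured like any other Django setting; a minimal sketch, with the value taken from the discussion below:

```python
# settings.py (sketch)
# Cap the number of records fetched per queryset chunk during
# computed-field updates; 2000 is the default mentioned in this PR.
COMPUTEDFIELDS_QUERYSIZE = 2000
```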

With the example from above and `COMPUTEDFIELDS_QUERYSIZE=2000` (the default value), usage never goes higher than 80 MB. On a side note, the runtime of the loop variant benefits the most from the smaller queryset footprints, with all fields desynced:

There is one more catch in the current loop variant: it saves the instance regardless of whether the fields were in sync. This could be optimized in the same way `bulk_updater` does it internally. The difference is rather huge when running `updatedata` with all fields in sync:

Comparing these numbers with the update numbers above, it seems that for `fast` the update writing took only ~20 s, while most of the time was spent querying the database for old values. In this light `bulk` looks pretty bad (writing takes ~4 min); for `loop` we don't know the difference yet.
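The skip-writes-when-in-sync optimization mentioned above can be sketched on plain dicts (the real code compares computed field values on model instances before saving; names here are illustrative):

```python
def save_if_desync(stored, computed):
    """Write only fields whose recomputed value differs from the stored
    one; return True when a write actually happened. Sketch of the
    'compare before save' idea, not the library's actual API."""
    changed = {f: v for f, v in computed.items() if stored.get(f) != v}
    if not changed:
        return False   # all fields already in sync: skip the save entirely
    stored.update(changed)
    return True

rec = {"name": "foo", "upper": "FOO"}
save_if_desync(rec, {"upper": "FOO"})   # in sync, nothing written
save_if_desync(rec, {"upper": "BAR"})   # desynced, field gets written
```

With all fields already in sync this turns every save into a no-op, which is where the huge runtime difference for `updatedata` comes from.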

Performance-wise, the holy grail for `updatedata` would be to figure out a BFS-flavored topsort on tables/model fields. For the example above, all it would in fact have to do is a single `update_dependent(Foo.objects.all())`, which finishes in 00:38 (all desync) resp. 00:24 (all sync) with `fast`. But that is very tricky to get right (not even possible in all cases) and needs fundamental changes in the graph handling, so it is left for a later exercise.
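The kind of ordering meant here can be sketched with Kahn's algorithm, which is exactly a BFS-flavored topsort; the model graph below is made up to mirror the Foo/Bar/Baz example, not taken from the library:

```python
from collections import deque

def topsort(deps):
    """Kahn's algorithm. deps maps each node to the set of nodes it
    depends on; returns an update order where every node comes after
    its dependencies, or None if the graph contains a cycle."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indeg = {n: len(deps.get(n, ())) for n in nodes}
    dependents = {n: [] for n in nodes}
    for n, ds in deps.items():
        for d in ds:
            dependents[d].append(n)
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in dependents[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order if len(order) == len(nodes) else None

# Bar's computed fields depend on Foo, Baz's on Bar:
assert topsort({"Baz": {"Bar"}, "Bar": {"Foo"}}) == ["Foo", "Bar", "Baz"]
```

Processing tables in such an order would let each model be updated exactly once; the `None` result for cyclic graphs illustrates why this cannot work for all dependency setups.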

jerch commented 2 years ago

The numbers above are for updating/cascading CFs only on the selected models (equivalent to calling `for model in models: update_dependent(model.objects.all(), update_fields=cfs_on_model)`), which is probably surprising/unwanted behavior. The old `updatedata` command simply did a full `for inst in qs: inst.save()` roundtrip without any field narrowing.

The last commit restores that behavior, as it is much easier to comprehend (but also much, much more expensive for the loop variant).

At this point I think the semantics of `updatedata` should stick to the "trigger updates FROM all fields" idea, as it is directly compatible with `for model in models: update_dependent(model.objects.all())`.

A third variant would do full "horizontal" updates on models without any descent (in conjunction with the proper topsort from above, probably also the fastest one). Big limitation: it would only work for non-cyclic deps, and only for full project-wide updates; anything less would introduce desyncs from `updatedata` itself.