netzkolchose / django-computedfields

Provides autogenerated autoupdated database fields for model methods.
MIT License

better updatedata command #99

Closed jerch closed 2 years ago

jerch commented 2 years ago

WIP for a better updatedata command.

Planned:

Fixes #32.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 2180197497


Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|------|---------------|---------------------|----|
| computedfields/resolver.py | 36 | 40 | 90.0% |
| computedfields/helper.py | 15 | 20 | 75.0% |
| Total | 70 | 79 | 88.61% |

Totals Coverage Status:

- Change from base Build 2169488544: -0.9%
- Covered Lines: 1109
- Relevant Lines: 1153

💛 - Coveralls
jerch commented 2 years ago

The querysize issue is a rather ugly one and needs to be handled in a bigger scope: basically any `update_dependent` call might pull a really big queryset deep down in the deps tree, consuming GBs of RAM.

Example: This can already be provoked with the dummy models in exampleapp, using 1000 records of Foo linking to 100 records of Bar each (100k total), which again link to 10 Baz each (1M total). Running `update_dependent(Foo.objects.all())` with all fields out of sync needs ~2 GB of RAM in Python with SQLite. Note that the huge memory usage mainly comes from the `select_related` joins; without those joins, however, performance degrades badly.
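The record fanout in that example multiplies out as follows (numbers from the comment; Foo/Bar/Baz are the exampleapp dummy models):

```python
# Record counts for the exampleapp scenario described above.
foos = 1000
bars = foos * 100   # each Foo links to 100 Bar records -> 100_000
bazs = bars * 10    # each Bar links to 10 Baz records  -> 1_000_000
print(bars, bazs)   # 100000 1000000
```

So a single cascading update starting from Foo can touch over a million dependent records, which is why materializing the joined querysets at once is so expensive.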

Solution:

Downside: this sacrifices some performance in highly intersecting updates, as it now leads to repeated update attempts on depending records. Still, it is adjustable, so this is more an issue of picking sensible defaults.

jerch commented 2 years ago

Hmm, `iterator()` is tricky to use, as it bypasses the queryset caching, which performs worse in dependency chains, since we have to evaluate the queryset multiple times (at least once more for pk extraction). This either needs a rewrite of the tree-descent lookup in `_querysets_for_update`, or, as a cheaper goal, queryset slices could be used instead (which should also preserve prefetch rules; not tested yet...)

Edit: The solution is to use both `iterator()` and slice notation, the latter being used whenever a prefetch optimization was found.
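The slice-based half of that idea can be sketched roughly like this; `slice_iterator` here is a simplified stand-in for the PR's helper, shown on any sliceable source (for a Django queryset, each slice issues one bounded query and keeps prefetch rules intact):

```python
def slice_iterator(source, querysize=2000):
    """Yield records from a sliceable source in chunks of `querysize`,
    so at most one chunk is materialized in memory at a time.
    Sketch only; the real helper operates on Django querysets, where
    source[pos:pos + querysize] translates to a LIMIT/OFFSET query."""
    pos = 0
    while True:
        chunk = list(source[pos:pos + querysize])  # one bounded fetch
        if not chunk:
            break
        yield from chunk
        pos += querysize

# Usage with a queryset would look like (not executed here):
# for obj in slice_iterator(SomeModel.objects.all(), querysize=2000): ...
```

The trade-off matches the comment above: bounded memory per chunk, at the cost of extra queries when the dependency descent re-evaluates the same queryset.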

jerch commented 2 years ago

With the new `slice_iterator` and a reasonable `COMPUTEDFIELDS_QUERYSIZE` setting, RAM usage can now be controlled pretty nicely. There is still a chance of higher consumption in deeply nested dependencies (we have to hold the symbols during the descent).
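Presumably the setting is configured like any other Django setting; a minimal sketch, with the value taken from the discussion below:

```python
# settings.py (sketch)
# Cap the number of records fetched per queryset chunk during
# computed-field updates; 2000 is the default mentioned in this PR.
COMPUTEDFIELDS_QUERYSIZE = 2000
```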

With the example from above and `COMPUTEDFIELDS_QUERYSIZE=2000` (the default value), usage never goes higher than 80 MB. On a side note, the runtime of the loop variant benefits the most from the smaller queryset footprints, with all fields desynced:

There is one more catch in the current loop variant: it saves the instance regardless of whether the fields were in sync. This could be optimized in the same way `bulk_updater` does it internally. The difference is rather huge when running `updatedata` with all fields in sync:

Comparing these numbers with the update numbers above, it seems that for `fast` the update writing took only ~20 s, while most of the time was spent querying the database for old values. In this light `bulk` looks pretty bad (writing takes ~4 min); for `loop` we don't know the difference yet.
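The skip-writes-when-in-sync optimization mentioned above can be sketched on plain dicts (the real code compares computed field values on model instances before saving; names here are illustrative):

```python
def save_if_desync(stored, computed):
    """Write only fields whose recomputed value differs from the stored
    one; return True when a write actually happened. Sketch of the
    'compare before save' idea, not the library's actual API."""
    changed = {f: v for f, v in computed.items() if stored.get(f) != v}
    if not changed:
        return False   # all fields already in sync: skip the save entirely
    stored.update(changed)
    return True

rec = {"name": "foo", "upper": "FOO"}
save_if_desync(rec, {"upper": "FOO"})   # in sync, nothing written
save_if_desync(rec, {"upper": "BAR"})   # desynced, field gets written
```

With all fields already in sync this turns every save into a no-op, which is where the huge runtime difference for `updatedata` comes from.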

Performance-wise, the holy grail for `updatedata` would be to figure out a BFS-flavored topsort on tables/model fields. For the example above, all it would in fact have to do is a single `update_dependent(Foo.objects.all())`, which finishes in 00:38 (all desync) resp. 00:24 (all sync) with `fast`. But that is very tricky to get right (not even possible in all cases) and needs fundamental changes in the graph handling, so it is left for a later exercise.
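The kind of ordering meant here can be sketched with Kahn's algorithm, which is exactly a BFS-flavored topsort; the model graph below is made up to mirror the Foo/Bar/Baz example, not taken from the library:

```python
from collections import deque

def topsort(deps):
    """Kahn's algorithm. deps maps each node to the set of nodes it
    depends on; returns an update order where every node comes after
    its dependencies, or None if the graph contains a cycle."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indeg = {n: len(deps.get(n, ())) for n in nodes}
    dependents = {n: [] for n in nodes}
    for n, ds in deps.items():
        for d in ds:
            dependents[d].append(n)
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in dependents[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order if len(order) == len(nodes) else None

# Bar's computed fields depend on Foo, Baz's on Bar:
assert topsort({"Baz": {"Bar"}, "Bar": {"Foo"}}) == ["Foo", "Bar", "Baz"]
```

Processing tables in such an order would let each model be updated exactly once; the `None` result for cyclic graphs illustrates why this cannot work for all dependency setups.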

jerch commented 2 years ago

The numbers above are for updating/cascading CFs only on the selected models (equivalent to calling `for model in models: update_dependent(model.objects.all(), update_fields=cfs_on_model)`), which is probably surprising/unwanted behavior. The old `updatedata` command simply did a full `for inst in qs: inst.save()` roundtrip without any field narrowing.

The last commit restores that behavior, as it is much easier to comprehend (but also much, much more expensive for the loop variant).

At this point I think the semantics of `updatedata` should stick to the "trigger updates FROM all fields" idea, as it is directly compatible with `for model in models: update_dependent(model.objects.all())`.

A third variant would do full "horizontal" updates on models without any descent (in conjunction with the proper topsort from above, probably also the fastest one). Big limitation: it would only work for non-cyclic deps, and only for full project-wide updates; anything less would introduce desyncs from `updatedata` itself.