skeeto / emacs-memoize

Elisp memoization functions
The Unlicense
54 stars 16 forks source link

Add: By-key memoization #10

Open alphapapa opened 6 years ago

alphapapa commented 6 years ago

Hi Chris,

It was suggested that I add caching to org-ql, so I tried using memoize-by-buffer-contents, but unfortunately it proved very slow in the buffers I was testing on, taking more than twice as long as the unmemoized version, and spending about half its time in GC. I guess that's the price paid to hash buffer-string.

So I came up with a version that uses buffer-modified-tick to determine whether a buffer has changed, and it works very well, reducing the runtime on my test from 0.876 seconds to 0.006 seconds.

I added it to memoize as memoize-by-key, which takes a key function, which returns a key to be used in the hash table. I use it like this:

(memoize-by-key #'org-ql--select (lambda (&rest args)
                                   (list (current-buffer) (buffer-modified-tick))))

It's working very well, so I hope you'll accept this PR. I can imagine this being useful in many other packages as well, like my own helm-org-rifle, or anything that returns a value based on buffer contents.

While I was at it, I rewrote the readme, as there didn't seem to be a natural place to mention the new function.

Thanks.

alphapapa commented 6 years ago

Hashing byte-compiled functions

Well, somewhat orthogonal to this PR, I realized that I made a mistake in my own code that uses this PR's code. I need to include the key function's args in the key. "Normally" this wouldn't be a problem, but in my case, one of those args is a byte-compiled function, and unfortunately, byte-compiled functions do not hash to the same value every time! (I'm guessing you already know this and know why, so maybe you can help me.)

For example:

(sxhash '(:predicate
          #[nil "\300\301!\205\0\302\303!\205\0\302\304!\207"
                [todo "UNDERWAY" regexp "Helm" "Emacs"]
                2]
          :action
          #[nil "\300\301 !\207"
                [org-element-headline-parser line-end-position]
                2]
          :narrow nil))  ;; => 4033673295264
(sxhash '(:predicate
          #[nil "\300\301!\205\0\302\303!\205\0\302\304!\207"
                [todo "UNDERWAY" regexp "Helm" "Emacs"]
                2]
          :action
          #[nil "\300\301 !\207"
                [org-element-headline-parser line-end-position]
                2]
          :narrow nil))  ;; => 8684217789088

Or:

(sxhash (byte-compile (lambda ()
                        (org-element-headline-parser (line-end-position)))))  ;; => 44696449
(sxhash (byte-compile (lambda ()
                        (org-element-headline-parser (line-end-position)))))  ;; => 59007205

This is a bit frustrating, because it breaks caching for me.

Can you shine any light on this, and perhaps suggest a workaround?

Expiring from cache

For this PR's code, I also realized that this by-key hash-table might grow forever. Do we need to expire old entries? I don't know exactly how "weakness" works, so I don't know if that would be suitable here.

Thanks.

yantar92 commented 6 years ago

Hashing byte-compiled functions

Printed representation of the byte-compiled functions leads to consistent hash.

(sxhash (prin1-to-string '(:predicate
               #[nil "\300\301!\205�\0\302\303!\205�\0\302\304!\207"
                 [todo "UNDERWAY" regexp "Helm" "Emacs"]
                 2]
               :action
               #[nil "\300\301 !\207"
                 [org-element-headline-parser line-end-position]
                 2]
               :narrow nil))) ;; => -1492828891608187395 (#o255103200411520176775, #x2b4868084d40fdfd)
(sxhash (prin1-to-string '(:predicate
               #[nil "\300\301!\205�\0\302\303!\205�\0\302\304!\207"
                 [todo "UNDERWAY" regexp "Helm" "Emacs"]
                 2]
               :action
               #[nil "\300\301 !\207"
                 [org-element-headline-parser line-end-position]
                 2]
               :narrow nil))) ;; => -1492828891608187395 (#o255103200411520176775, #x2b4868084d40fdfd)

Expiring from cache

There is the expiry mechanism in memoize--wrap. Probably, it should be also added to the memoize-by-buffer-contents--wrap and this PR.

alphapapa commented 6 years ago

Printed representation of the byte-compiled functions leads to consistent hash.

Thanks, that is a clever solution. I should be able to use that in a custom hash-test function in org-ql.

There is the expiry mechanism in memoize--wrap. Probably, it should be also added to the memoize-by-buffer-contents--wrap and this PR.

That might be a good idea. However, some use cases might want values to only expire when the key changes, rather than after a timeout. For example, in my case, time is irrelevant, and results should be cached as long as the buffer is open.

I guess what I need is some kind of multi-stage cache, because the logic is something like this:

  1. Find cached results for (list (current-buffer) (buffer-modified-tick)).
  2. In those results, find cached results for args.

If a buffer is closed, its results should be invalidated. If a buffer's modified-tick changes, those results should be invalidated.

So a "dumb" cache that only expired by timeout would prevent memory leaks, but it would unnecessarily invalidate results, making the cache less effective.

Ideally we would come up with a solution that would be generic enough to fit in this package while allowing me to customize it for my needs. Maybe I need to think about it some more. But help is welcome. :)

Thanks.

npostavs commented 6 years ago

byte-compiled functions do not hash to the same value every time!

This looks like a bug in Emacs, sxhash treats non-eq but equal byte-compiled functions differently.

(let ((obj1 (byte-compile (lambda (x) x)))
      (obj2 (byte-compile (lambda (x) x))))
  (list (equal obj1 obj2)
        (eq obj1 obj2)
        (= (sxhash obj1)
           (sxhash obj2))))
;=> (t nil nil)

Whereas (elisp) Defining Hash says:

 If two objects OBJ1 and OBJ2 are equal, then ‘(sxhash OBJ1)’ and
 ‘(sxhash OBJ2)’ are the same integer.
alphapapa commented 6 years ago

For org-ql I ended up implementing caching myself, because I think it's probably too specific for memoize to handle. https://github.com/alphapapa/org-ql/commit/e728dec64e1d0bf3f041dc15e7ba56d1fba40903

However, the memoize-by-key idea still seems like it could be generally useful, so, Chris, if you are interested, I will continue to work on the PR until you think it's ready to merge. Let me know.

@npostavs Thanks, I guess I'll have to report that...

Edit: Bug reported.

skeeto commented 6 years ago

I like this interface. It's a lot like predicate dispatch. (Though I'm on the fence whether predicate dispatch is elegant or rarely ever worth its cost. Probably both.)

The :test argument to make-hash-table is not a function, even though it really looks like it. It's really just a symbol choosing from a limited set of built-in comparison functions. As such, those "#'equal" should just be "'equal" since it's not actually referring to the function.

I'd like something more correct than "memoize-by-key--nil" since it's so easy to fix. What if the user (who is obviously intentionally trying to be difficult for us) wants to memoize over that specific symbol? The package doesn't really "own" that symbol as a value despite the prefixing. By convention a package owns the global/dynamic variable and function namespaces with its prefix, but not necessarily the interned symbols themselves.

Instead, create a unique, uninterned symbol (via make-symbol) alongside the hash table, right there in the let. Then it's a value truly owned by the package and is not a legitimate return value for other functions.

Also, could you add an test for memoize-by-key? (In case you weren't aware, you can run "make test" to run the existing tests.)

Fix those three things (sharp-quote, nil value, and a test) and I'll accept this PR.

As @npostavs pointed out, that is indeed an Emacs bug. For function objects, sxhash derives from the pointer address of the object (see src/fns.c:4348 in 26.1), but equal actually examines the object. Instead sxhash must hash function objects just as it does vectors in order to be consistent with equal. Until then they cannot be used as hash table keys.

@yantar92's suggestion with printing keys containing function objects is a useful workaround until this gets fixed in Emacs. Anyone want to open an Emacs bug for this?

The use of key weakness in memoize-by-buffer-contents is wrong, so just don't model any cache-clearing strategy after that. It "works" but it generally clears entries too aggressively (on every GC event).

The cache clearing issue is where a generic memoization solution really falls apart, and a custom tailored solution is more appropriate than a memoization library. The proper time to clear cache entries depends entirely on the semantics and usage of the memoized function, and it isn't really something that can be automated or tuned for a one-size-fits-all.

That being said, I think the timer stuff in memoize--wrap is pretty much the best we can do, so, as you pointed out, it should probably be added here, too. I'll still accept the PR without it, though.

alphapapa commented 6 years ago

Fix those three things (sharp-quote, nil value, and a test) and I'll accept this PR.

Sounds good.

Anyone want to open an Emacs bug for this?

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=32503

That being said, I think the timer stuff in memoize--wrap is pretty much the best we can do, so, as you pointed out, it should probably be added here, too. I'll still accept the PR without it, though.

Sounds good. I'll see if I can get it working. :)

Thanks, Chris.

alphapapa commented 3 years ago

Update: In 2020, this bug was filed, which is about this issue more generally: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=38912 It was merged with the report I filed, bug#32503. Then, a couple of weeks ago, Lars noted that the ECM no longer fails on Emacs 28, so it may have been fixed by changes to master: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=38912#109

skeeto commented 3 years ago

Adam, would you be interested in taking ownership of this library? We can update the MELPA recipe to point to a new location. I'd like to just set mine as archived. Any final change would just point the README at the new location, if any.

I originally wrote this nearly a decade ago, and since then my mind has turned strongly against libraries like this. IMHO, nobody should be using this package:

alphapapa commented 3 years ago

Thank you for asking, but I agree with your assessment. And although I've been interested in this package, I don't actually use it in any of my published packages. So I think it would be fine to archive it, perhaps suggesting it as an example of how memoization can be done in Elisp while noting the limitations.

djr7C4 commented 3 years ago

I wanted to comment here because I feel that this library still has its uses. Of course, if you as the author no longer find it useful and do not wish to maintain it then you shouldn't.

First, I think that reinventing the wheel for small dependencies also has its downsides such as duplication of effort, more bugs than if an established solution was used and more maintenance required.

I looked at the benchmarks linked above. It is indeed surprising that apply is so expensive and that does seem to limit the usefulness of this package for computationally expensive recursive functions. However, that is not the only case where this package is useful. For functions that have one (non-recursive call that is expensive), this package is still quite useful.

For example, I've used it mainly for functions that communicate over the network with a remote server as well as functions that have to create fairly heavy external processes that always return the same result. In these cases, the fact that apply is expensive is not a big deal and I have gotten huge performance gains from using this package. It would also be good for computationally expensive functions that have a few high cost calls.

This being said, I think that the idea you mentioned of using a static macro instead of a closure makes a lot of sense and seems like it would be better than the current solution given the limitations of apply.

alphapapa commented 3 years ago

I just stumbled across a few uses of memoize in my Emacs config that I'd forgotten about:

(require 'memoize)
  (defalias 'ap/directory-files-memoized
    (memoize (symbol-function 'directory-files) "1 minute"))

  ;; based on https://github.com/abo-abo/helm-org-wiki/blob/master/helm-org-wiki.el
  (defvar ap/helm-source-bindir
    `((name . "~/.bin")
      (candidates . ,(lambda ()
                       (ap/directory-files-memoized "~/.bin" t directory-files-no-dot-files-regexp)))
      (action . ,(lambda (x)
                   (find-file x)))
      (candidate-number-limit . 10)))

And:

(require 'memoize)
  (defalias 'ap/expand-projectile-project-files-memoized
    (memoize (symbol-function 'ap/expand-projectile-project-files) "5 minutes"))

  (defvar ap/helm-source-projectile-files-list
    ;; Copied from helm-source-projectile-files-list
    (helm-build-sync-source "Projectile files"
      :candidates (lambda ()
                    (ignore-errors
                      (ap/expand-projectile-project-files-memoized (projectile-project-root))))
      :fuzzy-match helm-projectile-fuzzy-match
      :keymap helm-projectile-find-file-map
      :help-message 'helm-ff-help-message
      :mode-line helm-read-file-name-mode-line-string
      :action helm-projectile-file-actions
      :persistent-action #'helm-projectile-file-persistent-action
      :persistent-help "Preview file"
      :candidate-number-limit 10)
    "Helm source definition for Projectile files.")

I put these in years ago and forgot about them. They significantly help the performance of my Helm "switch-to-anything" command by not enumerating files in these sources upon every invocation.

My Emacs config would be much poorer if not for this library (or I'd have had to write something like it myself). So maybe it shouldn't be archived, after all. Rather, maybe its pros and cons should be clearly documented so users can use it only when appropriate.

This doesn't mean that I'm volunteering to adopt the package as its sole maintainer. :) But I'd be glad to help as a co-maintainer.