mbutterick / pollen-users

please use https://forums.matthewbutterick.com/c/typesetting/ instead

Reuse namespace #43

Open sorawee opened 4 years ago

sorawee commented 4 years ago

I experimented with modifying Pollen to reuse a single namespace instead of creating a new one for every file. For pollen-tfl with one thread, the rendering time after raco pollen reset drops from 332s to 121s.

Of course, the behavior would be different. In particular, if files like pollen.rkt have a side-effect (say, mutate a global variable), then the side-effect would persist across rendering multiple files. However, for projects that don't have side-effects (which are probably the majority?), I think this is a performance boost for free?

Perhaps there should be an option to allow using the same namespace?

sorawee commented 4 years ago

Another idea that I want to throw out here.

I tried the following snippet in pollen-tfl after making Pollen reuse namespace:

#lang racket

(require (for-syntax racket/string
                     racket/format))

(define-syntax (gen-all stx)
  (define reqs
    (for/list ([f (directory-list)] #:when (string-suffix? (~a f) ".pm"))
      (with-syntax ([f (~a f)]
                    [x (gensym)])
        #'(begin (module x racket
                   (require f)
                   (println doc))
                 (require 'x)))))
  #`(begin #,@reqs))

(gen-all)

It takes 65s to run, without caching. This suggests that if we have a better dependency manager, then we can require a bunch of files at once, saving the cost of dynamic-requires.

(The running time drops further, to 28s, after excluding toc.html.pm, because that file loads other files, which triggers dynamic-require. The dynamic-require cost would be eliminated with a good dependency manager and caching.)

mbutterick commented 4 years ago

If you can make Pollen go faster, great. Making it go faster while preserving its features is the hard part, I have found.

I think this is a performance boost for free?

This is what I thought when I added parallel rendering. It was less true than I hoped. 🤯

Removing steps from an expensive computation is a great way to save time. But it’s only “free” if you know for sure that skipping those steps never leads to incorrect results. Attaching permanent caveats — “it works, if you know that X Y and Z are true” — leads to despair, which is not free.

IIRC the reason fresh namespaces were necessary was to support dynamic re-evaluation during an interactive project server session. Otherwise, dynamic-require caches its results, and you have to restart the project server to see changes.
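
The caching behavior described above can be seen in a small standalone sketch. It assumes nothing about Pollen's internals: the temp-file module is a hypothetical stand-in for a stateful pollen.rkt, and call-next! is an illustrative helper.

```racket
#lang racket/base
(require racket/file)

;; Hypothetical stand-in for a stateful pollen.rkt, written to a temp file
;; so the example is self-contained.
(define mod-path (make-temporary-file "counter~a.rkt"))
(display-to-file
 (string-append "#lang racket/base\n"
                "(provide next!)\n"
                "(define x 0)\n"
                "(define (next!) (set! x (add1 x)) x)\n")
 mod-path
 #:exists 'truncate)

;; Load the module in namespace ns and call its exported counter once.
(define (call-next! ns)
  (parameterize ([current-namespace ns])
    ((dynamic-require mod-path 'next!))))

;; One shared namespace: the module is instantiated once, so state persists.
(define shared (make-base-namespace))
(define reused (list (call-next! shared) (call-next! shared)))

;; A fresh namespace per call: the module is re-instantiated each time.
(define fresh (list (call-next! (make-base-namespace))
                    (call-next! (make-base-namespace))))

(delete-file mod-path)
```

Here reused is '(1 2) and fresh is '(1 1): within one namespace, the second dynamic-require gets the already-instantiated module, which is also why edits to a source file would go unnoticed without a fresh namespace.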

You may be right, however, that certain simplifications are possible during a non-interactive session (say, when using raco pollen render, because we can safely assume that the source is not changing from start to finish.)

mbutterick commented 4 years ago

For instance. The reason, say, Scribble can be faster on large documents is that all the component source files are pulled into one master source — this one source is compiled & evaluated — and then multiple pages are emitted as output. Pollen, by contrast, has one source per output file, each separately evaluated.

OTOH Scribble can do this because it exerts more control over how the document is structured. You can import your own functions to a Scribble source. But it doesn’t permit the granularity of control that Pollen does. Costs vs. benefits.

I’ve considered, at least, whether Pollen could similarly “gang” files together and consolidate evaluations. For instance, by packing a number of source files into another as submodules. But I don’t see why this would change anything, aside from repositioning the pieces on the board. A module has the same evaluation costs regardless if it’s a submodule or standalone source file.

As a middle approach, I’ve also considered whether Pollen could introduce a concept of a one-to-many page. This would be faster to evaluate, because it would be a single evaluation (like a Scribble source). But it would be a distinct concept within Pollen from the current preprocessor / Markdown / markup files.

The problem with a one-to-many file type is that it makes dynamic refresh annoying, because now you have to refresh a possibly huge source in order to refresh one small part.

mbutterick commented 4 years ago

The other issue with one-to-many page generation is that I have never wanted this once for my own work. For me, the value of Pollen is exactly that it is so luxuriously indulgent. Every page triggers a full program evaluation! Where else can you get this? Nowhere.

By contrast, the one-to-many publishing model is well covered by other tools — Scribble, or Frog, or a zillion other static-site generators beyond.

So, though I am always interested in making Pollen faster, it only makes sense if the new technique supports the core theory of operation. Which is why, so far, I have focused more on file-based caching and more recently parallel processing. I’m sure there are other good ideas yet to be discovered.

mbutterick commented 4 years ago

if files like pollen.rkt have a side-effect (say, mutate a global variable), then the side-effect would persist across rendering multiple files.

What would be a test case that demonstrates this behavior? The fix in #49 doesn’t break any existing Pollen tests, nor any of my own projects. Moreover, Pollen doesn’t guarantee a clean namespace for rendering — like I say, it’s more of a necessity to support dynamic refresh during an interactive session.

My hunch is that the situation doesn’t arise much in the wild, because Racket naturally deters use of global variables and mutation.

sorawee commented 4 years ago

Consider:

;; a.html.pm
#lang pollen
;; b.html.pm
#lang pollen
;; pollen.rkt
#lang racket
(provide root)
(define x 0)
(define (root . xs)
  (set! x (add1 x))
  (number->string x))

Prior to the namespace reuse, raco pollen render . will create the following files:

<html><head><meta charset="UTF-8"/></head><body>1</body></html>
<html><head><meta charset="UTF-8"/></head><body>1</body></html>

After the namespace reuse, raco pollen render . will create the following files:

<html><head><meta charset="UTF-8"/></head><body>1</body></html>
<html><head><meta charset="UTF-8"/></head><body>2</body></html>

mbutterick commented 4 years ago

I think I would call this a case of nondeterministic compilation, in which case Pollen’s guarantees needn’t be any stronger than Racket’s. For instance, if we convert these files to Racket modules, we’d get the same weird behavior:

;; a.rkt
#lang racket
(require "base.rkt")
(provide x)
(define x (f))
(println x)
;; b.rkt
#lang racket
(require "base.rkt")
(provide x)
(define x (f))
(println x)
;; base.rkt
#lang racket
(provide f)
(define x 0)
(define (f)
  (set! x (add1 x))
  (number->string x))

Suppose these all live in collection foo. Running racket -l foo/a or racket -l foo/b will print 1. But running racket and then doing (require foo/a) and (require foo/b) (or vice versa) will produce 1 then 2.

sorawee commented 4 years ago

I think mutation like this is quite common when one wants to communicate across tags, e.g., when making footnotes. There's a way to make it work by dealing with things in the root function instead, but that requires restructuring the whole program. As a concrete example of how this kind of mutation is useful:

;; a.html.pm
#lang pollen
◊inc-x[] or ◊inc-x[]
;; b.html.pm
#lang pollen
◊inc-x[] and ◊inc-x[]
;; pollen.rkt
#lang racket
(provide inc-x)
(define x 0)
(define (inc-x)
  (set! x (add1 x))
  (number->string x))

This would work prior to namespace reuse, with a.html having content "1 or 2" and b.html having content "1 and 2". However, after namespace reuse, it would be "1 or 2" and "3 and 4".

Note that I'm not saying that producing "1 or 2" and "3 and 4" is wrong. It's acceptable behavior, but there should be a way to produce "1 or 2" and "1 and 2".

One easy way to fix this problem is to create a tag named reset that does (set! x 0) and put reset at the beginning of every Pollen file, but there's an alternative approach that I like more.

One feature that I think will be very useful is some sort of #%module-begin macro for Pollen programs (it actually doesn't need to be a macro, see details below). Right now, the topmost level root effectively must be a function because of how it's used: (apply root-proc xs) at https://github.com/mbutterick/pollen/blob/master/pollen/private/main-base.rkt#L34.

However, this means root will be called as the last function in Pollen program evaluation. Sometimes, though, what I want is the ability to have root set things up. So my workaround is the following:

;; pollen.rkt
#lang racket

(provide (all-defined-out))
(require racket/splicing)

(define current-x (make-parameter 0))

(define-syntax-rule (my-root xs ...)
  (splicing-parameterize ([current-x 0])
    xs ...))

(define (inc-x)
  (current-x (add1 (current-x)))
  (number->string (current-x)))

Then:

;; a.html.pm 
#lang pollen

◊my-root{
  ◊inc-x[] or ◊inc-x[]
}
;; b.html.pm
#lang pollen

◊my-root{
  ◊inc-x[] and ◊inc-x[]
}
}

will deterministically produce:

<html><head><meta charset="UTF-8"/></head><body><root>1 or 2</root></body></html>
<html><head><meta charset="UTF-8"/></head><body><root>1 and 2</root></body></html>

But as you can see, I need to wrap everything in my-root to make this work. It would be nice if Pollen had a special symbol like root whose dynamic extent covers the entire Pollen program evaluation.

But OK, perhaps a macro is too demanding. Another possibility is thunking. That is, my-root will consume an argument f which, when invoked, will evaluate the Pollen program. To make it consistent with the current behavior, my-root by default would be:

(define (my-root f) (f))

But users are allowed to override my-root to something like:

(define (my-root f)
  (parameterize ([current-x 0]) (f)))

sorawee commented 4 years ago

Note: I edited the above comment a lot. You might want to read it from GitHub instead of email.

mbutterick commented 4 years ago

One easy way to fix this problem is to create a tag named reset that does (set! x 0) and put reset at the beginning of every Pollen file

Yes — moreover, this is the Rackety way to go about it, and using fresh namespaces would be both perverse and slow.
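
A minimal sketch of the reset approach, with the two pages simulated as plain expressions evaluated in one shared namespace. The names reset, page-a, and page-b are illustrative only; in a real project the pages would be Pollen sources that call the reset tag first.

```racket
#lang racket/base

(define x 0)

;; Tag function that bumps the shared counter.
(define (inc-x)
  (set! x (add1 x))
  (number->string x))

;; Hypothetical reset tag: clears the counter and contributes no text.
(define (reset)
  (set! x 0)
  "")

;; Stand-ins for a.html.pm and b.html.pm, each starting with reset.
(define page-a (string-append (reset) (inc-x) " or " (inc-x)))
(define page-b (string-append (reset) (inc-x) " and " (inc-x)))
```

Because each page resets the counter itself, page-a is "1 or 2" and page-b is "1 and 2" even though both share the same x.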

However, this means root will be called as the last function in Pollen program evaluation. Sometimes, though, what I want is the ability to have root set things up

The idea of a function named, say, init that can be used for setup tasks at the start of a page render is interesting. But something like #%module-begin is a little different. Can’t you already do that, by making your own Pollen-derived #lang?

sorawee commented 4 years ago

The idea of a function named, say, init that can be used for setup tasks at the start of a page render is interesting. But something like #%module-begin is a little different. Can’t you already do that, by making your own Pollen-derived #lang?

init would suffice for the (set! x 0) solution, but not for the parameterize solution. When I thought about this, I wanted to find the most general solution that can be used in various settings. That being said, if you think init would be more suitable, I would welcome it. It's better than nothing.

mbutterick commented 4 years ago

I’m not averse to something like #%root-begin — I just try to avoid macro solutions where possible. I’ll think about how it could be done (unless you want to prototype it into a PR)

sorawee commented 4 years ago

But OK, perhaps a macro is too demanding. Another possibility is thunking. That is, my-root will consume an argument f which, when invoked, will evaluate the Pollen program. To make it consistent with the current behavior, my-root by default would be:

(define (my-root f) (f))

But users are allowed to override my-root to something like:

(define (my-root f)
  (parameterize ([current-x 0]) (f)))

Would this be acceptable?
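
To make the thunking proposal concrete, here is a minimal sketch in plain Racket. It assumes a hypothetical renderer that hands each page's evaluation to my-root as a thunk; page-a and page-b are stand-ins for the evaluated Pollen sources, and none of this is existing Pollen API.

```racket
#lang racket/base

(define current-x (make-parameter 0))

;; Tag function: bumps the counter in the current parameterization.
(define (inc-x)
  (current-x (add1 (current-x)))
  (number->string (current-x)))

;; User-overridden my-root: every page starts from a fresh counter.
(define (my-root f)
  (parameterize ([current-x 0]) (f)))

;; Hypothetical page bodies, delayed as thunks.
(define (page-a) (string-append (inc-x) " or " (inc-x)))
(define (page-b) (string-append (inc-x) " and " (inc-x)))

(define out-a (my-root page-a))
(define out-b (my-root page-b))
```

Here out-a is "1 or 2" and out-b is "1 and 2": the mutation inside each parameterize affects only that page's binding, so nothing leaks to the next page.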

mbutterick commented 4 years ago

Why not try moving root to a position where it can be either a function or a macro? That was your first suggestion. That seems more flexible than the thunking idea.

otherjoel commented 4 years ago

Just chiming in to say I do use mutable hash tables in my pollen.rkt for footnotes and link references, and when doing parallel renders many of my pages now have footnotes from other pages.

However, I’m not complaining or asking to revert. I am persuaded that the new way has benefits. I just want to understand what the implications are right now for state that I want preserved between tag function calls but not across pages when doing parallel renders. Are parameters no longer sufficient for this purpose?

I understand that refactoring to deal with everything inside root is one way to do this; I could also prefix my hash keys with some unique per-page value (like here-path) to isolate each page’s values from each other (specifically in the case of hash tables).

mbutterick commented 4 years ago

Right — you’ll need to manage the state for each page explicitly, rather than relying on that behavior as a side effect of fresh namespaces.

In general, using here-path to key this data is a good idea, since that's guaranteed to be unique for each source file.

Concatenating the keys would work, though it makes per-page queries a little messy. One could also convert a footnote hash into a hash with subhashes: the top level is indexed by here-path, and then the subhashes are indexed by footnote number.

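#lang racket
(require pollen/core)
(provide fn)

;; Top level: here-path -> per-page hash. Per-page hash: footnote number -> text.
(define fn-hash (make-hash))

(define (fn txt)
  ;; here-path identifies the source file currently being rendered
  (define page-path (hash-ref (current-metas) 'here-path))
  ;; fetch this page's subhash, creating it on first use
  (define fn-hash-page (hash-ref! fn-hash page-path make-hasheq))
  ;; the next footnote number is one more than the count so far on this page
  (define fn-count (add1 (hash-count fn-hash-page)))
  (hash-set! fn-hash-page fn-count txt)
  (format "~a is fn ~a" txt fn-count))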
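
A self-contained sketch of how this keying isolates pages that share one namespace. It assumes a local parameter as a stand-in for Pollen's current-metas so it runs without Pollen installed; render-page is a hypothetical helper, not part of Pollen.

```racket
#lang racket/base

;; Stand-in for Pollen's current-metas (assumption, for self-containment).
(define current-metas (make-parameter (hasheq)))

;; here-path -> (footnote number -> text)
(define fn-hash (make-hash))

;; Record a footnote for the current page and return its number.
(define (fn txt)
  (define page-path (hash-ref (current-metas) 'here-path))
  (define page-notes (hash-ref! fn-hash page-path make-hasheqv))
  (define n (add1 (hash-count page-notes)))
  (hash-set! page-notes n txt)
  n)

;; Simulate rendering one page: bind its metas, then call the tag function.
(define (render-page path texts)
  (parameterize ([current-metas (hasheq 'here-path path)])
    (for/list ([t (in-list texts)]) (fn t))))

(define a-numbers (render-page "a.html.pm" '("first" "second")))
(define b-numbers (render-page "b.html.pm" '("third")))
```

Footnote numbering restarts for each page (a-numbers is '(1 2), b-numbers is '(1)) even though fn-hash itself is shared.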

sorawee commented 4 years ago

Why not try moving root to a position where it can be either a function or a macro? That was your first suggestion. That seems more flexible than the thunking idea.

I think it's the same reason that root exists in the current Pollen. One hypothetical design of Pollen would require people to wrap the whole content in a top-level tag explicitly instead of relying on the implicit root, but that would be very tedious, and that's why I think you chose to use the implicit root tag instead.

otherjoel commented 4 years ago

I’ve been testing and working to ensure I fully understand the implications of this change. Tell me if I have this correct:

And finally:

As to this last bit, consider this MVE. I could not find a sequence of raco pollen commands that would get the (template) line of the rendered output to say anything other than Result: 1.

mbutterick commented 4 years ago

The fresh namespace is only necessary in the project-server context, because that’s the only way to make sure that all updated source files (incl "pollen.rkt") are properly incorporated in a render (originally it was the fix for https://github.com/mbutterick/pollen/issues/64).

Your description of the behavior seems right except for the last point. It would be more accurate to say that after this change, a Pollen source may or may not be evaluated in its own namespace, just as currently, it may or may not be evaluated in parallel. In both cases, the programming should not depend on any side effects of these environments.

That said, one can still avoid parallel processing — possibly useful for projects that want a guaranteed evaluation order. Likewise, I could add a command-line switch or setup value to restore the fresh-namespace behavior for those who prefer the consistency.

Your code example depends on mutation of a global variable, which is always going to be troublesome.