Some things apparently don't survive serialization ("de-marshaling"?)

greghendershott commented 3 years ago

Maybe the proper terminology here is "de-marshaling" (from byte code)?

Anyway this is what I'm seeing (possibly doing something wrong):

experiment.rkt:

#lang racket/base

(require racket/path
         racket/runtime-path
         syntax/modread
         drracket/check-syntax)

;; Let's fully expand a source file in a fresh namespace, and use the
;; current-compile / quote-syntax approach to serialize that to `ob`.
(define ob (open-output-bytes))
(define-runtime-path src-path "experiment.rkt") ;e.g. this file
(define dir (path-only src-path))
(parameterize ([current-load-relative-directory dir]
               [current-directory               dir])
  (with-module-reading-parameterization
    (λ ()
      (with-input-from-file src-path
        (λ ()
          (port-count-lines! (current-input-port))
          (define stx (read-syntax))
          (parameterize ([current-namespace (make-base-namespace)])
            (define exp-stx (expand stx))
            (println (syntax-property exp-stx 'module-body-context)) ;; print this to compare below
            (define compiled ((current-compile) `(,#'quote-syntax ,exp-stx) #f))
            (write compiled ob)))))))

;; Now, in a fresh namespace -- not the one in which we originally
;; expanded -- let's read and eval the bytes, and see if some things
;; were preserved.
(parameterize ([current-namespace (make-base-namespace)])
  ;; This indeed works, as far as loading a syntax object.
  (define exp-stx (eval (parameterize ([read-accept-compiled #t])
                          (read (open-input-bytes (get-output-bytes ob))))))
  ;; [Problem 1]: Are all of the syntax properties preserved? Alas, no, this
  ;; returns #f:
  (println (syntax-property exp-stx 'module-body-context))
  ;; [Problem 2]: Can we use the loaded expanded syntax to, for
  ;; example, give to drracket's show-content? Alas, no, this errors:
  ;;
  ;; namespace mismatch: bulk bindings not found in registry for module: #<resolved-module-path:"/home/greg/racket/8.0.0.12/share/pkgs/drracket-tool-lib/drracket/check-syntax.rkt">
  (show-content exp-stx
                #:fully-expanded? #t
                #:namespace       (current-namespace)))

This prints 2 values plus an error:

#<syntax:/home/greg/src/racket/pdb/experiment.rkt:1:6 racket/base>
#f
; namespace mismatch: bulk bindings not found in registry for module: #<resolved-module-path:"/home/greg/racket/8.0.0.12/share/pkgs/drracket-tool-lib/drracket/check-syntax.rkt">
; Context (plain; to see better errortrace context, re-run with C-u prefix):
;   /home/greg/racket/8.0.0.12/share/pkgs/drracket-tool-lib/drracket/private/syncheck/traversals.rkt:184:2 level+tail+mod-loop
;   /home/greg/racket/8.0.0.12/share/pkgs/drracket-tool-lib/drracket/private/syncheck/traversals.rkt:184:2 level+tail+mod-loop
;   /home/greg/racket/8.0.0.12/share/pkgs/drracket-tool-lib/drracket/private/syncheck/traversals.rkt:48:10 expanded-expression
;   /home/greg/racket/8.0.0.12/share/pkgs/drracket-tool-lib/drracket/check-syntax.rkt:56:0 show-content
;   /home/greg/racket/8.0.0.12/collects/racket/contract/private/arrow-val-first.rkt:555:3

greghendershott commented 3 years ago

In my experience, to do anything useful with fully-expanded code you need the namespace in which it was originally expanded. Which means the only "cache" that works is in-memory -- the namespace and the expanded syntax, both. But that chews through memory very quickly. (e.g. https://github.com/greghendershott/racket-mode/issues/512)

As a result I've shifted to thinking about ways to run all the analyses up-front, eagerly, and save the interesting results in an on-disk database. Which is what I started to experiment with in https://github.com/greghendershott/pdb (saving definitions and references).

If it is indeed possible (or could become possible) to fully serialize the fully-expanded code and other information from the namespace and/or module registry, that would be great!
If not, then we could also explore a way for tools to "register" a "hook" to be called when expansion of a file is complete. The hook would get the syntax object for the fully-expanded code, plus the namespace. Each tool could then do whatever analysis it wants to do, and serialize the results however it wants.

mflatt commented 3 years ago

It would be possible to serialize syntax objects in a way that preserves bulk bindings, at the expense of not sharing the exporting module's information when the syntax objects are deserialized, but non-sharing may be what you want here.

I imagine that non-preserved syntax properties like 'origin are also an issue. That seems a little trickier, since some non-preserved syntax properties are probably non-serializable. If the serialization function took a set of non-preserved keys to treat as preserved, would that work?

greghendershott commented 3 years ago

> It would be possible to serialize syntax objects in a way that preserves bulk bindings, at the expense of not sharing the exporting module's information when the syntax objects are deserialized, but non-sharing may be what you want here.

That sounds good.

I don't understand what "the exporting module's information" means, so I don't know whether to think that's good or bad.

> I imagine that non-preserved syntax properties like 'origin are also an issue. That seems a little trickier, since some non-preserved syntax properties are probably non-serializable.

A quick glance at traversals.rkt shows it uses a half dozen or so syntax properties. I don't know how many of those are serializable.

> If the serialization function took a set of non-preserved keys to treat as preserved, would that work?

I think so?

I'm not sure how to handle ones that turn out to be non-serializable. Maybe there needs to be required-keys which if non-serializable raises an exception, and optional-keys where it just skips --- or something like that?

I'm just guessing there might be uses where it's acceptable to proceed with some missing. (I'm not sure if that applies to drracket/check-syntax; @rfindler knows better if an "incomplete" analysis is better than nothing -- or worse than nothing. But I think Robby's idea was to build something that could also support other uses.)

greghendershott commented 3 years ago

> I'm not sure how to handle ones that turn out to be non-serializable. Maybe there needs to be required-keys which if non-serializable raises an exception, and optional-keys where it just skips --- or something like that?

Not to over-think this, but I can imagine values that aren't serializable -- but a function could transform them into a value that is. The substitute value might be "impoverished", but it might be better than nothing, and enough to support some use case.

So maybe the "ideal" would be something in the spirit of the not-found argument to hash-ref. But in this case, a not-serializable argument, which could be:

- Not supplied, which raises an exception.
- A non-function value, which is returned.
- A function. Here I guess the function is called with the value, and should return a serializable value instead, which is what gets written.

Like I said, maybe over-thinking it.
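
The three cases for a not-serializable argument could be sketched roughly as below. This is only an illustration of the dispatch, not anything that exists in Racket: the names prepare-property-value, simple-serializable?, and unsupplied are hypothetical, and the serializability check is a deliberately crude stand-in that accepts only a few atoms and pairs of them.

```racket
#lang racket/base

;; Hypothetical sentinel meaning "no not-serializable argument was given".
(define unsupplied (string->uninterned-symbol "unsupplied"))

;; Crude stand-in for a real serializability test (an assumption for this
;; sketch): accepts symbols, strings, numbers, booleans, '(), and pairs of those.
(define (simple-serializable? v)
  (or (symbol? v) (string? v) (number? v) (boolean? v) (null? v)
      (and (pair? v)
           (simple-serializable? (car v))
           (simple-serializable? (cdr v)))))

;; Hypothetical helper: decide what to write for a property value `v`.
;; - serializable value: written unchanged
;; - `not-serializable` not supplied: raise an exception
;; - `not-serializable` is a procedure: call it with `v`, write its result
;; - otherwise: write `not-serializable` itself
(define (prepare-property-value v [not-serializable unsupplied])
  (cond
    [(simple-serializable? v) v]
    [(eq? not-serializable unsupplied)
     (raise-argument-error 'prepare-property-value "serializable value" v)]
    [(procedure? not-serializable) (not-serializable v)]
    [else not-serializable]))
```

For example, (prepare-property-value (current-output-port) #f) would fall back to #f, while passing (λ (v) (format "~a" v)) would substitute an "impoverished" string description of the port. Unlike the failure thunk of hash-ref, the function here receives the offending value, as suggested above.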

rfindler commented 3 years ago

It surely seems like this library could publish which properties it serializes and we could add more over time, as they became needed/useful. That's at least a minimum choice that sounds workable. There may be better choices tho.

I think all of the syntax properties that check syntax currently uses are serializable.

greghendershott commented 3 years ago

@rfindler Some of the syntax property values are identifiers. A piece of syntax that is identifier? can be serialized. But do we know "how much" of an identifier (in a syntax property value) is serialized by compile, and is recovered by the (eval (read __)) deserialization?

I'm wondering about information about an identifier beyond its symbol datum and srcloc --- things like scope, and operations like comparing identifiers for equality or giving them to identifier-binding.

(I'm not claiming it won't or can't work. I have a fuzzy understanding of what's involved. I'm genuinely asking, to double-check.)

mflatt commented 3 years ago

Serialization currently preserves all of that, except for "bulk bindings", which are included only by reference to a providing module. So, a key piece is keeping bulk bindings with the serialized object instead of just a reference to the module.

rfindler commented 3 years ago

What's inside a "bulk binding", roughly? Is it that when I call identifier-binding on an identifier, the answer might live in a "bulk binding" if the identifier is imported, and so serialization (without us doing something special) would lose that?

If so, given what the rest of the code in this repo is doing, it may make sense for us to maintain a similar kind of reference (since anytime we have a fully expanded thing of some module we also have all its imports too). Not sure how this would work exactly tho :)

mflatt commented 3 years ago

Yes, require bindings can be applied in "bulk" form, which not only binds N exports at a time, but shares binding-representation information among syntax objects in contexts that require from the same module. Sharing is just a constant-factor improvement in practice, though, and the resolution of shared information is deeply tied to the module-declaration machinery. So, it's probably better to avoid it for your purposes.

rfindler commented 3 years ago

Okay!

mflatt commented 3 years ago

I've added syntax-serialize and syntax-deserialize.

The #:provides-namespace argument gives you control over the use of bulk bindings. Set it to #f (or an empty namespace) to make a serialized form independent of bulk bindings. Or, if you decide to track dependencies and take advantage of bulk-binding sharing by loading module declarations into a namespace, you have relatively fine-grained control through #:provides-namespace.

The #:preserve-property-keys argument lets you specify extra property keys to treat as preserved.

I don't think you'll need #:base-module-path-index.
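
A minimal round-trip sketch of how these arguments might be used follows. It assumes a Racket version that includes the new functions, that syntax-serialize and syntax-deserialize are available from racket/base, that #:preserve-property-keys takes a list of symbols, and that the serialized form can be passed through s-exp->fasl from racket/fasl. The property key my-key and its value are made up for illustration.

```racket
#lang racket/base
(require racket/fasl)

;; A syntax object carrying a property that is NOT preserved by default.
;; (`my-key` is a made-up key for this sketch.)
(define stx (syntax-property #'(+ 1 2) 'my-key "hello"))

;; Serialize without relying on bulk bindings from any namespace, and
;; ask for `my-key` to be treated as if it were a preserved property.
(define serialized
  (syntax-serialize stx
                    #:provides-namespace #f
                    #:preserve-property-keys '(my-key)))

;; Assumption: the serialized form is plain data that fasl can encode,
;; so the bytes could just as well be written to a file or database.
(define bs (s-exp->fasl serialized))

;; Later (possibly in another process): decode and rebuild the syntax object.
(define stx2 (syntax-deserialize (fasl->s-exp bs)))

(println (syntax->datum stx2))
(println (syntax-property stx2 'my-key))
```

Passing #f for #:provides-namespace trades the sharing of bulk-binding information for a serialized form that does not depend on the providing modules being declared when it is read back, which is the trade-off discussed earlier in this thread.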

rfindler / fully-expanded-store
