pkulchenko / serpent

Lua serializer and pretty printer.
Other
549 stars 77 forks source link

custom / __tostring printer results incorrectly put into quotes #10

Open fab13n opened 10 years ago

fab13n commented 10 years ago

I'm evaluating the feasibility of using Serpent as Metalua's pretty-printer. I'm stuck with the fact that the result of __tostring metamethods are put into quotes. There's a similar issue with custom printers. For instance:

token_mt = { }
function token_mt :__tostring()
    return string.format("token(%s, %s)", self.tag, self.value)
end

function token(tag, value)
    return setmetatable({tag=tag; value=value}, token_mt)
end

stream = { token("Keyword", "local"), token("Id", "x") }

s = require 'serpent'
print(s.block(stream))
-- incorrectly produces {"token(Keyword, local)", "token(Id, x)"} which can't be deserialized

pp = require'metalua.pprint'
pp.print(stream)
-- produces { token(Keyword, local), token(Id, x) }, readily accepted by loadstring().
agladysh commented 10 years ago

I'd say that __tostring results should be put into quotes — they are, indeed, strings.

Maybe __serialize would help? (Didn't look into it closely.)

fab13n commented 10 years ago

@agladysh Adding this produces the same result:

token_mt.__serialize=token_mt.__tostring
print(s.serialize(stream,{ }))

Besides, it could be debatable whether this is a bug for __tostring, but the same issue exists with custom printers.

pkulchenko commented 10 years ago

@fab13n, I agree, it may make sense to handle __serialize differently. I think for consistency, __tostring should still be quoted.

Is it be sufficient for your purposes if __serialize outputs its result as is?

fab13n commented 10 years ago

I think tostring results shouldn't be quoted by Serpent itself. Otherwise, we can't tell the difference between a generated output and the corresponding literal string. In the example above I couldn't tell the difference between the outputs corresponding to the two following lines:

stream1 = { token("Keyword", "local"), token("Id", "x") }
stream2 = { [[token("Keyword", "local")]], [[token("Id", "x")]] }
assert(s.line(stream1)==s.line(stream2))

Sure, the version with mandatory quotes is valid Lua, but a valid Lua that's semantically different from the original one. It's arguably more treacherous, rather than safer. If users want to add quotes, they can add them in their __tostring() implementation. But the point of a pretty-printer is, well, to print something pretty, or at least readable. Compilability is mandatory for serialization, not for display.

As for using serialization instead of pretty-printing when one intends to pretty-print in a prettier way, I find this rather confusing :-) Besides, I don't even know whether I can use readability-improving options, such as indentation, with s.serialize().

pkulchenko commented 10 years ago

Yes, I do see this point and has struggled with the same considerations in the beginning. The main goal was to make the output consistent between serialization and pretty-printing as it's convenient in many situations take the "printed" output and paste it into a script. For example, when you use a console in the IDE to execute commands, is the result serialization or pretty-printing? It's probably mostly the latter, but I've seen on several occasions people taking the output from the console and pasting it into their code.

stream1 = { token("Keyword", "local"), token("Id", "x") }
stream2 = { [[token("Keyword", "local")]], [[token("Id", "x")]] }
assert(s.line(stream1)==s.line(stream2))

Right, but these are different values and I wanted them to look different. Similar to that, there is ambiguity with nil vs "nil", true vs. "true" and so on (which doesn't exist with the current output).

I can add a parameter to disable quoting of string values and would prefer keep the current option as the default.

fab13n commented 10 years ago

On Mon, Dec 30, 2013 at 7:01 PM, Paul Kulchenko notifications@github.comwrote:

assert(s.line(stream1)==s.line(stream2))

Right, but these are different values and I wanted them to look different.

My point is, they are identical with the current implementation, and I want them to look different!

pkulchenko commented 10 years ago

My point is, they are identical with the current implementation, and I want them to look different!

I understand, but they only look the same because your token() call returns its own representation.

There is still a difference between:

function token_mt :__tostring() return "true" end

-- and

function token_mt :__tostring() return true end

I pushed a change (in rawstring branch) that adds rawstring parameter:

print(s.block(stream, {rawstring = true}))

-- generates

{
  token(Keyword, local),
  token(Id, x)
} --[[table: 00C8C248]]

You can make it a default for all methods, so that your users don't need to specify it.

Will this work?

agladysh commented 10 years ago

I'm not sure if this is a correct approach. I think this should be configurable on per-object level. It makes little sense to me to configure a thing like this once per serialization call.

pkulchenko commented 10 years ago

I thought so as well; for example, adding something like __raw attribute to the metatable would signal that the result string should not be quoted (and we'd not need the serialization parameter), but I'd prefer not to clash with (possible) Lua values, like __mode. Maybe _raw or _rawstring?

agladysh commented 10 years ago

For example, when you use a console in the IDE to execute commands, is the result serialization or pretty-printing?

There is yet another use-case for pretty-printing while serializing — human-readable config files. This is a primary reason why in Lua Núcleo we have tpretty as well as tserialize (along with debug visualization of complex tables, of course).

So, we have at least three reasons to convert a Lua object to string:

Human-readable serialization and visualization should be highly configurable to conform to usage modes (single-line logging, multi-line config files) and to coding guidelines (indentation rules, line width etc.)

agladysh commented 10 years ago

Maybe I'm missing something (didn't study serpent design too close yet, sorry), but I'm not sure why __tostring and __serialize are supported by default at all.

I'd start with __serpent metamethod, which will have (read-only) access to the configuration and will allow object author emit custom serialized data — to be pasted literally to the resulting string.

Then I'd consider enabling __tostring support for human-readable visualization by default — but only there.

To me __serialize metamethod name looks too generic — there is no well-established protocol for it, as far as I'm aware. But maybe it will make sense to support it as a generic serialization fallback for serialization (and only there), accepting no parameters and returning a string to, again, be pasted literally to the output.

agladysh commented 10 years ago

Furthermore, all this metamethod handling is a sugar that will slow down the basic implementation. Why not leave it to user to sort out in a custom serialization callback function? (Well, OK, and maybe provide some useful common implementations in an auxiliary module...)

agladysh commented 10 years ago

Maybe _raw or _rawstring?

Please don't use such generic names to control module-specific behaviour.

If you really want a flag, use __serpent_raw or something.

agladysh commented 10 years ago

BTW, while we're considering __serpent (or whatever) custom API, there is a common usage pattern — at least in our code.

When I'm serializing to an output stream (say, stdout), I prefer to write there directly to be easier on GC. Compare Núcleo's tstr and tstr_cat.

pkulchenko commented 10 years ago

It makes little sense to me to configure a thing like this once per serialization call. vs. This is a primary reason why in Lua Núcleo we have tpretty as well as tserialize...

This is the same thing that serpent does with different methods and options it provides.

Using per-object configuration allows one to mix objects with different configurations in one serialization call, although I'm not sure how useful it is.

Human-readable serialization and visualization should be highly configurable to conform to usage modes (single-line logging, multi-line config files) and to coding guidelines (indentation rules, line width etc.)

Yes, serpent also supports all these combinations except limiting line width.

When I'm serializing to an output stream (say, stdout), I prefer to write there directly to be easier on GC. Compare Núcleo's tstr and tstr_cat.

This is a good point; it should be possible to write a wrapper around serpent that will allow "pulling" of the next token, allowing to do whatever desired with it and limiting memory use to the token being returned.

@fab13n, if you prefer not to add the rawstring option to the serialization call, then I can go back to the original proposal and to use __serialize call to provide literal serialization. In the current logic __serialize metamethod (if present) takes precedence over __tostring. I don't see a problem with rawstring option as it can be made a default, but @agladysh is right that in this case it will be applied to all __tostring/__serialize results.

agladysh commented 10 years ago

Yes, serpent also supports all these combinations except limiting line width.

Is this hard to implement? tpretty even places short (sub-)entries on a single line, and uses multi-line serialization for longer ones (the implementation is rather messy though).

pkulchenko commented 10 years ago

Is this hard to implement?

No, but it hasn't been a popular feature so far. Also, the original goal for Serpent was to provide "good enough" implementation (in the sense of "good" correctness and "enough" features for majority of users), while still staying small. It's not so small anymore, but I'm hesitant to add features "just in case". I usually look for specific use cases or user requests.

agladysh commented 10 years ago

BTW, there is yet another use-case "axis" for human-readable part — fast and good-enough vs. slow and really nice-looking.

When I'm writing data to a log file, I need it to be human-readable, but I need to write it really fast, even at expense of certain features (tstr).

When I'm generating code (like a configuration file), I need to do it prettily and I do not care for the speed (as much): tpretty.

Interestingly enough, tserialize is actually almost useless — loading back Lua code is too slow when compared with other machine-readable solutions (like, say, luatexts or msgpack).

agladysh commented 10 years ago

I usually look for specific use cases or user requests.

I'm considering dropping tpretty in favor of serpent someday. Lack of line limit is a blocker, since we need to generate code that is compliant to our coding guidelines. So, consider this a +1 for that feature. :)

pkulchenko commented 10 years ago

Lack of line limit is a blocker, since we need to generate code that is compliant to our coding guidelines. So, consider this a +1 for that feature. :)

That's good as a feature request ;).

Good point on fast vs. pretty, although it's difficult to say ahead of time what features can be skipped to make things fast(er).

agladysh commented 10 years ago

Using per-object configuration allows one to mix objects with different configurations in one serialization call, although I'm not sure how useful it is.

Well, to me __tostring in serialization/visualization context seems to be useful only as a clutch for a quick-and-dirty visualization. Given that, I would say that the per-call configuration is meaningless.

I see several major use-cases (examples in pseudocode, I've actually needed everything listed here at one time or another):

  1. Mutate object on serialization.

    __serpent = function(self)
     -- Rest of the fields are ephemeral
     return { x = self.x, y = self.y } 
    end
  2. Provide a constructor function for serialization.

    __serpent = function(self, serpent)
     local data = tclone(self)
     data.tag, data.id = nil, nil
     return self.tag, " ", serpent.short_string(self.id), serpent.newline, data
    end
    
    --- For self = { tag = "foo:bar", id = "baz", quo = 42 }:
    ---
    --> foo:bar "baz"
    --> {
    -->   quo = 42;
    --> }
  3. Control formatting for human-readable serialization.

    __serpent = serpent:format_fields { type = serpent.long_string }
    
    --- To be put on `data` above to get:
    --> foo:bar "baz"
    --> {
    -->   type = [[mystring]];
    -->   quo = 42;
    --> }
  4. Provide custom visualization.

    __serpent = function(self, serpent)
     if not self.tag then
       return self
     end
    
     return "`", self.tag, " ", serpent:comma_list(self)
    end

Not a single use-case that I see now requires non-raw pasting.

What am I missing?

agladysh commented 10 years ago

Furthermore, __tostring is a potential source of bugs here — it is so generic. If I'm using serpent for serialization, I must add __serialize to every object that might have __tostring in its metatable — even on a object from foreign module. And there is no saying what metatamethods and for what reason foreign module objects would have.

(I also updated the post above a bit.)

agladysh commented 10 years ago

Item 3 above also may include custom field order for an object, BTW, — useful for DSLs, where field order may be determined by the purpose of the field (say, first common fields, then specialized for this specific construct), thus be dependent on tag field value...

agladysh commented 10 years ago

(Edited that big post with use-cases again.)

agladysh commented 10 years ago

...And here is a problem with using a metamethod instead of a callback in serialization. Generally, you want to save the object in such way that, when loaded back, it will be able to save itself again.

This means that this doesn't cut the mustard:

__serpent = function(self)
  return { x = self.x, y = self.y } 
end

You always need a constructor that will set a metatable:

__serpent = function(self)
  return "make_foo", " ", { x = self.x, y = self.y } 
end

This is not a problem for visualization — you don't need to load anything back. But for serialization you're better off using a callback function. This way you would not need to set any metatables on load, and can just emit good old table literals.

pkulchenko commented 10 years ago

Well, to me __tostring in serialization/visualization context seems to be useful only as a clutch for a quick-and-dirty visualization. Given that, I would say that the per-call configuration is meaningless.

It's actually something that many users find useful/helpful. Some use it for their own values that already provide __tostring (pkulchenko/ZeroBraneStudio#77) and some use it to serialize and print their userdata (for example, Marmalade Quick).

I see several major use-cases (examples in pseudocode, I've actually needed everything listed here at one time or another):

That's a lot to review and digest; thank you for the summary. What's __serpent in these examples?

agladysh commented 10 years ago

__serpent is a fancier __serialize metamethod — I used a different name here so we'll not be bound by an existing contract while discussing the matter.

agladysh commented 10 years ago

It's actually something that many users find useful/helpful. Some use it for their own values that already provide __tostring (pkulchenko/ZeroBraneStudio#77) and some use it to serialize and print their userdata (for example, Marmalade Quick).

The usage of the __tostring in general is not useful for serialization. Look at luasocket, for example. You can't load that back.

True, in certain closed ecosystems it is possible to institute a stricter contract. But in general, IMO, a serialization code should not look at __tostring at all — unless user explicitly configures it to do that.

Visualization is debatable.

But, if I were writing a debugger, I would, first of all, provide my own callback, so users could specialize their code. Besides, what if __tostring for a given object has side-effects, for example? It is not forbidden by the general __tostring contract. So, if mobdebug uses __tostring for value visualization, IMHO, this is a dangerous misfeature that can cost a careless and unlucky user many hours of pain.

pkulchenko commented 10 years ago

serpent is a fancier serialize metamethod — I used a different name here so we'll not be bound by an existing contract.

That's what I suspected, but I thought you were arguing against adding metamethods to all the objects: "If I'm using serpent for serialization, I must add serialize to every object that might have tostring in its metatable — even on a object from foreign module."

It seems like there are three main options for implementing that:

  1. Using __tostring; this should be good enough for 90% of the users, maybe more. The current issue (at least for Fabien) is that the output of __tostring is not literal.
  2. Using special metamethod, like __serpent (or existing __serialize). This should cover all the special cases (and should use literal output), but needs to be installed for every object that requires that treatment.
  3. Using special callback, probably based on a type. This allows the callback to be specified only once (instead of attaching the metamethod) and can be hidden as one of the default options (in Fabien's case). This is similar to the current custom option, but it's only triggered for "plain" tables.
agladysh commented 10 years ago

(BTW, please don't let my ramblings to distract you from giving Fabien a working solution for his immediate problem :-) We can discuss theory for ever, but it is practice that is important.)

pkulchenko commented 10 years ago

But, if I were writing a debugger, I would, first of all, provide my own callback, so users could specialize their code. > Besides, what if tostring for a given object has side-effects, for example? It is not forbidden by the general tostring contract. So, if mobdebug uses __tostring for value visualization, IMHO, this is a dangerous misfeature that can cost a careless and unlucky user many hours of pain.

In what way? This is not too different from using print(foo).

agladysh commented 10 years ago

That's what I suspected, but I thought you were arguing against adding metamethods to all the objects: "If I'm using serpent for serialization, I must add serialize to every object that might have tostring in its metatable — even on a object from foreign module."

This was an argument against __tostring on serialization.

pkulchenko commented 10 years ago

BTW, please don't let my ramblings to distract you from giving Fabien a working solution for his immediate problem :-) We can discuss theory for ever, but it is practice that is important.

;). I'll give Fabien a chance to respond to all the suggestions we discussed...

fab13n commented 10 years ago

If we're talking about serpent-specific callbacks, having a 'custom' option which doesn't add quotes around my stuff if I didn't ask for them is good enough to me.

I still believe that forcing quotes around tostring results is wrong, but I can live with other peole being wrong :) On Dec 30, 2013 11:31 PM, "Paul Kulchenko" notifications@github.com wrote:

serpent is a fancier serialize metamethod — I used a different name here so we'll not be bound by an existing contract.

That's what I suspected, but I thought you were arguing against adding metamethods to all the objects: "If I'm using serpent for serialization, I must add serialize to every object that might have tostring in its metatable — even on a object from foreign module."

It seems like there are three main options for implementing that:

  1. Using tostring; this should be good enough for 90% of the users, maybe more. The current issue (at least for Fabien) is that the output of tostring is not literal.
  2. Using special metamethod, like serpent (or existing serialize). This should cover all the special cases (and should use literal output), but needs to be installed for every object that requires that treatment.
  3. Using special callback, probably based on a type. This allows the callback to be specified only once (instead of attaching the metamethod) and can be hidden as one of the default options (in Fabien's case). This is similar to the current custom option, but it's only triggered for "plain" tables.

— Reply to this email directly or view it on GitHubhttps://github.com/pkulchenko/serpent/issues/10#issuecomment-31373122 .

pkulchenko commented 10 years ago

If we're talking about serpent-specific callbacks, having a 'custom' option which doesn't add quotes around my stuff if I didn't ask for them is good enough to me.

So, what I have in the rawstring branch would then work for you?

print(s.block(stream, {rawstring = true}))
-- generates
{
  token(Keyword, local),
  token(Id, x)
} --[[table: 00C8C248]]
agladysh commented 10 years ago

But for serialization you're better off using a callback function. This way you would not need to set any metatables on load, and can just emit good old table literals.

Come to think about it, there is a catch here. Downside of a callback function is that it is centralized — and forces tight code coupling. If your objects come from different modules, you have to be aware of each one of them in the callback.

Now you do dispatch by metamethods, which is nice when you have a metatable. And you often do, actually, when serializing a complex object, — despite what I said earlier. So, it is probably better to keep supporting metamethod dispatch.

The argument against metamethods comes from the use-case when I don't have a "natural" metatable on the object and don't want to add it — DSL-ization of the data (see use-cases 1-3 above).

I do my dispatch by the tag field, and the serialization code should do that too.

For that use-case a centralized callback function is OK, I can just skip metamethods altogether.

However, in general my object factories do not use metatables too, unless it is absolutely necessarily, — in plain Lua 5.1 it is usually faster to manually fill a table with methods than to call setmetatable.

So, for example, what if I want to put my serialization code to the :save() method instead of __serialize metamethod? And what if I want to change the dispatch behavior for "my" objects only?

This is getting a bit contrived though...

agladysh commented 10 years ago

So, if mobdebug uses __tostring for value visualization, IMHO, this is a dangerous misfeature that can cost a careless and unlucky user many hours of pain.

In what way? This is not too different from using print(foo).

Don't know about others, but I wouldn't intentionally call print() on a variable that I would suspect that it holds a non-trivial value.

Anyway, a good debugger, IMHO, does not have any right to affect the value that is being inspected in any way without explicit user command.

What if I'm debugging __tostring itself?

__tostring = function() _G.table = nil end;
__tostring = function() do_segmentation_fault() end;
__tostring = function() return 0/0 end; -- Assuming MSVC and bad coprocessor flags.
__tostring = function() eat_all_memory() end; 
__tostring = function() error("oops!") end;

To be even more realistic, what if __tostring has an internal cache that you would populate or invalidate by your call?

A debugger does not have a right to assume that __tostring does not have any side effects. Usually it does not. But it is not enough.

fab13n commented 10 years ago

My use-case is handled adequately by custom. I still can't follow your logic behind not making "rawstring" the default behavior, but that's not critical.

One design consideration, though: __tostring is a pretty-much standard metamethod, since it's used by tostring() which is a standard function. But adding more meta-method, such as __serialize and even more so __serpent, should be done with some restraint: it adds some coupling between serpent and libraries which could have been left independent. I'd even say it encourages monkey-patching of metatables.

agladysh commented 10 years ago

@fab13n sure, __serpent is tight coupling. But I do not see how it can be avoided — unless you happen to invent a generic API somehow.

Serialization of complex business-logic object is a complicated and messy business.

If you want it to be performance-effective, you must use advanced features, which almost always are specific to the serialization library.

Look at how many use cases we listed above. Most need their own API.

Add to the mix low level data formats (i.e. JSON instead of Lua), format versioning and support for older versions and all other stuff — and you'll see why I say that __serialize is a bad name, too generic for for a library that does a specific and limited kind of serialization.

And __tostring, as I elaborated above, IMO, is not suitable for serialization at all.