nuttycom / pickle

A delightful little markup language.
http://nommit.com/pickle

issues when storing data #2

Open jdegoes opened 13 years ago

jdegoes commented 13 years ago

In XML, there is no syntax-level way to distinguish between a record (whose children are fields) and a collection (whose children are elements of the collection). In JavaScript, there are two first-class constructs that make this distinction: objects and arrays.

In XML, what people sometimes do is invent useless tags such as in order to capture the proper semantic, and even then, there is no syntax-level guarantee that non tags won't be stored together with the magical tags, leading to a non-sensical interpretation.

It will be similarly painful to represent data in Pickle without a first-class notion of collection.

nuttycom commented 13 years ago

That seems like the job of a tag handler, to enforce collection semantics. I don't think it belongs at the syntax level, since every document is implicitly a collection to begin with.

nuttycom commented 13 years ago

To elaborate, the only situation where I can see special syntax being desirable is where you want a collection of primitive values. But in that case, the tag handler for the collection could do whatever parsing is appropriate to decompose its value into a collection of values. I can see arguments both ways. JSON doesn't have any semantics except for the distinction between collections and records. Pickle doesn't have any built-in semantics at all; instead, it provides a framework where you supply the semantics that you need. I provide some sensible default handlers, but you don't have to use them. Will require more thought.
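
A minimal sketch of the handler-based approach described here, in Scala. Pickle's actual tag handler API does not appear in this thread, so `splitValues` and the `;` delimiter are hypothetical and only illustrate a handler decomposing one primitive value into several.

```scala
// Hypothetical helper, not part of Pickle: a handler for a collection tag
// receives the tag's raw text and decomposes it into individual values.
def splitValues(raw: String, delimiter: Char): List[String] =
  raw.split(delimiter).toList.map(_.trim).filter(_.nonEmpty)

// The delimiter is the handler's choice; ';' is an arbitrary assumption here.
val addresses = splitValues("123 Foo Street; 123 Bar Avenue", ';')
```

The trade-off is that the delimiter lives in each handler rather than in the syntax, so two handlers can disagree about how a collection is written.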

jdegoes commented 13 years ago

The issue is that, when modeling data, records and collections are so ubiquitous that syntax-level support for them dramatically simplifies common use cases. Yes, in XML and Pickle, everything is a container with children. But in data modeling, there are two base container types, which are substantially different, and trying to treat collections as just another kind of record results in all sorts of chaos (such as the above).

As another example:

In XML, the only way to store data is to wrap it in a named tag. Normally, this is not a problem, but it is a serious problem for the root node. There can only be one root node, so information communicated by the tag name is often useless.

This requirement is so inconvenient in XML libraries that usually XML libraries introduce some kind of bogus node, which is not a true node (it cannot be serialized to XML), and which has no name. This node represents a collection of XML nodes, and it can be manipulated similarly to named nodes with children.

This design smell is more evidence in support of my position that the lack of collections in XML is a mistake.

Personally, I really like the clean syntax of Pickle (it's beautiful!), but would not choose it for serialization unless it acquired syntax-level support for collections. It needn't clutter syntax, e.g.:

#[123 Foo Street | 123 Bar Avenue]

@[person |]
  @addresses[#[123 Foo Street | 123 Bar Avenue]]
@[/person]

nuttycom commented 13 years ago

Interesting. Your example is the case that I was talking about, where you have a collection of primitive values without additional metadata, except that provided by their container (by the way, you can use @[person] as the opening tag if you're not going to have any metadata.) And I really like the look of the syntax you're proposing.

Okay, I'm 95% convinced. :)

nuttycom commented 13 years ago

I've added support for this style of usage on the primitives_syntax branch. I'm still not 100% convinced; at the very least, I think the syntax needs tweaking before I merge it to master, because it breaks symmetry in a way: elsewhere, the pipe is always used as a delimiter between data and metadata, whereas here it's between data and data. Maybe '#' should be used as the separator instead?

I'm also going to try the approach of implementing this functionality entirely in a tag handler.

Damn it, I was supposed to be doing taxes! :)

nuttycom commented 13 years ago

How would you feel about this:

 123 Foo Street @@ 123 Bar Avenue

 @[person]
    @addresses[123 Foo Street @@ 123 Bar Avenue]
 @/

jdegoes commented 13 years ago

Awesome. I chose "|" to minimize reserved symbols, but "#" could be used as well.

Actually, I really like "<" for metadata instead of "|", because "|" suggests parallel structure on the LHS and RHS, but "<" suggests the RHS is a description of the LHS.

@[person < @type[employee]]...

In any case, minor details, and if Pickle supports collections & continues moving forward, BlueEyes will get support for Pickle. :)

jdegoes commented 13 years ago

Works for me. Although, the symbol "|" has a long history of being used to separate items in a collection or to describe branches of identical structure. In my mind, the best symbol to introduce metadata is ":", but it's more common in raw text than "|". I also like "<" somewhat because it indicates description. "#" suggests a comment, which is in fact a form of metadata. So maybe "#" for metadata and "|" for element separation?

@[person # @type[employee]]
  @addresses[123 Foo Street | 123 Bar Avenue]
/@

nuttycom commented 13 years ago

That wins.

nuttycom commented 13 years ago

At least, I love it for the short form syntax. For the long form it seems like it might get a bit lost:

@[person # @type[employee]]
  Some long rambling text Some long rambling text Some long rambling text Some long rambling text 
  Some long rambling text Some long rambling text Some long rambling text Some long rambling text 
  |
  Some long rambling text Some long rambling text Some long rambling text Some long rambling text 
/@

I don't see the collection syntax being used as much in the long form, but I think it should be there. One could use @@ for the long form delimiter, and support both @@ and | in the short form so that people could choose whichever seems more consistent to them.

jdegoes commented 13 years ago

Will nameless nodes be supported? These make a lot of sense for root nodes:

@[
  foo |
  bar |
  ... thousands of the above ...
]

Would metadata be allowed for collections?

@addresses[123 Foo Street | 123 Bar Avenue # @type[String]]

And finally, what does the addressing scheme look like? Every node and attribute should be uniquely addressable. Stealing some CSS:

@person @addresses#type
@person @addresses(3)

???

The difficult part is that, since a node's children can share the same name, you need a way to disambiguate them in a select-one operation:

@html @body @p(1) a(2)
@html @body @p(1) a(2)#href(2)

???

nuttycom commented 13 years ago

For root nodes, you can just omit the markup entirely:

  foo |
  bar |

That parses as:

Doc(Primitive("foo"), Primitive("bar"))

For the second question,

@addresses[123 Foo Street | 123 Bar Avenue # @type[String]]

is totally valid. This parses as:

Doc(
  Complex(
    Tag("addresses", Metadata(Complex(Tag("type"), Doc(Primitive("String"))))),
    Doc(Primitive("123 Foo Street"), Primitive("123 Bar Avenue"))
  )
)

Addressing isn't baked yet. What I know at this point is that the addressing scheme will have to have a bit more structure than CSS, because you can descend into a metadata tree to select a node.
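
To make the parse tree above concrete, here is a minimal Scala sketch of data types that can hold it. The constructor names come from the output shown above, but their definitions are assumptions (for example, `Doc` and `Metadata` are modeled as plain case classes with varargs); Pickle's real AST may differ.

```scala
// Assumed shapes, chosen only so the parse output above type-checks.
sealed trait Section
final case class Primitive(s: String) extends Section
final case class Complex(tag: Tag, doc: Doc) extends Section
final case class Doc(sections: Section*)
final case class Metadata(sections: Section*)
final case class Tag(name: String, metadata: Metadata = Metadata())

// The tree for: @addresses[123 Foo Street | 123 Bar Avenue # @type[String]]
val parsed = Doc(
  Complex(
    Tag("addresses", Metadata(Complex(Tag("type"), Doc(Primitive("String"))))),
    Doc(Primitive("123 Foo Street"), Primitive("123 Bar Avenue"))
  )
)

// The collection elements are the primitives in the tag's body.
val addresses: List[String] = parsed.sections.toList.collect {
  case Complex(Tag("addresses", _), doc) => doc.sections.collect { case Primitive(s) => s }
}.flatten

// The metadata is itself a small document that can be searched the same way.
val elementType: Option[String] = parsed.sections.collectFirst {
  case Complex(Tag("addresses", meta), _) =>
    meta.sections.collectFirst { case Complex(Tag("type", _), Doc(Primitive(t))) => t }
}.flatten
```

Because metadata is a nested tree rather than a flat attribute list, any addressing scheme has to be able to step into it, which is the extra structure mentioned above.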

jdegoes commented 13 years ago

Agreed on addressing scheme. In this example, I selected the 3rd 'href' node inside the metadata for the 3rd 'a' node in the 2nd 'p' node inside the 'body' node inside the 'html' node:

@html @body @p(1) a(2)#href(2)

Actually, probably you should think about immediate child versus possibly-distant descendant right from the start. CSS uses space for descendant and ">" for child:

@html > @body > @p(1) > a(2)#href(2)

A Scala DSL will likely use symbols & can benefit from non-space operators:

'html > 'body > 'p(1) > 'a(2) # 'href(2)
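
A rough Scala sketch of how such a path could be modeled and resolved. Everything here is hypothetical: the simplified `Complex` node, `Step`, `Path`, `root`, `node`, and `select` are illustration only, and only the direct-child step is covered. Two deviations from the line above: Scala reserves `#` as a token, so metadata descent is a method named `meta`, and plain strings replace symbol literals, which later Scala versions dropped.

```scala
// Simplified, assumed node shape: a name, a metadata list, and children.
sealed trait Section
final case class Primitive(s: String) extends Section
final case class Complex(name: String, meta: List[Section], children: List[Section]) extends Section

// One path step: the index-th (zero-based) node with the given name.
final case class Step(name: String, index: Int = 0, inMeta: Boolean = false)

final case class Path(steps: List[Step]) {
  def >(next: Step): Path = Path(steps :+ next)                         // direct child
  def meta(next: Step): Path = Path(steps :+ next.copy(inMeta = true)) // into metadata
}

def node(name: String, index: Int = 0): Step = Step(name, index)
def root(name: String, index: Int = 0): Path = Path(List(Step(name, index)))

// Resolve a path against top-level sections; None if any step fails to match.
def select(roots: List[Section], path: Path): Option[Complex] = {
  def named(sections: List[Section], step: Step): Option[Complex] =
    sections.collect { case c: Complex if c.name == step.name => c }.lift(step.index)

  path.steps match {
    case Nil => None
    case first :: rest =>
      rest.foldLeft(named(roots, first)) { (current, step) =>
        current.flatMap(c => named(if (step.inMeta) c.meta else c.children, step))
      }
  }
}

val link = Complex("a", List(Complex("href", Nil, List(Primitive("http://nommit.com/pickle")))), List(Primitive("Pickle")))
val doc = List(Complex("html", Nil, List(Complex("body", Nil, List(Complex("p", Nil, List(link)))))))

val path = (root("html") > node("body") > node("p") > node("a")).meta(node("href"))
val href = select(doc, path)
```

Indices are zero-based here, matching the example above in which `(2)` selects the third node. A descendant step would need a recursive search and is left out.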

If you're keeping up with Anti XML, you can see some of the issues with XPath & hopefully avoid them in an addressing / querying scheme.

nuttycom commented 13 years ago

Collection support and the new metadata delimiter are now in master. Thanks for the input!

jdegoes commented 13 years ago

This is great news! Is there a reason collection support is limited to primitives? In some cases I would do:

@employees[
    @name[John Doe]
    @address[221 B Baker Street]
    |
    @name[Mary Jane]
    @address[221 C Baker Street]
    |
    ...
# @collectionType[Set] @elementType[Employee]]

In this case, Employee is not polymorphic and there is no benefit to storing additional information aside from the fields of each employee (in JSON, the fields would be wrapped in a nameless object).

Finally, I might make one much more radical suggestion: require "|" to separate all child elements of all nodes. Then primitives are treated uniformly with non-primitives and the syntax becomes more uniform.

@foo[
  a | b | c
]

@foo[
  @a[] | @b[] | @c[]
]

@foo[
  @a[] @z[] | @b[] @z[] | @c[]@z[] 
]

nuttycom commented 13 years ago

Actually, just yesterday I ran into the situation where I wanted to do exactly that (have non-primitive collections) so I'm in the process of adding it, but I have been fighting with an ambiguity... that I think your second suggestion will resolve. So yup, it's a go.

nuttycom commented 13 years ago

So, I've come up with something of a middle way that I'd like to invite feedback on. Essentially, Pickle is intended as a document markup language more than it is a structured data format. I think that it can serve as both, but I think it's necessary to do so at the proper level of abstraction. And I think that the correct thing to do is to split the handling of collection elements between the syntactic and semantic layers.

First, the problem. The present AST is this:

trait Doc extends Seq[Section]
sealed trait Section
case class Complex(tag: Tag, doc: Doc) extends Section
case class Primitive(s: String) extends Section

My first approach was to change that to be the following:

trait Doc extends Seq[Section]
sealed trait Section
case class Complex(tag: Tag, docs: Seq[Doc]) extends Section
case class Primitive(s: String) extends Section

The problem with this is that it strongly biases the AST in favor of collection types. This:

@foo[a | b | c]
@bar[x]

will parse as:

Doc(
  Complex(Tag("foo"), List(Doc(Primitive("a")), Doc(Primitive("b")), Doc(Primitive("c")))),
  Complex(Tag("bar"), List(Doc(Primitive("x"))))
)

The problem I see is that, given this, every tag handler will have to either provide collection semantics for its contents or at least test that the list contains only a single value. Writing the handler for "bar" means that you essentially have to raise an error if there's more than one value, and this test would have to be included in every unitary tag handler. Rather than add this burden to the implementer of a tag handler, I'd rather make it easy for a tag handler to mix in the ability to handle collection semantics. The approach I'm going to take is to add a member to the Section ADT that is a separator:

sealed trait Section
case class Complex(tag: Tag, doc: Doc) extends Section
case class Primitive(s: String) extends Section
case object Separator extends Section

The above example would then parse as:

Doc(
  Complex(Tag("foo"), Doc(Primitive("a"), Separator, Primitive("b"), Separator, Primitive("c"))),
  Complex(Tag("bar"), Doc(Primitive("x")))
)

This is lower level, but it is easy to convert to the collection representation in the semantic layer. Furthermore, this opens up another possibility for the semantic layer: record semantics. I think it'd be really cool to be able to have this make sense:

@address[123 45th St. | Boulder | CO | 80305]
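
A minimal Scala sketch of what the semantic layer could do with a flat, separator-delimited body, first as a collection and then as a record. The helpers `splitOnSeparator` and `asRecord`, the field names, and the whitespace trimming are assumptions for illustration, not part of Pickle.

```scala
sealed trait Section
final case class Primitive(s: String) extends Section
case object Separator extends Section

// Collection semantics: split a flat run of sections into one group per
// element. Consecutive separators produce empty groups.
def splitOnSeparator(sections: List[Section]): List[List[Section]] =
  sections.foldRight(List(List.empty[Section])) {
    case (Separator, groups)        => Nil :: groups
    case (section, current :: rest) => (section :: current) :: rest
    case (section, Nil)             => List(List(section)) // not reached: accumulator is never empty
  }

// Record semantics: pair positional values with field names that the tag
// handler knows about. Returns None when the arity does not match.
def asRecord(fields: List[String], sections: List[Section]): Option[Map[String, String]] = {
  val values = splitOnSeparator(sections).map(_.collect { case Primitive(s) => s.trim }.mkString)
  if (values.length == fields.length) Some(fields.zip(values).toMap) else None
}

// Assumed body for: @address[123 45th St. | Boulder | CO | 80305]
val body = List(
  Primitive("123 45th St. "), Separator,
  Primitive(" Boulder "), Separator,
  Primitive(" CO "), Separator,
  Primitive(" 80305")
)

val record = asRecord(List("street", "city", "state", "zip"), body)
```

A unitary handler such as the one for "bar" never needs to call either helper, so it carries no extra checks unless it chooses to reject separators.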

What do you think?

jdegoes commented 13 years ago

Makes sense. It's lower level, but at the same time is a very straightforward modeling of the actual structure of the document and doesn't throw away any information, so it's easy to build higher-level semantics on top of this.

I WOULD allow consecutive separators:

@address[|||]
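
A short Scala sketch of why consecutive separators are useful for positional data: each gap becomes an explicit empty slot instead of disappearing. The `slots` helper is hypothetical, and the sketch assumes `@address[|||]` parses to a body of three `Separator` values.

```scala
sealed trait Section
final case class Primitive(s: String) extends Section
case object Separator extends Section

// One slot per position between separators; a position with no content
// becomes None rather than being dropped.
def slots(sections: List[Section]): List[Option[String]] = {
  val (head, tail) = sections.span(_ != Separator)
  val text = head.collect { case Primitive(s) => s.trim }.mkString(" ")
  val slot = if (text.isEmpty) None else Some(text)
  tail match {
    case Separator :: rest => slot :: slots(rest)
    case _                 => List(slot)
  }
}

// Assumed body for @address[|||]: three separators, four empty slots.
val empty = slots(List(Separator, Separator, Separator))

// A partially filled record keeps its positions.
val partial = slots(List(Primitive("123 45th St."), Separator, Separator, Primitive("CO"), Separator, Primitive("80305")))
```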