krakenjs / spud

A content store parser, reading a Java .properties-like format

Support for per-key metadata & processing directives (breaking changes) #13

Open aredridel opened 10 years ago

aredridel commented 10 years ago

Erik just suggested something I find compelling.

What about allowing metadata in the key (or, I suggest, the value) of the .properties format? (It could be carried over to the JSON format, too):

akey{translate=false}=A Key

I suggested putting the metadata in the value to simplify consumption, so that a .properties reader can ignore the metadata entirely and defer stripping it until lookup time:

akey={notranslate}A Key

Specific flags could be defined later, with specific meanings.
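
A minimal sketch of what that lookup-time stripping might look like (hypothetical helper, not part of spud; assumes flags appear as a single leading {flag,flag} block):

    var FLAG_PREFIX = /^\{([^}]*)\}/;

    // A plain .properties reader stores the raw value untouched; only the
    // lookup path needs to know about the flag block.
    function lookup(bundle, key) {
        var raw = bundle[key];
        if (typeof raw !== 'string') return raw;
        var match = FLAG_PREFIX.exec(raw);
        if (!match) return { value: raw, flags: [] };
        return {
            value: raw.slice(match[0].length),
            flags: match[1].split(',')   // e.g. ['notranslate']
        };
    }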

As far as processing directives, it was suggested to namespace keys for special behavior with @:

@include=../../US/en/base.properties

I heartily agree.

There's a question about whether we should support multiple @include in the same file -- if each uses @include=, it can't be done as post-processing of the parsed .properties, because only one key with that name survives parsing; it would have to become a lexical extension.
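
As a sketch of the lexical-extension route (hypothetical, not spud's implementation): expand every @include line before normal parsing, so duplicate keys never arise:

    var fs = require('fs');
    var path = require('path');

    // Recursively splice included files in at the point of the @include
    // line, so the regular parser never sees duplicate @include keys.
    function expandIncludes(file, seen) {
        seen = seen || {};
        if (seen[file]) throw new Error('circular @include: ' + file);
        seen[file] = true;
        return fs.readFileSync(file, 'utf8').split('\n').map(function (line) {
            var m = /^@include=(.+)$/.exec(line.trim());
            if (!m) return line;
            return expandIncludes(path.resolve(path.dirname(file), m[1]), seen);
        }).join('\n');
    }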

Specific questions: What sigil to use for directives -- $, @, _ ? Any special syntax for multiple includes? Do we need multiple includes? Does the suggestion for per-value flags make sense? Is there a better one?

Would we want to introduce optional quoting of values to handle significant whitespace?

kkleidal commented 10 years ago

I like the idea of embedded metadata. As a side note: the state-based customization I proposed in #12 would be able to handle key- or value-embedded metadata. You would just have to override the processKeyValue function accordingly. That would allow the user to choose how metadata is represented (in the key, in the value, in the comments, etc.). The upside is more flexibility; the downside is less standardization.
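
For example (the processKeyValue signature here is assumed for illustration; see #12 for the actual hook):

    // Assumed hook shape -- strips key-embedded metadata such as
    // akey{translate=false} into a side table before storing the pair.
    var META_IN_KEY = /^([^{]+)\{([^}]*)\}$/;

    function processKeyValue(key, value, result) {
        var m = META_IN_KEY.exec(key);
        if (m) {
            result.meta = result.meta || {};
            result.meta[m[1]] = m[2];   // e.g. { akey: 'translate=false' }
            key = m[1];
        }
        result[key] = value;
    }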

aredridel commented 10 years ago

Yeah, I don't think that's a thing we'd want to flex without a good use-case for that variation. At a certain point you're not making a file format but a family of related ones. Expressivity goes up; meaning you can get from it goes down.

aredridel commented 10 years ago

@kkleidal I'd love push-back on any of this -- jump in, please! If we're going to break compatibility, it'd be nice to have a maintainable format, processable by a wide variety of tools, and not difficult to work with.

kkleidal commented 10 years ago

Key-embedded metadata sounds promising. It sounds very similar to XML attributes, and the '{' brackets don't seem to interfere with the rules already in place for key validation. In contrast, the '{' brackets would interfere with L10n placeholder tags in values.

The only problem which persists is the question of losslessness. I know we've discussed readjusting that goal, but my manager just gave me new insight into why it has been a goal in the past. Historically, localization has aimed to be byte-perfect, meaning that PJMs/linguists could do a diff between translated and non-translated files and the only thing the diff should pick up is the translated strings themselves. If the deserialization/serialization isn't lossless, this process won't work as-is; it would require a new diff tool (and acceptance of the new diff tool). Abandoning losslessness would mean changing the process of localization (which isn't necessarily bad; it's just... well, change). I just want to offer up that perspective.

aredridel commented 10 years ago

Yeah. I think losslessness for values is super important -- but I don't think it is for metadata at all.

That said, validating work is important.

aredridel commented 10 years ago

The reason I hesitate is that losslessness is expensive. It implies certain shapes of the API, it restricts what must be stored in data structures, it's just a big chunk of stuff hanging out, but not relied on directly. That dependency is implicit, too. It can't be searched for, it's just something you have to maintain and not break, forever.

kkleidal commented 10 years ago

Well then maybe we should look into rewriting diff. I don't know how deep that rabbit hole runs though; I'll have to talk more with my manager and @diegone (who is out today).

aredridel commented 10 years ago

Yeah. That's a good concern -- it may also not be a long-term problem, if what's stored in repos tends to be the output of the process.

Perhaps adding 'diff' functions to spud is valuable. I think that's a good place for it.
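
A sketch of what such a diff might look like, assuming both files parse to flat key/value maps:

    // Compare two parsed bundles by key, ignoring serialization details
    // like whitespace and key order.
    function diffBundles(before, after) {
        var changes = [];
        var seen = {};
        Object.keys(before).concat(Object.keys(after)).forEach(function (key) {
            if (seen[key]) return;
            seen[key] = true;
            if (before[key] !== after[key]) {
                changes.push({ key: key, before: before[key], after: after[key] });
            }
        });
        return changes;
    }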

diegone commented 10 years ago

Here are my thoughts:

I think the real objectives here are:

Besides, here are some l10n concerns not currently addressed:

Some of the issues above are not spud-specific, so there lies a challenge in orchestrating multi-module changes.

aredridel commented 10 years ago

> hacking the value to something like {notranslate}A Key seems clever from an engineering perspective, in that the parsing/retention of meta is delayed until consumed/needed, but doesn't seem particularly logical/intuitive. Besides needing different, better separators so they won't be confused with placeholders, they'd need special processing so they are not exposed to linguists, and it would introduce new escaping requirements for the new separators

Yes. It would be a new file format, with proper support for metadata, explicit, not as "comments". Superficially similar, and pretty easy to learn.

aredridel commented 10 years ago

> enriching the key to akey{translate=false} starts being awfully similar to an SGML-like <akey translate=false>. Not sure the argument that Shift-, is much more user-friendly than Shift-[ will hold

Yes. Adding metadata explicitly looks something like this in any case.

aredridel commented 10 years ago
> • I wouldn't advocate it for enterprise use, but for a small project, even a CSV file would be better than this format:
>   • it's a standard format everybody understands
>   • it can be opened in Excel, and almost every l10n vendor supports it
>   • it's easy to add metadata to keys: just add a new column!

I'd strongly disagree here. CSV leaves major points unspecified: Character set, line ending, quote handling, how multi-line values are handled, what the columns mean, whether to use headers. There are very few interoperable CSV implementations. Excel and LibreOffice disagree on at least three of these points and require explicit configuration to override -- and in some cases, there is no option to override.

It's also distinctly difficult to get right when editing manually, and errors cause misparses rather than outright failures.

In addition, it's merely a transport format with no meaning. Any meaning has to be applied as another layer on top.

aredridel commented 10 years ago

> I think the argument that transformation needs to preserve space because we run diffs on them doesn't make much sense. If we have proper validation tools there is little reason to compare source files and target files. However, it makes sense to diff one version of a translated file against another, so we should guarantee that in that case the diff only shows meaningful changes (e.g. order of keys is preserved)

I'm inclined to agree.

aredridel commented 10 years ago

> preserving leading/trailing white-space in values is not desirable; in fact, it should be trimmed. L10n processes/linguists are not guaranteed to maintain non-linguistic spaces, so it should not be relied on

Makes sense, considering the number of layers that would have to preserve such things. We'll have to call this out explicitly; right now, spaces are preserved in the .properties format.

aredridel commented 10 years ago

> if we're going to come up with a new made-up format, we should have very good reasons why we can't use an existing format for which parsers/editors/validators/documentation may already exist and be well understood

:+1:

aredridel commented 10 years ago

> standardize in the file format l10n concerns, so that they are not custom hacks patched on by some l10n team but are officially sanctioned by the framework

Agreed. Having multiple groups responsible for different aspects of an underspecified file format is asking for trouble. "Those are just comments" from one perspective are "that's critical information" from another, for instance.

aredridel commented 10 years ago

> standardize an in-memory representation of resource files so that it can be consumed both by runtime and back-office needs => JSON has a JS Object/Array in-memory model, XML has a DOM in-memory model, what is the in-memory model of this format? If the answer is "depends" then it's difficult to build on top of. Not sure why we would want different in-memory models depending on use cases unless the overhead is unbearable

Consuming translations for display is in essence a hash table. Unordered, fast lookup, by key. In addition, we expect to be able to organize them in groups, and to iterate in order a few specific parts.

Consuming translations for translation is, in essence, a list, with metadata associated.

These are at odds with each other.
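
Concretely, the two consumers want differently shaped data (illustrative values only):

    // For display: an unordered map, optimized for lookup by key.
    var forDisplay = {
        'index.title': 'Welcome',
        'index.greeting': 'Hello'
    };

    // For translation: an ordered list of entries, each carrying metadata.
    var forTranslation = [
        { key: 'index.title', value: 'Welcome', meta: { translate: true } },
        { key: 'index.greeting', value: 'Hello', meta: { translate: false } }
    ];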

aredridel commented 10 years ago

> be resilient to merge issues: the format should prevent as much as possible issues caused by bad merges (e.g. the current translate begin/end is convenient but brittle; the alternative of a translate flag per key is more solid but cumbersome)

Good concern. This is among my discomforts with begin/end markers, and doubly so with ones that are 'advisory' rather than syntactic as in comments.

aredridel commented 10 years ago

> values are not standardized: need to know what's a placeholder, what's translatable, what encoding to apply (e.g. html entities, js escapes, html attributes, dust placeholders)

Agreed, though not sure it should be solved at this layer. We are talking about a markup language at this point.

totherik commented 10 years ago

This thread has gotten really off topic -- so much so that I'm inclined to lock it, as it no longer addresses the question at hand. This is an attempt to get it back on track.

Proposal

# `@` prefix to denote document-level metadata/pragmas
@include my/other/content.properties

# Use of `{}` delimited KVPs to denote key-level metadata
copyright{translate=false}=Copyright 2014, PayPal, Inc.

Request for Feedback

The question here is, does this proposal get us closer to where we need to be while leaving us the flexibility to expand upon it in the future?

If it does not, please state specifically which parts will not work and suggestions for altering it such that the design gets us closer to our goal.

(Note: this file format is already an implicit "spec" due to the use of utf8 and the inclusion of support for indexed and subscripted keys as discrete datatypes. As a result, this proposed change does not result in a "new format.")

diegone commented 10 years ago

What is the syntax/semantics of the @ proposal? Is it a k-v pair, or does it depend on the actual directive? What is the implied in-memory representation?

In other words: is @ semantically another comment or another key or something else?

aredridel commented 10 years ago

For the purposes of consuming content for display at runtime, @include otherfile will load the contents of the other file, filling in unset keys. Syntactically, it is not a content key-value pair.
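
In other words, something like this at load time (a sketch of the described semantics, not the implementation):

    // Keys from the included file fill in only where the including file
    // has not already set them.
    function applyInclude(target, included) {
        Object.keys(included).forEach(function (key) {
            if (!(key in target)) {
                target[key] = included[key];
            }
        });
        return target;
    }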

The syntax can be used for other document-level metadata.

Again, for the purposes of display, we eventually need an efficient map from key to value. If we define metadata that would be interpreted at that time, we will need an efficient map from key to an object representing value and line-level metadata.

For the purpose of translation and localization, I can see treating the files separately making a lot of sense, leaving their source structure intact.

There is no implication of in-memory representation in general.

diegone commented 10 years ago

copyright{translate=false} is better than a comment. How do we support multiple meta keys that may contain whitespace, like a localization note? It sounds like we may need quotes around values -- which also means we need some sort of escaping rules.

In terms of extensibility, it'd be nice to make the syntax/semantics extensible so it doesn't become a breaking change every time we introduce something new.

aredridel commented 10 years ago

Semantics are easier to make extensible than syntax, since there are opportunities for namespacing. I think this proposal provides that: document- and line-level metadata, extensible by allocating another name.

Quoting around values makes sense. How about JS-style quoting, as used in JavaScript property names? Optional for simple bare words, quoted and escaped otherwise:

PropertyName :
    IdentifierName
    StringLiteral
    NumericLiteral
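
Under those rules, metadata values could carry arbitrary text without ambiguity, since braces inside quotes wouldn't close the block -- hypothetical examples (locNote is an invented metadata name):

    copyright{translate=false}=Copyright 2014, PayPal, Inc.
    akey{locNote="free text with spaces, = signs, and even a } brace"}=A Key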

Syntax being extensible means you leave swathes 'undefined' and then carefully pick around any messes made. Let's avoid that.

diegone commented 10 years ago

Feel free to tell me to shut up if I'm off-topic.

The reason I was bringing up other requirements was to address the "is it extensible" question and highlight possible new breaking changes that may be required in the future. If we think we're done after this, we may be good to go. But if we think we're going to add more to it (e.g. a standardized message format), and potentially want to make it more usable (e.g. today keys are not scoped and are flattened out, so you need to repeat a.b.c.d over and over), then the format we end up with deviates from a Java/iOS properties file so much that its similarity may hurt us more than help us.

For example, extending the example above to:

copyright{translate=true locNode='use {currentYear} so that we don't have to change it every year'}=Copyright 1999-{currentYear}, PayPal, Inc.

would lead to parsing errors/unpredictable results and it wouldn't be obvious because this is a new format nobody is familiar with.

What I suggest is that we mock what the end game may look like and then see whether it's better adding more features to this format, coming up with a new one, or extending a different one (e.g. https://github.com/aseemk/json5)

aredridel commented 10 years ago

Yeah. I think we explicitly don't want JSON-like; json5 fixes some of the error-prone spots, but I don't think it brings anything useful to the table -- certainly not metadata; we'd have to invent that as a layer on top. If there's something specific we should use, by all means bring it up. I think we'll probably reject XML out of hand, because developers mostly hate it. I don't know of another format that's particularly suited, or has amazing tooling that we care about, beyond that.

I'm not sure what your extended example means or what the problem is there, other than that there's an unescaped apostrophe. We can define an unambiguous grammar here, though. This doesn't have to be dicey.

I think message format can be left orthogonal -- and decisions around it deferred -- at this point by accepting any unicode sequence as a value. I don't think there's any value in tying those parts together.

diegone commented 10 years ago

That was my point. If you can't figure out what's wrong in my example, the average developer unfamiliar with this custom format would be even more confused. The other mistake in my example is that the } is also unescaped and would inadvertently close the meta block. Nobody is trained to escape braces, so it's error-prone.

The value of adopting something like json or json5 is that everybody knows the rules of correct json or js syntax -- you don't need to explain that you need to escape braces. On top of it you get benefits like nested objects instead of repeated unscoped keys.

What is the argument for preferring an unfamiliar new format over something every js developer is already familiar with (like js syntax)?

By using something like json5 you have a rich-enough familiar syntax you can use to support new features later on. I don't think the messageformat discussion is orthogonal because tomorrow when you try to define how to represent a placeholder, it's not going to be "any unicode sequence" anymore and you're going to have to make breaking changes again and invent some new syntax and new escaping rules. All I'm saying is: can we settle on a syntax/format that is robust enough to support future requirements we already know we want?
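
For comparison, the same entry in something like json5 needs no new escaping rules, since it inherits JS string syntax (illustrative sketch only):

    {
        copyright: {
            value: 'Copyright 1999-{currentYear}, PayPal, Inc.',
            translate: true,
            locNote: "use {currentYear} so that we don't have to change it every year"
        }
    }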

aredridel commented 10 years ago

No, I can't figure it out because it hasn't been defined. ;-) Using the rules I'd proposed, the braces wouldn't close anything, because they're quoted.

The JSON family, though, doesn't give us a structure for metadata -- we'll have to impose a restriction on it to create a format.

I don't think we consider nested objects without the repetition a benefit. Certainly painful for merges.

The argument for a purpose-built format is to restrict it to the task at hand: the continuum from expressiveness to specificity has trade-offs. Choose a generic container like json and you have to restrict its use quite a lot to express what you mean; choose something even more generic, like bytestreams, and you must restrict it greatly to specify what you mean.

On the other end of the spectrum, tighten it too much -- like the existing .properties format -- and you can't express what's needed at all -- you end up recycling underused syntax for metadata, tacking it on in the margins.

rragan commented 10 years ago

I need to read through all of this but, for what it's worth, we have extended .properties to allow target metadata without breaking things for the general user. The syntax is:

contentElement.testTarget.@target[DE,AT,EBAY_DE,EBAY_AT]=Australian DE & AT target TestPrj

where the metadata is in the last element, with a name of @target and values in brackets following it. This format is transparently mapped to/from the 4cb representation.

aredridel commented 10 years ago

Yeah. Overloading the key is one of the options for doing this; that said, at a certain point, you start adding hard-to-preserve semantics. Consuming those with vanilla spud would be difficult, too. A post-processing step is needed in that case.

diegone commented 10 years ago

When you say "we have extended" is it implemented in some fork?

totherik commented 10 years ago

Again, I need to bring this discussion back on track.

If there is a proposal to use an alternate document format, please file it as a separate issue. As far as I can tell, that should merely involve writing a serializer for that particular format. Also, if the use of a properties-based format is a complete non-starter for adoption, a good place to capture that rationale is in the context of a proposed alternate solution or file format. It is not a good use of anyone's time to keep churning in this fashion.

Regarding this proposal, if there are scenarios that seem problematic with the proposed design, please propose alternative solutions. A general "this won't work due to this 2% use-case" is not constructive and has not added value to this discussion.

totherik commented 10 years ago

Apparently some comments were made while I was authoring my previous one and they're pertinent to this topic, so I want to make sure this isn't lost:

@rragan, per @diegone

> When you say "we have extended" is it implemented in some fork?

rragan commented 10 years ago

Proposal:

Same @xxxx as Erik's for document-level metadata/pragmas

For key-level metadata: @xxx[params].@yyy[params]..., where all such metadata appears at the end of the key, after the parts of the key. This keeps the key intact as an entity and further limits ambiguity by grouping all @ metadata at the end of the key. Examples:

hello=Hi
hello.@target[DE].@translate[no]=Guten Morgen

The syntax is consistent with existing general rules for tokens that can appear in the key. It provides a direct mapping from XML attributes to this format (and back).
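
A sketch of that mapping (hypothetical parse, not the actual converter):

    // Peel trailing .@name[params] segments off a key into metadata.
    var META_SEG = /\.@([A-Za-z]+)\[([^\]]*)\]/g;

    function splitKey(rawKey) {
        var meta = {};
        var key = rawKey.replace(META_SEG, function (_, name, params) {
            meta[name] = params.split(',');
            return '';
        });
        return { key: key, meta: meta };
    }

    // splitKey('hello.@target[DE].@translate[no]')
    // => { key: 'hello', meta: { target: ['DE'], translate: ['no'] } }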

We have a working converter from this format of property file to 4cb's and back again without loss of fidelity. See the ContentConverters repo (an internal repo so not generally accessible). It is in Java because L10N tools are in Java and want to go 4cb<->properties programmatically. Note that there is no support for the target concept in content yet. This work was a step down that path.

Note: Use of [ ]'s vs Erik's { }'s is no big deal. We tried to stay within the general token formats of the key and { } did not do that.

totherik commented 10 years ago

Great, thanks for the detail @rragan.