gordonbrander commented 2 years ago

This issue tracks user-authored key-value metadata in Subtext.

Background

Key-value metadata is a generally useful primitive (see "If headers did not exist, it would be necessary to invent them"). In our design discussions, we broadly identified two kinds of key-value metadata:

Machine metadata (tracked in #41). This includes things like cache-control headers, content-type flags, user agent strings, etc. This kind of metadata is often invisible to the user and authored by the program.
User metadata (tracked here in #19). Anything authored by the user for their own purposes.

Both of these use-cases can technically be supported by the same mechanism. However, machine headers are often visually noisy and potentially confusing. We believe it is valuable to have separate mechanisms for these two features so that the mess of machine headers can be hidden from the user.

Design goals

Goals:

Should be expressive enough that an app can reify its state as metadata, such that any internal caches, indexes, or databases can be regenerated from files.
Should provide a content-type mechanism
Ideally, plain text body content should be readable as plain text

These principles guide Subtext's design, and should frame our design solution:

Line-oriented: line-orientedness is a core design value for Subtext.
Simple to parse: prefer context-free and regular grammars, such as balanced brackets, that can be easily implemented through regular expressions, and require minimal backtracking.
What you would write anyway: as much as possible, we should aspire to choose syntax that is what you might write anyway. This is a delicate design goal that must be navigated by intuition. Examples of syntax like this are headers in email/HTTP, and blockquotes in Markdown.

Some possible approaches (not committing to any of these yet)...

HTTP-like

Title: Floop
Date: 2022-01-15
exotic-header: {"msg": "you can put anything in header body"}

Subtext content.

Tradeoffs:

Pro: It's what you would write anyway
Pro: Header keys are case-insensitive, so you can write them in whatever style reads best (Title, title, TITLE)
Pro: It works for email and the web
Pro: Header bodies are specified separately from header syntax. This makes headers a completely open-ended extension mechanism.
Pro: different applications can write to the same file without conflicts provided they use their own header fields (e.g. subconscious-meta, obsidian-meta)
Pro: Happens to fit with the way SQL thinks about JSON. Blobs of JSON belong to a specific column.
Con: Trends toward bespoke DSLs for each header.
- Pro: Although from the perspective of open-endedness, this is a pro.
- Synthesis: we could strongly encourage all headers to be one of a few types: String, Number, Bool, JSON. At the same time, it would be technically possible to do things beyond these types.
Con: Requires at least one header, or a line break at the beginning of the file
- Mitigation: you can sniff the first line and skip header parsing if it is not a header.
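
The synthesis above (encouraging String, Number, Bool, and JSON header bodies while leaving the format open) could be sketched roughly as follows. This is only an illustration: `coerce_header_value` and its rules for recognising each type are assumptions made for the example, not part of any Subtext specification.

```python
import json
import re

# Assumed number syntax for this sketch: optional sign, digits, optional fraction.
NUMBER_RE = re.compile(r"-?\d+(\.\d+)?")


def coerce_header_value(raw):
    """Interpret a header body as Bool, Number, or JSON, else keep it as a String."""
    text = raw.strip()
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    if NUMBER_RE.fullmatch(text):
        return float(text) if "." in text else int(text)
    if text.startswith(("{", "[")):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass  # Not valid JSON, so fall through and treat it as a plain string.
    return text
```

Anything that does not match one of the encouraged types stays a string, so a header with a bespoke body format still round-trips untouched.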

Note: HTTP obsoleted line folding. Why? See #40. Subtext should avoid line folding too, both to retain line-orientedness, and to avoid the footguns the web experienced with line folding.

Note: We also use HTTP machine headers in Noosphere's memo envelope. See #41.

HTTP-like in fences

---
Title: Floop
Date: 2022-01-15
exotic-header: {"msg": "you can put anything in header body"}
---
Subtext content.

Following Jekyll and many other static site generators. The --- fence allows headers to be omitted. If the first line isn't a fence, then there are no headers.

We could use a closing fence --- or an empty line to signify end of headers.
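
A minimal sketch of the fenced variant, assuming the opening fence must be the very first line and that either a closing --- or an empty line ends the headers. The function name `split_fenced_headers` and the choice to lowercase keys are illustrative assumptions.

```python
def split_fenced_headers(text):
    """Return (headers, body). With no opening fence, there are no headers."""
    lines = text.split("\n")
    if lines[0].strip() != "---":
        return {}, text
    headers = {}
    for index, line in enumerate(lines[1:], start=1):
        # Either a closing fence or an empty line ends the header block.
        if line.strip() in ("---", ""):
            return headers, "\n".join(lines[index + 1:])
        # Split on the first colon only, so header bodies may contain colons.
        key, separator, value = line.partition(":")
        if separator:
            headers[key.strip().lower()] = value.strip()
    return headers, ""
```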

Tradeoffs:

Pro: used in a lot of tools already
Con: mildly annoying to need fences
Con: closing fence is visually pretty, but feels redundant and less elegant than empty line break.

Special sigils

Following the rest of Subtext, keys could be prefixed by a sigil:

@ title: Floop
@ date: 2022-01-15
@ exotic-header: {"msg": "you can put anything in header body"}

Tradeoffs:

Pro: Consistent
Con: Redundant.
Interesting: you can pepper metadata throughout a document. Is this a pro or a con? It means you must parse the whole file.
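
To make the whole-file cost concrete, here is a rough sketch of collecting sigil-prefixed metadata. The exact grammar (an @, a space, a key, a colon) is an assumption taken from the example above, and `collect_sigil_metadata` is a hypothetical name.

```python
import re

# Assumed grammar for this sketch: "@ key: value".
SIGIL_RE = re.compile(r"^@ ([A-Za-z0-9_-]+):\s*(.*)$")


def collect_sigil_metadata(text):
    """Scan every line, since metadata may appear anywhere in the document."""
    metadata = []
    body_lines = []
    for line in text.split("\n"):
        match = SIGIL_RE.match(line)
        if match:
            metadata.append((match.group(1), match.group(2)))
        else:
            body_lines.append(line)
    return metadata, body_lines
```

Unlike the header-style approaches, there is no point at which this loop can stop early: the last line of the file might still be metadata.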

JSON

---
{
  "json": true
}
---

Subtext content.

Pro: does exactly what it says on the tin.
Pro: JSON is MAYA
Con: less open-ended than headers. Restricted to JSON semantics and types. (Could be construed as a pro for basic cases).
Con: JSON is a pain to write by hand for simple headers
Con: confusing for non-technical users.

Note this requires fences to be reasonable to parse. Don't want to embody the complexity of a JSON parser.
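
A sketch of why the fences help: the Subtext parser only has to find the two --- lines and can hand everything in between to an off-the-shelf JSON library. `split_json_frontmatter` is a hypothetical helper, and this version lets `json.loads` raise if the fenced text is malformed.

```python
import json


def split_json_frontmatter(text):
    """Return (metadata, body). Metadata is None when no fenced block is found."""
    lines = text.split("\n")
    if lines[0] != "---":
        return None, text
    try:
        # Look for the closing fence, starting after the opening one.
        end = lines.index("---", 1)
    except ValueError:
        return None, text
    metadata = json.loads("\n".join(lines[1:end]))
    return metadata, "\n".join(lines[end + 1:])
```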

gordonbrander commented 2 years ago

@cdata raises the point that there are two key kinds of metadata: machine metadata and user metadata. We don't necessarily want to mix machine metadata with user "content" metadata.

One option could be to have two sets of headers in source files, machine headers first and user headers second, separated by an empty line:

Content-Type: text/subtext
Title: Discoveries result from an accumulation of errors
Date: 2020-07-10 15:00
Resolved-Name: cdata,<ENCODED_PUBLIC_KEY>
Resolved-Link: cdata,cat-facts,<ENCODED_CID>
Resolved-Name: gordon,<ENCODED_PUBLIC_KEY>
Resolved-Link: gordon,oracular-insight,<ENCODED_CID>
Exotic-Header: {"msg": "you can put anything in header body"}

Author: McLuhan
Year: 1977

> All discoveries in art and science result from an accumulation of errors.

[[Marshall McLuhan]]

I could live with this, I guess.
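
For illustration, reading two header sets might look roughly like this. The sketch assumes both blocks are always present and that each ends at an empty line; `split_header_blocks` is a hypothetical name. Pairs are kept in a list rather than a dict because keys such as Resolved-Name repeat.

```python
def split_header_blocks(text, count=2):
    """Read `count` header blocks separated by empty lines, then the body."""
    lines = text.split("\n")
    blocks = []
    position = 0
    for _ in range(count):
        block = []
        while position < len(lines) and lines[position] != "":
            # Split on the first colon only, so values may contain colons.
            key, separator, value = lines[position].partition(":")
            if separator:
                block.append((key.strip(), value.strip()))
            position += 1
        blocks.append(block)
        position += 1  # Skip the empty line that closed this block.
    return blocks, "\n".join(lines[position:])
```

One caveat this makes visible: if a file omits the user block, the first paragraph of the body would be consumed as headers, so the format would need some rule for marking an empty block.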

bburns commented 1 year ago

Had to look up what line-folding in headers meant. From the link you gave,

Historically, HTTP header field values could be extended over multiple lines by preceding each extra line with at least one space or horizontal tab (obs-fold). This specification deprecates such line folding except within the message/http media type (Section 8.3.1). A sender MUST NOT generate a message that includes line folding (i.e., that has any field-value that contains a match to the obs-fold rule) unless the message is intended for packaging within the message/http media type.

But yeah it didn't say why, which would have been useful.

I do like yaml's multiline strings also,

foo: |
  bar
  baz

Although I didn't realize it had so many ways to specify them! https://stackoverflow.com/a/21699210

bburns commented 1 year ago

Regarding header metadata, I like the idea of separating the machine vs user headers by a blank line - that makes a lot of sense.

gordonbrander commented 1 year ago

Had to look up what line-folding in headers meant.

Some more background on line folding. After looking further into line folding and why it was deprecated in HTTP, we ended up deciding to follow HTTP and not to support line folding. Documenting here https://github.com/subconsciousnetwork/subtext/issues/40#issuecomment-1221551849

gordonbrander commented 1 year ago

From @bburns in https://github.com/subconsciousnetwork/subtext/issues/38#issue-1345375255

Just adding this as a possible alternative to header keyvalue pairs, as discussed in #19.

That link mentions @ as a possible sigil, ie

@ foo: bar

Some alternatives -

.foo bar
.foo=bar
.foo: bar

I like this syntax as it's like property assignment in oo, and @ seems more appropriate for other uses.

For my project Neomem, I had been planning to just parse out any plain 'key: value' lines and treat them as metadata. Each item has a text representation similar to markdown, corresponding to a record in a database.

There's also 'key:: value' as in obsidian - I don't like that syntax though.

gordonbrander commented 1 year ago

From @cdata in https://github.com/subconsciousnetwork/subtext/issues/38#issuecomment-1221461890

I'm going to throw in some of our chat transcript for additional context:

At this time, there is really only one kind of data in Noosphere: subtext. Adding a new kind of data is technically easy, but it carries the trade-off that your new content type may not be well supported by other clients (much like serving arbitrary content types over the web doesn't guarantee that a web browser can view that content). That said, it is reasonable to assume that Subconscious (our Noosphere client) will support first-class rendering for common structured data formats such as JSON, CSV etc. And, we hope to explore ways to make new data types automatically legible for clients using some WASM magic (but this is a speculative and only partially formed idea, so I don't want to leave you with the impression that we have a great feature ready for you to use).

@gordon has been working on a key-value header syntax that may be suitable for what you are trying to do with Neomem. The idea is to offer a feature similar to markdown's front matter, or HTML's tag so that an author may configure header metadata from content.

Yet another way to think about it would be as a subtext "block." Block is sadly going to be a very overloaded term in our technical domain, but in the subtext context you can think of a block as being made up of the contents of a line in the subtext file. Now, subtext does not have a block type for key/value data (maybe it should). But, that doesn't necessarily stop you from interpreting any given block type that way. After all, block content is really just text that we are tagging with some inferred semantics as we parse it.

gordonbrander commented 1 year ago

From @bmann in https://github.com/subconsciousnetwork/subtext/issues/38#issuecomment-1221466556

LogSeq also does key:: value

Jekyll front matter is yaml, mostly seen as key: value.

For TiddlyWiki, I can create arbitrary custom fields name -> value on any item.

But I would say there is a lot of complexity down this path that might better be tied to the programmability of geists? E.g., can we include a Geist (with whatever that syntax is) and the simplest Geist might encapsulate both custom fields AND know what to do with them?

Anti-pattern example: for Jekyll or TiddlyWiki, without extensive template / display layer programming and custom theme, these custom untyped data types don’t survive past a single user.

(I’d love an example where this is not the case — because I’d love more lightly structured entities floating about)

+1 for not using @ — it has become a public good UI element that mostly means social entities (users or organizations).

bburns commented 1 year ago

I just noticed you had a kv.md file with some more notes - https://github.com/subconsciousnetwork/subtext/blob/main/explorations/kv.md. I'll include the contents here in case you'd prefer everything in one place -

Key-value blocks

We could explore expanding Subtext to support markup for key-value pairs.

Q: What is Subtext?
A: Subtext is a markup language for note-taking.

A key-value block is any alphanumeric string followed by a :. The alphanumeric string before the : becomes the sigil type for the block.

Sigil, described as a regular expression:

^[a-zA-Z0-9_]+:\s

Key-value pairs are a fundamental primitive with a wide range of potential use-cases for tooling. Like any other type of block, key-value blocks could be gathered by key into lists, concatenated, or collected using a first- or last-key-wins to get simple key/value data.

You could execute queries such as: “list all questions (Q: blocks) in my notes”.
You could transform a collection of notes into a sparse table by treating each note as a row, and treating keys as columns. Denser tabular data can be had by filtering notes to only include those with a particular set of keys, and then concatenating, JSON-encoding, or dropping duplicate keys. Tada! CSV.
You could include headmatter in the body of a note. This can make it easier to integrate notes with static site generators, or other tools.
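
The sparse-table idea can be sketched as below, assuming each note has already been reduced to a dict of key-value pairs (for example with last-key-wins). `notes_to_csv` is a hypothetical helper built on Python's standard `csv` module.

```python
import csv
import io


def notes_to_csv(notes):
    """Treat each note as a row and the union of all keys as the columns."""
    columns = []
    for note in notes:
        for key in note:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval fills the cells for keys that a given note does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(notes)
    return buffer.getvalue()
```

Filtering `notes` down to those that share a particular set of keys before calling the function would give the denser table described above.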

Open question: what are the implications for parsing? It would require us to run a search across a string for an unbounded number of characters, until we encounter a space character, before defining the block as a text block. That means this search must happen to every block before it can be found to be a text block. Is this a problem in practice? Are there ways we could simplify this algorithmically?

Alternatives

@Q What is Subtext
@A Subtext is a markup language for note-taking

or

$Q What is Subtext
$A Subtext is a markup language for note-taking

Pros: can determine block type based on first character.

Cons: less natural to type.

My thoughts -

I like the idea of regexp sigils. One possibility for parsing keys would be to limit the length so the search is bounded - e.g. 255 characters -

^[a-zA-Z0-9_]{1,255}:\s

Regexps would also allow sigils like ^[-]{3,}\s to indicate a horizontal line.
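
A small sketch of block classification using the two regexps above, unchanged. The `classify_block` function and its tuple return values are assumptions for the example. Both patterns require a whitespace character after the sigil, so the sketch expects each line to keep its trailing newline.

```python
import re

KEY_VALUE_RE = re.compile(r"^[a-zA-Z0-9_]{1,255}:\s")
RULE_RE = re.compile(r"^[-]{3,}\s")


def classify_block(line):
    """Classify one line, which should still include its trailing newline."""
    if KEY_VALUE_RE.match(line):
        key, _, value = line.partition(":")
        return ("kv", key, value.strip())
    if RULE_RE.match(line):
        return ("rule",)
    # Anything else, including over-long keys and URLs, falls back to text.
    return ("text", line.rstrip("\n"))
```

The required whitespace after the colon is what keeps a line such as https://example.com from being read as a key-value block.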

gordonbrander commented 1 year ago

Q: What is Subtext?
A: Subtext is a markup language for note-taking.

One thing I like about the HTTP-header like approach is that it front-loads metadata. Metadata ends at first empty line, so if you just want metadata, you can stop pulling lines in a streaming parser after the first empty line.

A practical parsing issue with embedding header-like KV—other sigils are static characters, but unprefixed KV sigil would be dynamic. This requires lookahead and backtracking when parsing. So you would have to first begin parsing any line that might be KV, then re-parse if it is not. This wasn't obvious to me when I sketched out the exploration above.

The case with header parsing is similar, but simpler.

Parse first line.
- If it is a valid header, parse lines as headers until the first empty line. Discard any malformed header lines.
- If it is not a valid header, backtrack and re-parse as body part.

So the backtracking logic is simplified to just sniffing out the first line.
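
The steps above can be sketched as follows. The header grammar in `HEADER_RE` and the function name `split_headers` are assumptions for the example; the control flow (sniff the first line, read until the first empty line, discard malformed lines) follows the list above.

```python
import re

# Assumed header grammar for this sketch: a token key, a colon, then the value.
HEADER_RE = re.compile(r"^([A-Za-z0-9_-]+):\s*(.*)$")


def split_headers(text):
    """Return (headers, body), where headers is a list of (key, value) pairs."""
    lines = text.split("\n")
    # Sniff the first line: if it is not a header, the whole text is body.
    if not HEADER_RE.match(lines[0]):
        return [], text
    headers = []
    for index, line in enumerate(lines):
        if line == "":
            # The first empty line ends the headers; the rest is body.
            return headers, "\n".join(lines[index + 1:])
        match = HEADER_RE.match(line)
        if match:
            # Lowercase keys so that lookups are case-insensitive.
            headers.append((match.group(1).lower(), match.group(2)))
        # Malformed header lines are silently discarded.
    return headers, ""
```

A streaming reader could apply the same logic one line at a time and stop at the first empty line when only the metadata is wanted. Note the tradeoff: a note whose first line happens to look like `Word: text` is treated as having a header.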

subconsciousnetwork / subtext

RFC for user metadata in Subtext #19
