subconsciousnetwork / subtext

Markup for note taking
Apache License 2.0

RFC for user metadata in Subtext #19

Open gordonbrander opened 2 years ago

gordonbrander commented 2 years ago

This issue tracks user-authored key-value metadata in Subtext.

Background

Key-value metadata is a generally useful primitive (see "If headers did not exist it would be necessary to invent them"). In our design discussions, we broadly identified two kinds of key-value metadata: machine metadata and user-authored metadata.

Both of these use-cases can technically be supported by the same mechanism. However, machine headers are often visually noisy and potentially confusing. We believe it is valuable to have separate mechanisms for these two features so that the mess of machine headers can be hidden from the user.

This issue tracks user-authored key-value metadata in Subtext.

Design goals

Goals:

These principles guide Subtext's design, and should frame our design solution:

Some possible approaches (not committing to any of these yet)...

HTTP-like

Title: Floop
Date: 2022-01-15
exotic-header: {"msg": "you can put anything in header body"}

Subtext content.

Tradeoffs:

Note: HTTP obsoleted line folding. Why? See #40. Subtext should avoid line folding too, both to retain line-orientedness, and to avoid the footguns the web experienced with line folding.

Note: We also use HTTP machine headers in Noosphere's memo envelope. See #41.

HTTP-like in fences

---
Title: Floop
Date: 2022-01-15
exotic-header: {"msg": "you can put anything in header body"}
---
Subtext content.

This follows Jekyll and many other static site generators. The --- fence allows headers to be omitted: if the first line isn't a fence, then there are no headers.

We could use a closing fence --- or an empty line to signify end of headers.
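A minimal Python sketch of the fenced variant, to make the "sniff the first line" rule concrete. The function name `split_fenced_headers` and the choice to treat an unclosed fence as "no headers" are assumptions for illustration, not part of any Subtext specification. This version ends the headers at a closing fence only.

```python
FENCE = "---"

def split_fenced_headers(text):
    """Return (headers, body); headers is a list of (key, value) pairs."""
    lines = text.split("\n")
    # If the first line isn't a fence, there are no headers.
    if lines[0].strip() != FENCE:
        return [], text
    headers = []
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == FENCE:
            # Closing fence: everything after it is Subtext content.
            return headers, "\n".join(lines[i + 1:])
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            headers.append((key.strip(), value.strip()))
    # Assumption: an opening fence that is never closed means no headers.
    return [], text
```

Headers are kept as a list of pairs rather than a dict so that repeated keys survive parsing.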

Tradeoffs:

Special sigils

Following the rest of Subtext, keys could be prefixed by a sigil:

@ title: Floop
@ date: 2022-01-15
@ exotic-header: {"msg": "you can put anything in header body"}
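As a rough sketch of how a sigil-prefixed line could be recognized, assuming the sigil is exactly `@` followed by one space (the helper name `parse_sigil_line` is hypothetical):

```python
def parse_sigil_line(line):
    """Return (key, value) for an '@ key: value' line, otherwise None."""
    # The block type is decided by the fixed two-character prefix.
    if not line.startswith("@ "):
        return None
    # Split on the first colon only, so values may contain colons.
    key, sep, value = line[2:].partition(":")
    if not sep or not key.strip():
        return None
    return key.strip(), value.strip()
```

Because the prefix is static, a line that does not start with it can be rejected without scanning further.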

Tradeoffs:

JSON

---
{
  "json": true
}
---

Subtext content.

Note that this requires fences to be reasonable to parse. We don't want the Subtext parser itself to embody the complexity of a JSON parser.
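One way to read the note above: the fences let the Subtext side find the metadata span with plain line matching, and hand the span to an existing JSON parser. A hedged Python sketch, where `split_json_front_matter` is a made-up name and the "no closing fence means no metadata" rule is an assumption:

```python
import json

def split_json_front_matter(text):
    """Return (metadata, body); metadata is None when there is no fenced block."""
    lines = text.split("\n")
    stripped = [line.strip() for line in lines]
    if stripped[0] != "---":
        return None, text
    try:
        # Look for the closing fence, starting after the opening one.
        end = stripped.index("---", 1)
    except ValueError:
        return None, text
    # Only the fenced span is handed to the JSON parser.
    metadata = json.loads("\n".join(lines[1:end]))
    return metadata, "\n".join(lines[end + 1:])
```

The fence search never has to understand JSON nesting or string escaping.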

gordonbrander commented 2 years ago

@cdata raises the point that there are two key kinds of metadata: machine metadata and user metadata. We don't necessarily want to mix machine metadata with user "content" metadata.

One option could be to have two sets of headers in source files:

Content-Type: text/subtext
Title: Discoveries result from an accumulation of errors
Date: 2020-07-10 15:00
Resolved-Name: cdata,<ENCODED_PUBLIC_KEY>
Resolved-Link: cdata,cat-facts,<ENCODED_CID>
Resolved-Name: gordon,<ENCODED_PUBLIC_KEY>
Resolved-Link: gordon,oracular-insight,<ENCODED_CID>
Exotic-Header: {"msg": "you can put anything in header body"}

Author: McLuhan
Year: 1977

> All discoveries in art and science result from an accumulation of errors.

[[Marshall McLuhan]]

I could live with this, I guess.
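A sketch of how the two-block layout might be read, in Python. It assumes each header block ends at the first empty line and that a line without a colon also ends the block. The names `parse_header_block` and `parse_memo` are illustrative only.

```python
def parse_header_block(lines, start):
    """Read 'Key: value' lines from lines[start:]; return (pairs, next_index)."""
    pairs = []
    i = start
    while i < len(lines) and lines[i].strip():
        key, sep, value = lines[i].partition(":")
        if not sep:
            break  # not a header line, so the block ends here
        pairs.append((key.strip(), value.strip()))
        i += 1
    # Skip the blank separator line, if there is one.
    if i < len(lines) and not lines[i].strip():
        i += 1
    return pairs, i

def parse_memo(text):
    """Return (machine_headers, user_headers, body)."""
    lines = text.split("\n")
    machine, i = parse_header_block(lines, 0)
    user, i = parse_header_block(lines, i)
    return machine, user, "\n".join(lines[i:])
```

Pairs are kept in a list because keys such as Resolved-Name repeat. Note the ambiguity this sketch does not resolve: if the user block is omitted and the first content line contains a colon, that line would be read as a user header.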

bburns commented 1 year ago

Had to look up what line-folding in headers meant. From the link you gave,

Historically, HTTP header field values could be extended over multiple lines by preceding each extra line with at least one space or horizontal tab (obs-fold). This specification deprecates such line folding except within the message/http media type (Section 8.3.1). A sender MUST NOT generate a message that includes line folding (i.e., that has any field-value that contains a match to the obs-fold rule) unless the message is intended for packaging within the message/http media type.

But yeah it didn't say why, which would have been useful.

I do like YAML's multiline strings too:

foo: |
  bar
  baz

Although I didn't realize it had so many ways to specify them! https://stackoverflow.com/a/21699210

bburns commented 1 year ago

Regarding header metadata, I like the idea of separating the machine vs user headers by a blank line - that makes a lot of sense.

gordonbrander commented 1 year ago

Had to look up what line-folding in headers meant.

Some more background on line folding. After looking further into line folding and why it was deprecated in HTTP, we ended up deciding to follow HTTP and not to support line folding. Documenting here https://github.com/subconsciousnetwork/subtext/issues/40#issuecomment-1221551849
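For illustration, a parser that does not support line folding could reject folded lines outright. In the obs-fold rule quoted earlier, a continuation line starts with a space or horizontal tab, so a sketch in Python might look like this (`check_no_line_folding` is a hypothetical helper, and raising an error rather than ignoring the line is an assumption):

```python
def check_no_line_folding(header_lines):
    """Raise ValueError if any header line looks like a folded continuation."""
    for number, line in enumerate(header_lines, start=1):
        # A folded continuation line begins with a space or a horizontal tab.
        if line[:1] in (" ", "\t"):
            raise ValueError(f"line {number}: folded header lines are not supported")
```

With this check in place, every header stays on exactly one line, which keeps the format line-oriented.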

gordonbrander commented 1 year ago

From @bburns in https://github.com/subconsciousnetwork/subtext/issues/38#issue-1345375255

Just adding this as a possible alternative to header keyvalue pairs, as discussed in #19.

That link mentions @ as a possible sigil, i.e.

@ foo: bar

Some alternatives -

.foo bar
.foo=bar
.foo: bar

I like this syntax as it's like property assignment in OO, and @ seems more appropriate for other uses.

For my project Neomem, I had been planning to just parse out any plain 'key: value' lines and treat them as metadata. Each item has a text representation similar to markdown, corresponding to a record in a database.

There's also 'key:: value' as in Obsidian - I don't like that syntax though.

gordonbrander commented 1 year ago

From @cdata in https://github.com/subconsciousnetwork/subtext/issues/38#issuecomment-1221461890

I'm going to throw in some of our chat transcript for additional context:

At this time, there is really only one kind of data in Noosphere: subtext. Adding a new kind of data is technically easy, but it carries the trade-off that your new content type may not be well supported by other clients (much like serving arbitrary content types over the web doesn't guarantee that a web browser can view that content). That said, it is reasonable to assume that Subconscious (our Noosphere client) will support first-class rendering for common structured data formats such as JSON, CSV etc. And, we hope to explore ways to make new data types automatically legible for clients using some WASM magic (but this is a speculative and only partially formed idea, so I don't want to leave you with the impression that we have a great feature ready for you to use).

@gordon has been working on a key-value header syntax that may be suitable for what you are trying to do with Neomem. The idea is to offer a feature similar to Markdown's front matter, or HTML's <meta> tag, so that an author may configure header metadata from content.

Yet another way to think about it would be as a subtext "block." Block is sadly going to be a very overloaded term in our technical domain, but in the subtext context you can think of a block as being made up of the contents of a line in the subtext file. Now, subtext does not have a block type for key/value data (maybe it should). But, that doesn't necessarily stop you from interpreting any given block type that way. After all, block content is really just text that we are tagging with some inferred semantics as we parse it.

gordonbrander commented 1 year ago

From @bmann in https://github.com/subconsciousnetwork/subtext/issues/38#issuecomment-1221466556

LogSeq also does key:: value

Jekyll front matter is yaml, mostly seen as key: value.

For TiddlyWiki, I can create arbitrary custom fields name -> value on any item.

But I would say there is a lot of complexity down this path that might better be tied to the programmability of geists? E.g., can we include a Geist (with whatever that syntax is) and the simplest Geist might encapsulate both custom fields AND know what to do with them?

Anti-pattern example: for Jekyll or TiddlyWiki, without extensive template / display layer programming and custom theme, these custom untyped data types don’t survive past a single user.

(I’d love an example where this is not the case — because I’d love more lightly structured entities floating about)

+1 for not using @ — it has become a public good UI element that mostly means social entities (users or organizations).

bburns commented 1 year ago

I just noticed you had a kv.md file with some more notes - https://github.com/subconsciousnetwork/subtext/blob/main/explorations/kv.md. I'll include the contents here in case you'd prefer everything in one place -


Key-value blocks

We could explore expanding Subtext to support markup for key-value pairs.

Q: What is Subtext?
A: Subtext is a markup language for note-taking.

A key-value block is any alphanumeric string followed by a :. The alphanumeric string before the : becomes the sigil type for the block.

Sigil, described as a regular expression:

^[a-zA-Z0-9_]+:\s

Key-value pairs are a fundamental primitive with a wide range of potential use-cases for tooling. Like any other type of block, key-value blocks could be gathered by key into lists, concatenated, or collected using a first- or last-key-wins to get simple key/value data.
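A minimal Python sketch of the operations just described, using the sigil regular expression from above with a capture group added around the key. The function names are assumptions for illustration.

```python
import re

# The sigil pattern from the exploration, with the key captured.
KV_SIGIL = re.compile(r"^([a-zA-Z0-9_]+):\s")

def parse_kv_blocks(text):
    """Return (key, value) pairs for every line that starts with a KV sigil."""
    pairs = []
    for line in text.split("\n"):
        match = KV_SIGIL.match(line)
        if match:
            pairs.append((match.group(1), line[match.end():].strip()))
    return pairs

def gather(pairs):
    """Gather values by key into lists, preserving document order."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped

def first_key_wins(pairs):
    """Keep the first value seen for each key."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result

def last_key_wins(pairs):
    """Keep the last value seen for each key."""
    return dict(pairs)
```

All three views are derived from the same ordered list of pairs, so a tool can pick whichever policy suits it without re-parsing.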

Open question: what are the implications for parsing? It would require us to run a search across a string for an unbounded number of characters, until we encounter a space character, before defining the block as a text block. That means this search must happen to every block before it can be found to be a text block. Is this a problem in practice? Are there ways we could simplify this algorithmically?

Alternatives

@Q What is Subtext
@A Subtext is a markup language for note-taking

or

$Q What is Subtext
$A Subtext is a markup language for note-taking

Pros: can determine block type based on first character.

Cons: less natural to type.


My thoughts -

I like the idea of regexp sigils. One possibility for parsing keys would be to limit the length so the search is bounded - e.g. 255 characters -

^[a-zA-Z0-9_]{1,255}:\s

Regexps would also allow sigils like ^[-]{3,}\s to indicate a horizontal line.
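A quick Python sketch of the two patterns above used as a line classifier. The `block_type` function and its return labels are hypothetical. Checking the horizontal-rule pattern first is a design choice here, not something the patterns require.

```python
import re

BOUNDED_KV = re.compile(r"^[a-zA-Z0-9_]{1,255}:\s")
HORIZONTAL_RULE = re.compile(r"^[-]{3,}\s")

def block_type(line):
    """Classify a line as 'rule', 'kv', or 'text' using the regex sigils."""
    if HORIZONTAL_RULE.match(line):
        return "rule"
    if BOUNDED_KV.match(line):
        return "kv"
    return "text"
```

One detail worth noting: both patterns end in `\s`, so they need a whitespace character after the sigil. A bare `---` with its newline already stripped is classified as text by this sketch, while `---\n` is classified as a rule.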

gordonbrander commented 1 year ago

Q: What is Subtext?
A: Subtext is a markup language for note-taking.

One thing I like about the HTTP-header like approach is that it front-loads metadata. Metadata ends at first empty line, so if you just want metadata, you can stop pulling lines in a streaming parser after the first empty line.
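To illustrate the streaming point, here is a small Python sketch that reads headers from any line iterator and stops at the first empty line. `read_headers` is an illustrative name, and skipping lines without a colon is an assumption.

```python
import io

def read_headers(stream):
    """Consume lines from stream up to and including the first empty line."""
    headers = []
    for line in stream:
        line = line.rstrip("\r\n")
        if line == "":
            break  # metadata ends here; the body is never read
        key, sep, value = line.partition(":")
        if sep:
            headers.append((key.strip(), value.strip()))
    return headers

stream = io.StringIO("Title: Floop\nDate: 2022-01-15\n\nSubtext content.\n")
headers = read_headers(stream)
rest = stream.readline()  # the body is still waiting in the stream
```

The parser never touches the content lines, so a tool that only wants metadata does work proportional to the header size.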

A practical parsing issue with embedding header-like KV: other sigils are static characters, but an unprefixed KV sigil would be dynamic. This requires lookahead and backtracking when parsing, so you would have to first begin parsing any line that might be KV, then re-parse it if it is not. This wasn't obvious to me when I sketched out the exploration above.
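A sketch of that difference in Python. The set of static sigils below is an assumption for illustration, not Subtext's actual list. The function returns the block type together with the number of characters it had to examine before deciding.

```python
import string

# Assumed static sigils, for illustration only.
STATIC_SIGILS = {"#": "heading", ">": "quote", "-": "list"}
KEY_CHARS = set(string.ascii_letters + string.digits + "_")

def classify(line):
    """Return (block_type, characters_examined_before_deciding)."""
    first = line[:1]
    if first in STATIC_SIGILS:
        # Static sigil: the first character alone decides the block type.
        return STATIC_SIGILS[first], 1
    # Dynamic KV sigil: look ahead over key characters.
    i = 0
    while i < len(line) and line[i] in KEY_CHARS:
        i += 1
    if i > 0 and line[i:i + 1] == ":" and line[i + 1:i + 2].isspace():
        return "kv", i
    # Backtrack: everything scanned so far belongs to a plain text block.
    return "text", i
```

For a static sigil the cost is always one character. For a text line, the scan runs to the end of the first word before the parser learns that the line is not KV.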

The case with header parsing is similar, but simpler.

So the backtracking logic is simplified to just sniffing out the first line.