Epic: Improve type system

marcusolsson commented 2 years ago

This issue gathers discussions and issues related to the Projects type system. The plugin was released with a rather simple type system that needs some improvements. Since the type system is a cornerstone of the Projects plugin, designing it well is crucial for the success of Projects.

Rationale

Notes are just text. Text can be incredibly powerful as it can contain pretty much anything, and users can use whichever editor they want to edit it. Unfortunately, the downside of this power is that you can't make any assumptions about pure text.

The purpose of the Projects type system is to extract well-known types of data from pure text so that views can act as if they were talking to a database.

For example, if Projects finds the text publish: true between a pair of --- delimiters, it detects publish as a boolean field (true/false).

By defining a piece of text to have a type, we can also define operations on those types. Boolean fields can be toggled on and off. Numbers can be used for mathematical operations.
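A minimal sketch of this idea, assuming values arrive as plain text. The names FieldType, detectType, and toggle are hypothetical and are not taken from the plugin's code.

```typescript
// Hypothetical names for illustration; not the plugin's actual API.
type FieldType = "boolean" | "number" | "string";

// Detect a well-known type from a piece of plain text.
function detectType(raw: string): FieldType {
  const text = raw.trim();
  if (text === "true" || text === "false") return "boolean";
  // Empty text is not a number, even though Number("") is 0.
  if (text !== "" && isFinite(Number(text))) return "number";
  return "string";
}

// Once a value is typed, operations can be defined on it.
// A boolean field can be toggled without any text parsing.
function toggle(value: boolean): boolean {
  return !value;
}
```

With this sketch, the text true from publish: true is detected as a boolean, and a view could call toggle on the parsed value.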

Since they no longer need to care about parsing text, views become easier to build. And since they have a shared understanding of the data, the user can define the data once, and view it in many different ways.

Design

The type system is the same regardless of where the data came from. In Projects, data sources act as translation layers (or adapters) that convert data into the shared data format.

For example, the front matter data source knows how to convert from YAML front matter, and the Dataview data source knows how to convert the Dataview result into the shared format.
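The adapter idea could be sketched roughly as follows. All names here (DataValue, DataRecord, DataSource, frontMatterSource, tableSource) are assumptions for illustration. In particular, TableResult is a made-up shape for a tabular query result and is not the Dataview API, and the front matter is assumed to be parsed from YAML already.

```typescript
// Shared format that every view consumes (hypothetical shape).
type DataValue = string | number | boolean | null;

interface DataRecord {
  id: string;
  values: Record<string, DataValue>;
}

// A data source is an adapter from some input into the shared format.
interface DataSource<TInput> {
  toRecords(input: TInput): DataRecord[];
}

// Keep primitives as they are; store anything else as JSON text in this sketch.
function normalize(value: unknown): DataValue {
  if (value === null || value === undefined) return null;
  if (typeof value === "string" || typeof value === "number" || typeof value === "boolean") {
    return value;
  }
  return JSON.stringify(value);
}

// Input: already-parsed front matter, keyed by note path (assumed).
type ParsedNotes = Record<string, Record<string, unknown>>;

const frontMatterSource: DataSource<ParsedNotes> = {
  toRecords(notes) {
    return Object.keys(notes).map((path) => {
      const values: Record<string, DataValue> = {};
      Object.keys(notes[path]).forEach((key) => {
        values[key] = normalize(notes[path][key]);
      });
      return { id: path, values };
    });
  },
};

// Input: a made-up tabular result with headers and rows (not the Dataview API).
interface TableResult {
  headers: string[];
  rows: unknown[][];
}

const tableSource: DataSource<TableResult> = {
  toRecords(table) {
    return table.rows.map((row, index) => {
      const values: Record<string, DataValue> = {};
      table.headers.forEach((header, column) => {
        values[header] = normalize(row[column]);
      });
      return { id: String(index), values };
    });
  },
};
```

Both adapters return the same DataRecord shape, so a view written against DataRecord does not need to know which source produced the data.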

Challenges

Fields have multiple types

Consider the following two notes:

publish: true

publish: 3

When the data source reads the first note, it detects publish as a boolean field. When it reads the second note, it detects publish as a numeric field. Since the two notes disagree, the type system can't settle on either type and needs to fall back to a common denominator. Hence, publish is detected as a String field.
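The fallback described above could look something like this sketch. The function name detectFieldType is hypothetical, and the values are assumed to be parsed already.

```typescript
// Hypothetical names for illustration; not the plugin's actual API.
type FieldType = "boolean" | "number" | "string";
type Value = boolean | number | string;

function typeOfValue(value: Value): FieldType {
  if (typeof value === "boolean") return "boolean";
  if (typeof value === "number") return "number";
  return "string";
}

// Detect one type for a field from its values across all notes.
// As soon as two notes disagree, fall back to the common denominator: string.
function detectFieldType(values: Value[]): FieldType {
  let detected: FieldType | undefined;
  for (let i = 0; i < values.length; i++) {
    const current = typeOfValue(values[i]);
    if (detected === undefined) {
      detected = current;
    } else if (detected !== current) {
      return "string";
    }
  }
  // No values at all: string is used here as well (an assumption of this sketch).
  return detected === undefined ? "string" : detected;
}
```

For the two notes above, detectFieldType([true, 3]) yields "string", while notes that agree keep their specific type.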

Related issues:

71
100
183

Missing type information

If a field is empty (null) in all notes, there's nothing we can use to detect the type. In this case, the plugin should fall back to a String field.

publish:

publish:
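A small sketch of how empty values might be handled. The fallback to String when every value is null comes from the text above. Ignoring null when other notes do carry a value is an extra assumption of this sketch, and detectWithNulls is a hypothetical name.

```typescript
// Hypothetical names for illustration; not the plugin's actual API.
type FieldType = "boolean" | "number" | "string";
type RawValue = boolean | number | string | null;

function detectWithNulls(values: RawValue[]): FieldType {
  // Empty (null) values carry no type information, so set them aside.
  const present = values.filter(
    (value): value is boolean | number | string => value !== null
  );
  // Nothing left to inspect: fall back to a String field.
  if (present.length === 0) return "string";
  const first = typeof present[0];
  const allSame = present.every((value) => typeof value === first);
  return allSame ? (first as FieldType) : "string";
}
```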

Complex types

Notes may also contain complex types. For example, the Kindle plugin adds the following front matter:

kindle-sync:
  bookId: '3759'
  title: Design for How People Learn (Voices That Matter)
  author: Dirksen Julie
  asin: B018OJP5QW
  lastAnnotatedDate: '2022-05-06'
  bookImageUrl: 'https://m.media-amazon.com/images/I/51M0cX78XnL._SY160.jpg'
  highlightsCount: 1

Projects should not attempt to edit fields it doesn't understand. Ideally, I'd like to detect these as nested types, that is, types that can contain other types. That way, you could still build a Kindle view that understands the field value. Other views might not even display them.
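One possible way to model nested types is a recursive type description. This is only a sketch: the FieldType union, detectNested, and isEditable are hypothetical, and treating nested fields as read-only in generic views is an assumption drawn from the "should not attempt to edit" remark.

```typescript
// A type that can contain other types (hypothetical shape).
type FieldType =
  | { kind: "boolean" }
  | { kind: "number" }
  | { kind: "string" }
  | { kind: "nested"; fields: Record<string, FieldType> };

// Detect the type of an already-parsed value, recursing into objects.
function detectNested(value: unknown): FieldType {
  if (typeof value === "boolean") return { kind: "boolean" };
  if (typeof value === "number") return { kind: "number" };
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    const source = value as Record<string, unknown>;
    const fields: Record<string, FieldType> = {};
    Object.keys(source).forEach((key) => {
      fields[key] = detectNested(source[key]);
    });
    return { kind: "nested", fields };
  }
  // Everything else falls back to string in this sketch.
  return { kind: "string" };
}

// Assumption: generic views leave nested fields alone,
// while a dedicated view (such as a Kindle view) can read `fields`.
function isEditable(type: FieldType): boolean {
  return type.kind !== "nested";
}
```

Applied to the kindle-sync value above, highlightsCount would be detected as a number inside a nested type, and the nested field as a whole would not be editable by generic views.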

Repeated fields

Consider the following examples where values is a field containing multiple values that may or may not share the same type.

values:
  - 10
  - 20
  - 30

values:
  - 12
  - true
  - [[Hello]]

I propose that we introduce a repeated property for each field. The type of the field is detected as if the repeated values came from separate notes. For example, in the first example, values would be detected as a repeated number, and in the second example, it would be a repeated string (because of falling back to string).
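The proposal could be sketched like this, with hypothetical names (DetectedField, detectField) and values assumed to be parsed already. The wiki link is assumed to arrive as text.

```typescript
// Hypothetical names for illustration; not the plugin's actual API.
type BaseType = "boolean" | "number" | "string";

interface DetectedField {
  type: BaseType;
  repeated: boolean;
}

function baseTypeOf(value: unknown): BaseType {
  if (typeof value === "boolean") return "boolean";
  if (typeof value === "number") return "number";
  return "string";
}

// Treat the items of a list as if they came from separate notes:
// if they all share a type, keep it; otherwise fall back to string.
function detectField(value: unknown): DetectedField {
  const repeated = Array.isArray(value);
  const items: unknown[] = Array.isArray(value) ? value : [value];
  let type: BaseType = items.length > 0 ? baseTypeOf(items[0]) : "string";
  for (let i = 1; i < items.length; i++) {
    if (baseTypeOf(items[i]) !== type) {
      type = "string";
      break;
    }
  }
  return { type, repeated };
}
```

Under these assumptions, the first example is detected as a repeated number and the second as a repeated string.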

Rich-text support

In the following example, all of these front matter properties would be detected as text (string) fields:

---
math: $x^2$
html: This has <strong>HTML</strong> in it
markdown: This is **Markdown** formatted text
---

# My note

Should these be implemented as field types separate from the String field type (for example, HTML fields and Markdown fields), or should the current String type support parsing of these formats?
With separate field types, detecting them automatically would be tricky, so the user would likely have to specify which String fields should be parsed in a certain way.
If we instead assume that every cell of a String field can contain Markdown, HTML, or LaTeX, parsing will likely cause a non-trivial performance hit on projects with a lot of notes.

Related issues:

66
64

Fields derived from Markdown content

The type system should support Markdown elements, such as tasks. Ideally, I'd like to find a way to avoid creating specific types for Markdown content, but rather reuse the standard types.

## Heading

- [ ] Do this
- [ ] Do that
- [x] Already did that

## Another heading

- [ ] Do this
- [ ] Do that
- [x] Already did that

Markdown elements could be represented as repeated fields. For example, tasks could be typed as repeated boolean fields and headings as repeated string fields.
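As a rough sketch of that representation, tasks and headings could be pulled out of the note body with simple line patterns. The names TaskItem, extractTasks, and extractHeadings are hypothetical, and the patterns only cover the basic syntax shown above, not the full Markdown grammar.

```typescript
// Hypothetical names for illustration; not the plugin's actual API.
interface TaskItem {
  label: string;
  done: boolean;
}

// Tasks become a repeated boolean field (each item keeps its label).
function extractTasks(markdown: string): TaskItem[] {
  const pattern = /^\s*[-*+] \[( |x|X)\] (.+)$/;
  const tasks: TaskItem[] = [];
  markdown.split(/\r?\n/).forEach((line) => {
    const match = pattern.exec(line);
    if (match) {
      tasks.push({ label: match[2].trim(), done: match[1] !== " " });
    }
  });
  return tasks;
}

// Headings become a repeated string field.
function extractHeadings(markdown: string): string[] {
  const pattern = /^#{1,6} (.+)$/;
  const headings: string[] = [];
  markdown.split(/\r?\n/).forEach((line) => {
    const match = pattern.exec(line);
    if (match) headings.push(match[1].trim());
  });
  return headings;
}
```

A view could then treat the done values as an ordinary repeated boolean field, without a Markdown-specific type.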

Related issues:

149
164

GamerGirlandCo commented 1 year ago

With separate field types, detecting them automatically would be tricky, so the user would likely have to specify which String fields should be parsed in a certain way.

I actually like the idea of separate string field types, and having the user distinguish between them via a dropdown setting. As for how the type would be persisted, I was thinking of using frontmatter like this:

stringfield:
  type: html
  content: <div>nice job!</div>

marcusolsson commented 1 year ago

stringfield:
  type: html
  content: <div>nice job!</div>

This would introduce a custom format, which would go against the leave-no-trace principle. We could of course parse all string fields as if they contained HTML, Markdown, LaTeX, etc., but that would likely be a performance hit.

Maybe the user just has to choose to enable parsing for the fields they want parsed.

Acylation commented 1 year ago

but that would likely be a performance hit.

How about auto-detection combined with user-specified types?

When initializing the project, or when a new field is detected, run type detection and save the results as project-level configuration. The user can define another type which overrides the detected one. "Type" here affects how data is parsed, displayed, and edited, while the raw values are still stored as strings.

To deal with existing incorrect values, just don't parse them and display them as strings, or return an error. When an incorrect input is detected, send a notification.
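A sketch of how this suggestion might fit together, under stated assumptions: ProjectTypeConfig, effectiveType, and parseAs are hypothetical names, and the configuration shape is invented for illustration.

```typescript
// Hypothetical names and shapes for illustration; not the plugin's actual API.
type FieldType = "boolean" | "number" | "string";

// Project-level configuration: detected types plus user overrides.
interface ProjectTypeConfig {
  detected: { [field: string]: FieldType | undefined };
  overrides: { [field: string]: FieldType | undefined };
}

// A user-defined type wins over the detected one; unknown fields are strings.
function effectiveType(config: ProjectTypeConfig, field: string): FieldType {
  return config.overrides[field] || config.detected[field] || "string";
}

type ParseResult =
  | { ok: true; value: boolean | number | string }
  | { ok: false; raw: string; message: string };

// Raw values stay strings; the type only decides how they are parsed.
// A value that doesn't fit is returned unparsed with a message for a notification.
function parseAs(raw: string, type: FieldType): ParseResult {
  if (type === "boolean") {
    if (raw === "true") return { ok: true, value: true };
    if (raw === "false") return { ok: true, value: false };
    return { ok: false, raw, message: `"${raw}" is not a boolean` };
  }
  if (type === "number") {
    const parsed = Number(raw);
    if (raw.trim() !== "" && isFinite(parsed)) return { ok: true, value: parsed };
    return { ok: false, raw, message: `"${raw}" is not a number` };
  }
  return { ok: true, value: raw };
}
```

When parseAs returns ok: false, a view could display raw as plain text and surface message as a notification.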

To parse user-defined or complex types, provide custom parsers that transform strings to fit different views. Related to #64

HananoshikaYomaru commented 1 year ago

Hi. I am the developer of 3d graph and frontmatter generator. I made a video about how to make tag properties using frontmatter generator. Basically, I can strongly type the properties using zod.

Recently I have been thinking about ways to integrate frontmatter generator with Projects. Hopefully it can help. I am also interested in helping make Projects work more seamlessly with frontmatter generator.

LynetteCullens commented 3 months ago

Basically, I can strongly type the properties using zod.

Recently I have been thinking about ways to integrate frontmatter generator with Projects. Hopefully it can help. I am also interested in helping make Projects work more seamlessly with frontmatter generator.

Did they allow you to help? It's been half a year.

simonausten commented 1 week ago

TL;DR: YAML-LD in the frontmatter, with e.g. Schema.org semantics might solve this.

stringfield:
  type: html
  content: <div>nice job!</div>

This would introduce a custom format, which would go against the leave-no-trace principle. We could of course parse all string fields as if they contained HTML, Markdown, LaTeX, etc., but that would likely be a performance hit.

Maybe the user just has to choose to enable parsing for the fields they want parsed.

There's a middle ground, which could be the "leave no trace that can't be universally interpreted", and that sounds like a job for RDF, specifically the YAML-LD syntax. Check out https://www.w3.org/community/reports/json-ld/CG-FINAL-yaml-ld-20231206/

In the example above, if a note needs a field called "stringfield" with type html and value <div>nice job!</div>, you could do:

stringfield:
  - rdf:type: rdf:html
  - rdfs:label: "<div>nice job!</div>"

This is 100% interoperable with any other system that chooses to read the document, and therefore leaves no unintelligible trace.

Here is a simple example from that document. It says: "this document, which uses the schema.org semantics, has the unique id https://w3.org/yaml-ld/ and is of type 'WebContent'. The name of the document is YAML-LD, and the author is as follows..."

"@context": https://schema.org
"@id": https://w3.org/yaml-ld/
"@type": WebContent
name: YAML-LD
author:
  "@id": https://www.w3.org/community/json-ld
  name: JSON-LD Community Group

The beauty of this approach is that you don't need to define the type system because it already exists, here: https://schema.org/WebContent, and every available property has a strong opinion on its data type.

Best of all, any markdown document containing this data is interoperable with any other system that uses RDF.

One hurdle is that Schema.org's task and project management provision isn't well defined, but any project could be enumerated using https://schema.org/Project, with tasks defined using https://schema.org/PlanAction, which has the properties scheduledTime, actionStatus, agent, endTime, error, instrument, location, object, participant, provider, result, startTime, target as well as others it inherits. Or, you could implement any other vocabulary as required, or your own, without loss of interoperability.

marcusolsson / obsidian-projects