Support for schema YAML key

jelovirt commented 1 year ago

Building on an older comment, this is an idea for a feature that would let users configure the schema/format of a Markdown document in the topic itself instead of through @format in the map.

Add support for specifying Markdown format/schema using a YAML metadata key $schema:

---
$schema: urn:example:schema:topic
---
# Title

Shortdesc.

Paragraph.

The URI urn:example:schema:topic is a reference to a configuration that specifies how the document is parsed and/or how the AST is converted to DITA, similar to a doctype declaration in a DITA file. It could, for example, configure:

whether the first paragraph is converted into a <shortdesc>
whether the document should conform to the core or extended MDITA profile
whether to use parser extensions such as the typographic extension or the autolink extension
that the Markdown file is always parsed as a DITA task regardless of root class names

The magic YAML key $schema is named after the JSON Schema keyword of the same name, which makes it more explicit that it is not a metadata field.
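As a rough illustration of how a reader might pick up the key, here is a minimal sketch that scans the YAML header for $schema without a YAML library. The class and method names are hypothetical and not taken from the plugin; a real implementation would presumably reuse the parser's existing YAML front matter handling, and this sketch ignores quoting and other YAML syntax.

```java
import java.util.Optional;

public class SchemaKeySketch {

    /** Returns the value of the $schema key in a leading YAML header, if present. */
    public static Optional<String> findSchema(String markdown) {
        String[] lines = markdown.split("\\R");
        // A YAML header must open with "---" on the very first line.
        if (lines.length == 0 || !lines[0].trim().equals("---")) {
            return Optional.empty();
        }
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.equals("---")) {
                break; // end of header reached without a $schema key
            }
            if (line.startsWith("$schema:")) {
                String value = line.substring("$schema:".length()).trim();
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String doc = String.join("\n",
                "---",
                "$schema: urn:example:schema:topic",
                "---",
                "# Title",
                "",
                "Shortdesc.");
        System.out.println(findSchema(doc).orElse("(no schema declared)"));
    }
}
```

Because the key is looked up by its exact name, ordinary metadata fields in the same header stay untouched.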

Alternatives

👎🏽 Support directly setting features in the YAML header

---
org.lwdita.features.shortdesc-paragraph: true
---
# Title

Shortdesc.

Paragraph.

I'm worried that this could lead to a dozen features needing to be configured per file, with each file having a different set of features. A single schema would specify the features outside the document and make it possible to use the same set of features everywhere.

Implementation

Schema configuration

Schema configuration is basically equivalent to how readers are currently configured in the constructor. It would need to be able to set features. A simple way to implement this is to use the ServiceLoader API.
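A minimal sketch of what a ServiceLoader-based lookup could look like. The SchemaProvider interface, its methods, and the feature names are assumptions made up for illustration; they are not the plugin's actual API. Only java.util.ServiceLoader itself is a real JDK class.

```java
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;

public class SchemaLookupSketch {

    /** Hypothetical service interface; implementations would be listed under META-INF/services. */
    public interface SchemaProvider {
        /** Returns true if this provider knows the given schema URI. */
        boolean supports(String schemaUri);

        /** Feature flags to apply, keyed by a made-up feature name. */
        Map<String, Boolean> features();
    }

    /** Returns the first provider that claims the given schema URI, if any. */
    public static Optional<SchemaProvider> lookup(String schemaUri,
                                                  Iterable<SchemaProvider> providers) {
        for (SchemaProvider provider : providers) {
            if (provider.supports(schemaUri)) {
                return Optional.of(provider);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // ServiceLoader discovers implementations registered on the classpath.
        ServiceLoader<SchemaProvider> loader = ServiceLoader.load(SchemaProvider.class);
        Optional<SchemaProvider> match = lookup("urn:example:schema:topic", loader);
        System.out.println(match.isPresent() ? "schema configured" : "no provider registered");
    }
}
```

With this shape, a third party could ship its own schema in a separate JAR by adding a provider class and a service registration file, without changes to the reader.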

infotexture commented 1 year ago

This might be useful indeed, but it poses the question of how to handle cases where the @format attribute in a map says to parse a topic one way, and the YAML metadata schema key says something else.

Setting features directly in the YAML header could be useful for one-off exceptions, but it could get messy with thousands of topics and multiple entries like this in each file, so I think defining rules in a central location would generally be preferable.

jelovirt commented 1 year ago

@infotexture

This might be useful indeed, but it poses the question of how to handle cases where the @format attribute in a map says to parse a topic one way, and the YAML metadata schema key says something else.

@format specifies which parser to use for parsing the file; $schema tells the parser how that specific file should be handled. Not all parsers need to support $schema; it is a feature of the individual parser. If a format is configured to use some features and the Markdown document declares a schema, the schema will always win.

For example,

<topicref href="file.md" format="markdown"/>

means DITA-OT will use com.elovirta.dita.markdown.MarkdownReader to read the file, as configured in plugin.xml. MarkdownReader will then find

---
$schema: urn:oasis:names:tc:dita:xsd:concept.xsd

and convert the Markdown file into a DITA concept. MarkdownReader either has this support hard-coded or has some configuration in which it looks up which processing features to use for urn:oasis:names:tc:dita:xsd:concept.xsd.
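The "schema always wins" rule could be read as a per-feature override, sketched below. This is only one possible interpretation: the schema might instead replace the format's configuration wholesale. The class name and feature keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FeaturePrecedenceSketch {

    /**
     * Starts from the features configured for the format and lets every
     * feature declared by the schema override the format's value.
     */
    public static Map<String, Boolean> resolve(Map<String, Boolean> formatDefaults,
                                               Map<String, Boolean> schemaFeatures) {
        Map<String, Boolean> resolved = new LinkedHashMap<>(formatDefaults);
        resolved.putAll(schemaFeatures); // schema values replace format values
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical feature names, for illustration only.
        Map<String, Boolean> formatDefaults = new LinkedHashMap<>();
        formatDefaults.put("shortdesc-paragraph", false);
        formatDefaults.put("autolink", true);

        Map<String, Boolean> schemaFeatures = new LinkedHashMap<>();
        schemaFeatures.put("shortdesc-paragraph", true);

        Map<String, Boolean> resolved = resolve(formatDefaults, schemaFeatures);
        System.out.println("shortdesc-paragraph = " + resolved.get("shortdesc-paragraph"));
        System.out.println("autolink = " + resolved.get("autolink"));
    }
}
```

Under this reading, features the schema does not mention keep whatever the format configured.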

kirkilj commented 1 year ago

@jelovirt, is this an answer to the extension mechanism question I raised a while ago, in that it would free this project from fielding requests to implement features for specific Markdown flavors? Could we define a schema ourselves and code a Java parser/converter "plugin" of sorts to implement the schema-based processing?

For example, could we use this capability to provide support for attribute lists on headings? Another example would be using specific attribute list values to tweak the conversion to DITA?

Would the schema be defined in Java extension code or in a declarative schema language of some type?

If I'm reading too much into this, let me down easy - I'm technically tender right now. 😉

jelovirt commented 1 year ago

@kirkilj this is one way to achieve what you want. It allows the Markdown file itself to declare which schema it uses, similar to a doctype declaration in XML files. You'd still have to implement the schema-specific processing. In some cases you can just configure the schema to use existing features. If those are not enough, you'd need to write more Java code to implement the features you need.

It would help if you could list actual use cases for schema-specific processing in comments on this issue.

jelovirt commented 1 year ago

@kirkilj See #141 for an example of how a schema is configured in the current implementation draft.

kirkilj commented 1 year ago

That helps a lot. I last looked at the flexmark API before DITA-OT Day, but I recall enough from that and from seeing how you've implemented changes since the 3.0 release. We could simply put the schema property in the YAML header and forgo the {.task} class attribute on the title element, which is much cleaner. We'd use the combination of the schema declaration and attribute lists on various Markdown items to generate the DITA we want.

One of the use cases is to take API and CLI parameters and tag them with class attributes such as .paramname, .paramtype, .paramrequired, .paramdefaultvalue, and .paramexamples so that we can decide which DITA elements to generate. Currently, these are specified either in Markdown list items or in headings. Anything they want to show up in a per-topic mini-TOC they put in headings. Would this extension mechanism allow us to put attribute lists on headings, which, if I recall correctly, the base Markdown-to-DITA conversion does not currently support?

Our internal R&D documentation is rendered with MkDocs, so we're trying to come up with something that renders fine in that environment but also has enough semantic hints to fully leverage what only DITA processing can do. I've asked them for a few samples, which I'll append shortly.

jelovirt commented 1 year ago

Closed the initial implementation PR #141. The next steps are to define what the built-in schemas should be and how they should be distributed.
