Links URL and Relative Paths

tajmone commented 2 years ago

@pml-lang, while working on the pandoc PML Writer I've started including test documents from pandoc's test suite. These are documents in various formats covering many formatting elements and their possible combinations.

Although the Lua filter is correct, many of these documents fail conversion via PMLC because they either contain images pointing to the Internet, or links with relative paths, etc., which PMLC doesn't support.

For example, a document containing a link pointing to /url results in the following PMLC error:

pmlc writer.pml
Error      '/url' is an invalid value for parameter 'url'. Reason:
           '/url' is an invalid URL Reason:
           '/url' is not a valid URL. Reason:
           no protocol: /url
Code       [ch [title Level 2 with an [link url=/url text="embedded link"]]

Error id   INVALID_ATTRIBUTE

I really can't understand why PMLC enforces these restrictions on link and image paths, acting like a sort of validator.

The inability to use images stored on the web, or provide links that use relative paths, or protocols other than those accepted by PMLC (from gopher:// to MS res://, up to custom URL Monikers) is a huge impediment to the adoption of PML for many domain-specific applications:

Relative URLs are essential for publishing web pages, and an /url link like the one from the above error message would be a normal link on a custom server.
Support for the res:// protocol is essential for storing HTML documents (or images and other assets) into DLLs, and then accessing them from within an application (e.g. via the WebControl).
Custom URL Monikers are essential for launching installed apps (with options) from a webpage (Skype, GitHub Desktop, etc.) or to implement applications like Executable eBooks (which serve pages via URL monikers to protect contents).
PML recognizes only a very limited number of the existing protocols as valid. I've tested them and don't have the full list at hand, but I remember that legacy protocols like gopher, telnet, and many others cause conversion to fail.
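To make the cases above concrete, here is a minimal Python sketch (not PMLC code) of how a converter could classify a link target by its scheme without rejecting anything, and how relative references resolve against a page URL. The function name `describe_target`, the base URL, and the host names are assumptions made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def describe_target(target: str) -> str:
    """Classify a link target without validating or rejecting it."""
    scheme = urlsplit(target).scheme
    if scheme:
        # Any scheme is passed through: http, gopher, res, ebook, ...
        return f"absolute ({scheme})"
    # No scheme: a relative reference, resolved later against the page's URL.
    return "relative"


# Hypothetical URL of the published page (an assumption for this example).
base = "https://example.org/docs/guide/index.html"

print(describe_target("/url"))                     # relative
print(describe_target("res://app.dll/help.html"))  # absolute (res)
print(describe_target("ebook:///cover.html"))      # absolute (ebook)
print(urljoin(base, "/url"))          # https://example.org/url
print(urljoin(base, "img/pear.png"))  # https://example.org/docs/guide/img/pear.png
```

The point of the sketch is that nothing is looked up or rejected: the target is only parsed, and a missing scheme is treated as a normal relative link.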

When I think of the HTML format and its innumerable applications (from HTML-based eBook formats to software using WebComponent GUIs) I really struggle with these limitations, especially not being able to include images from the web.

Also, why should PMLC check whether an image exists at all? In many documentation toolchains the images are generated on the fly, from ASCII blocks within the source document (or elsewhere), so they are deleted before each run and are not available until after the conversion. Procedurally generated images are such a strong component of software documentation that I can hardly imagine working without them (think of railroad diagrams, etc.).

You should really consider either revising the way PMLC handles URL attributes, or at least providing some alternative attributes that allow handling relative paths within any URI scheme and/or custom protocols or monikers.

I fail to see the reasons for the current errors, since all of the above mentioned cases adopt the same convention for how a protocol/moniker is expressed in terms of <name>://path/segmented/by/slashes/.

The fact that so many basic test docs from pandoc's test suite fail to build with PMLC (not because the pandoc-to-PML conversion failed, but because of unsupported paths/URLs in PMLC) is a strong indicator that there's something wrong with how PMLC approaches resource paths and links. None of the markup syntaxes I've worked with attempt to validate paths and URLs (unless you specify options like embedding images as Data URIs), for they assume the author knows what he/she is doing: it could be just a document template, or the image is temporarily unreachable due to a server being down, or the HTML page is intended to be used inside a running executable app, from a DLL ... you name it, there could be hundreds of reasons why an external asset is not reachable at the specified path/URL.

I think these problems have to be solved natively, i.e. in the vanilla PML syntax, as opposed to via custom scripts or extensions.

What are the reasons behind the current way PMLC handles image paths and links? Why are links expected to be expressed via protocols, when an HTML page might simply want to link to another page in the same folder (possibly without having to resort to the file:// protocol)?

pml-lang commented 2 years ago

I agree 100% with everything mentioned in your post. These limitations will be removed in the next PML version (the version I'm currently working on, written in Java). Absolute and relative URLs and file paths will be supported for all media assets, and the paths and URLs will no longer be validated by PMLC (unless explicitly asked for).
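As an illustration of "not validated unless explicitly asked for", here is a minimal Python sketch of opt-in validation. It is not how PMLC implements it; the function `image_tag` and its `validate` flag are hypothetical names, and HTML escaping is left out for brevity.

```python
from pathlib import Path


def image_tag(source: str, validate: bool = False) -> str:
    """Render an <img> tag; check the file only when validation is requested."""
    if validate and not Path(source).is_file():
        raise FileNotFoundError(f"image not found: {source}")
    # Without validation the path is emitted exactly as the author wrote it.
    return f'<img src="{source}">'


# Default: works even if the image is generated later in the toolchain.
print(image_tag("diagrams/railroad.svg"))
```

With the default `validate=False`, images that are generated after the conversion, or that live on a server that is temporarily unreachable, no longer cause the build to fail.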

tajmone commented 2 years ago

> I agree 100% with everything mentioned in your post. These limitations will be removed in the next PML version (the version I'm currently working on, written in Java).

Wonderful! I could then finally start using PML in executable ebooks (which all use custom protocols like ebook:///, etc.) and in-software documentation (via WebControl).

> Absolute and relative URLs and file paths will be supported for all media assets, and the paths and URLs will no longer be validated by PMLC (unless explicitly asked for).

The idea of allowing validation on demand is good.

How are you going to implement it, via CLI option or per element via node attributes?

That's actually a feature I've been pondering in my free time, when I try to think of features missing from the various markup languages that could be added to PML. A draft idea I had in mind was the ability to gather info on an image at conversion time, e.g. to calculate its width and height and inject them into the final HTML in order to ensure that its placeholder (while loading) is of the correct size.

E.g. it could be something like:

[image ( source=pear.png width=extract ) ]

where the special value extract implies that PML should locate the image and extract its dimensions from its header. The generated HTML should be something like:

<img src="pear.png" width="300">
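For the PNG case, reading the dimensions from the header is straightforward. The Python sketch below is only an illustration of the idea, not a proposed PMLC implementation: the function name `png_size` is made up, and the header bytes are built by hand so the example does not need a real pear.png on disk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG's IHDR chunk."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian 32-bit integers after the chunk type.
    return struct.unpack(">II", data[16:24])


# A hand-built header standing in for the first 24 bytes of pear.png.
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 300, 200)
width, height = png_size(header)
print(f'<img src="pear.png" width="{width}" height="{height}">')
```

Only the first 24 bytes of the file are needed, so the lookup stays cheap even for large images. Other image formats store their dimensions differently and would each need their own reader.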

The idea is that the special extract attribute value could be used wherever extracting metadata from external assets might be useful. In some media types metadata plays a greater role than in others, e.g. containing info about the title, author(s), creation date, license, etc., and sometimes even preview images or cover art. The extract feature would allow reusing the same PML block as a template for each external asset, since only the file name would be required to obtain the values of all other fields, e.g.:

[youtube_video
    yid = NUDhA4hXdS8
    caption = extract(title)
]

i.e. assuming that title is a valid metadata entry name that can somehow be obtained via the YouTube API once you know the video's YID.

Surely, the ability to extract metadata and info for all the various possible assets would require implementing a dedicated function for each asset type, and knowing where and how the data is stored. But maybe if PMLC could expose a public API for the extract(x) function for each node involving an external resource, then maybe end users could provide the data extraction code themselves, e.g. via script nodes or some external custom tool invoked via the command line or as a process.
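One way to picture such a user-extensible extract(x) API is a registry that maps a node name and field name to a user-supplied function. The Python sketch below is purely hypothetical: `EXTRACTORS`, `register`, and `extract` are invented names, and the YouTube extractor is a stand-in that does not call any real service.

```python
from typing import Callable

# Maps (node name, field name) to a user-supplied extractor function.
EXTRACTORS: dict[tuple[str, str], Callable[[str], str]] = {}


def register(node: str, field: str):
    """Decorator that registers an extractor for one node/field pair."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        EXTRACTORS[(node, field)] = func
        return func
    return decorator


def extract(node: str, field: str, resource: str) -> str:
    """Look up the extractor for node/field and apply it to the resource."""
    func = EXTRACTORS.get((node, field))
    if func is None:
        raise LookupError(f"no extractor registered for {node}.{field}")
    return func(resource)


@register("youtube_video", "title")
def fake_title(yid: str) -> str:
    # Stand-in: a real extractor would query a video metadata service.
    return f"Title of video {yid}"


print(extract("youtube_video", "title", "NUDhA4hXdS8"))
```

The converter would only need to know the registry, while the knowledge of where and how each asset type stores its data stays in the user-provided functions.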

This feature would be very useful when creating catalogues of assets, or handling dynamically generated HTML pages with lots of images (which might take time to load) which are computed at conversion time (e.g. based on the files present in a given folder), etc.

pml-lang commented 2 years ago

> How are you going to implement it, via CLI option or per element via node attributes?

I was already thinking of a general way to define default values for all attributes of all nodes. Default values could be defined:

1. in a shared config file, valid for all documents.
2. in a not-yet-existing config node, a direct child of the doc node (overrides 1)
3. in CLI arguments (overrides 2)

Each default value can of course be explicitly overridden for each individual node in a document.
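The precedence chain described above (shared config file, then config node, then CLI arguments, then the individual node) can be sketched in Python with a `ChainMap`. This is only an illustration of the lookup order; the attribute names `validate_urls` and `image_width` and their values are invented for the example.

```python
from collections import ChainMap

shared_config = {"validate_urls": "no", "image_width": "auto"}  # 1. shared config file
doc_config = {"image_width": "400"}                             # 2. config node in the doc
cli_args = {"validate_urls": "yes"}                             # 3. CLI arguments


def effective_attributes(node_attributes: dict) -> ChainMap:
    """Combine all sources; ChainMap searches left to right, so earlier maps win."""
    return ChainMap(node_attributes, cli_args, doc_config, shared_config)


attrs = effective_attributes({"image_width": "extract"})
print(attrs["image_width"])    # extract (set explicitly on the node)
print(attrs["validate_urls"])  # yes (from the CLI, overriding the shared config)
```

A node that sets nothing itself falls through to the CLI value, then the config node, then the shared file.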

Note that script nodes can also be used to define default values, as demonstrated here.

> the ability to enable gathering info on an image at conversion time, e.g. to calculate its width and height and inject them in the final HTML in order to ensure that its placeholder (while loading) is of the correct size. ... caption = extract(title)

Very nice idea to optimize and automate!

tajmone commented 2 years ago

The option precedence seems good, but you should also consider introducing a precedence modifier, like the @ symbol in Asciidoctor (see Altering the assignment precedence in the Asciidoctor manual).

This allows changing the default precedence, e.g. by prefixing an option with the @ in the CLI, the node definition, or the settings file. This becomes quite important when dealing with complex projects that share a common settings file or invocation commands, where some documents might need an option (or attribute, etc.) defined in the document to take precedence over the CLI, and vice versa.
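A precedence modifier of this kind can be sketched in a few lines of Python. This is a hypothetical illustration, not Asciidoctor's or PML's actual logic: the function `resolve`, the `toc` option, and the marker position (a trailing `@` on the value) are all assumptions chosen for the example.

```python
from typing import Optional


def resolve(name: str, document: dict, cli: dict) -> Optional[str]:
    """CLI beats document, unless the CLI value is marked soft with '@'."""
    cli_value = cli.get(name)
    if cli_value is not None and not cli_value.endswith("@"):
        return cli_value  # hard CLI value: always wins
    if name in document:
        return document[name]  # document overrides a soft (or absent) CLI value
    # Soft CLI value used as a fallback, with the marker stripped.
    return cli_value[:-1] if cli_value is not None else None


print(resolve("toc", {"toc": "left"}, {"toc": "right"}))   # right
print(resolve("toc", {"toc": "left"}, {"toc": "right@"}))  # left
print(resolve("toc", {}, {"toc": "right@"}))               # right
```

The soft marker turns a CLI value into a default: it applies only to documents that do not set the option themselves.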

> in a not-yet-existing config node, a direct child of the doc node (overrides 1)

So, if I've understood correctly, all the document options will be part of the [doc attributes and/or sub-nodes, along with metadata, etc. Pandoc uses a YAML section for metadata, placed at the beginning of a document, but in the actual AST the metadata and options are effectively sub-nodes of the metadata node, so I guess that PML's approach is closer to the latter, since its syntax is always closer to an AST.

It makes sense. Then I imagine that ultimately the nodes tree will be something like:

-+- [doc +
         +- [metadata
         +- [options/settings
         + ...

If end users have access to all nodes, dynamically from within the document itself (e.g. to read metadata or settings, or even change them) it could lead to very interesting use cases. E.g. conditional text could be shown depending on a metadata or setting value, and so on (e.g. omit contents if it's a sample/demo version of an eBook).

pml-lang commented 2 years ago

> consider introducing a precedence modifier

Good point.

> I imagine that ultimately the nodes tree will be something like:
>
> -+- [doc +
>          +- [metadata
>          +- [options/settings
>          + ...

Yes, exactly.

> If end users have access to all nodes ...

Yes, meta- and config-data should be available as maps (dictionaries) in script nodes.

Moreover, for each meta-data entry there should be an attribute (maybe named show, with sensible default values) to define whether the data will be displayed automatically in a meta-data table at the beginning of the document. E.g.:

[meta
    [author Albert Newton]  [- shown by default -]
    [license (show=yes) MIT]
    [license_URL (show=no) https://opensource.org/licenses/MIT]
]
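The selection logic behind such a show attribute could look roughly like the Python sketch below. It is an assumption-laden illustration: the per-entry defaults in `SHOWN_BY_DEFAULT` are invented (the example above only states that author is shown by default), and `visible_rows` is a made-up helper name.

```python
# Hypothetical defaults: whether an entry appears when 'show' is not given.
SHOWN_BY_DEFAULT = {"author": True, "license": False, "license_URL": False}

# (name, explicit 'show' value or None, content), mirroring the [meta ...] example.
entries = [
    ("author", None, "Albert Newton"),
    ("license", True, "MIT"),
    ("license_URL", False, "https://opensource.org/licenses/MIT"),
]


def visible_rows(items):
    """Return the (name, value) pairs that belong in the meta-data table."""
    rows = []
    for name, show, value in items:
        if show is None:
            # No explicit attribute: fall back to the per-entry default.
            show = SHOWN_BY_DEFAULT.get(name, False)
        if show:
            rows.append((name, value))
    return rows


print(visible_rows(entries))
# [('author', 'Albert Newton'), ('license', 'MIT')]
```

Entries that are not shown remain stored in the document's metadata; they are only left out of the rendered table.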

tajmone commented 2 years ago

> Moreover, for each meta-data entry there should be an attribute (maybe named show, with sensible default values) to define whether the data will be displayed automatically in a meta-data table at the beginning of the document.

That's an excellent idea. It allows storing extra info even if it is not actually displayed in the document. I don't like the name show though; it's a bit vague. Maybe more context-appropriate candidates could be: display, hidden, reveal (although none of them really conveys the document-specific goal).

pml-lang commented 2 years ago

> inability to use images stored on the web, or provide links that use relative paths, or protocols other than those accepted by PMLC

Fixed in version 3.0.0.

Moreover, a shared options file, valid for all documents, as well as an option node (a direct child of the doc node) have been added in version 3.0.0.

pml-lang / pml-companion

Links URL and Relative Paths #73