Bidirectional parsing with sgml

Risto-Stevcev commented 1 year ago

I think it would be useful to have bidirectional parsing of HTML/XML, something like:

?- load_html("<code>(foo (+ 2 3))</code>", X, []), load_html(C, X, []).
   error(instantiation_error,load_html/3).

The use-case would be to have a way to build/transform HTML/XML before creating it as a string to use for a server or whatever else.

Currently the library also inserts tags that didn't exist in the string, so that might need to be addressed as part of it:

?- load_html("<code>(foo (+ 2 3))</code>", X, []).
   X = [element(html,[],[element(head,[],[]),element(body,[],[element(code,[],["(foo (+ 2 3 ..."])])])].
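
For the build/transform use case mentioned above, it may help that the DOM is an ordinary Prolog term, so it can be transformed with plain predicates before it is serialized. The following is only a minimal sketch: dom_renamed/4 and node_renamed/4 are hypothetical helper names, not part of library(sgml), and the sketch assumes that text nodes are lists of characters, as in the answer shown above.

```prolog
:- use_module(library(lists)).

% dom_renamed(From, To, Nodes0, Nodes): hypothetical helper, not part of
% library(sgml). Nodes is Nodes0 with every element named From renamed to To.
dom_renamed(From, To, Nodes0, Nodes) :-
        maplist(node_renamed(From, To), Nodes0, Nodes).

% An element: rename it if its name is From, then process its children.
node_renamed(From, To, element(Name0, Attrs, Cs0), element(Name, Attrs, Cs)) :-
        (   Name0 == From -> Name = To
        ;   Name = Name0
        ),
        dom_renamed(From, To, Cs0, Cs).
% A text node is a list of characters and stays unchanged.
node_renamed(_, _, [], []).
node_renamed(_, _, [C|Cs], [C|Cs]).

% Example query:
% ?- dom_renamed(code, pre, [element(code,[],["(foo (+ 2 3))"])], DOM).
```

With this sketch, the example query relates the code element to the same element named pre, with attributes and text left as they were.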

triska commented 1 year ago

I think one way to avoid the unwanted tags is to use load_xml/3 instead of load_html/3. For example, we get:

?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []).
   DOM = [element(code,[x="123",b="cde"],["(foo (+ 2 3))"])].

We can use a DCG to relate such a DOM representation to a list of characters:

:- use_module(library(dcgs)).
:- use_module(library(format)).

elements_string([]) --> [].
elements_string([E|Es]) -->
        element_string(E),
        elements_string(Es).

element_string([C|Cs]) --> seq([C|Cs]).
element_string(element(Name, Attrs, Cs)) -->
        format_("<~w", [Name]),
        attributes(Attrs),
        ">\n",
        elements_string(Cs),
        format_("~n</~w>~n", [Name]).

attributes([]) --> [].
attributes([A|As]) --> " ", attributes_([A|As]).

attributes_([]) --> [].
attributes_([Name=Value|As]) -->
        format_("~w=\"~s\"", [Name,Value]),
        attributes(As).

Yielding:

?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []),
   phrase(elements_string(DOM), Cs).
   DOM = [element(code,[x="123",b="cde"],["(foo (+ 2 3))"])],
   Cs = "<code x=\"123\" b=\"cde\">\n(foo (+ 2 3))\n</code>\n".

Emitting it with format/2 yields:

?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []),
   phrase(elements_string(DOM), Cs),
   format("~s", [Cs]).
<code x="123" b="cde">
(foo (+ 2 3))
</code>
...

Does this help?

Note that load_html/3 and load_xml/3 support several different sources in addition to lists of characters, so converting a DOM to only a list of characters would be incomplete.

Risto-Stevcev commented 1 year ago

Yeah, thanks! That's really helpful.

Is there a way to run the DCG example in the reverse direction, something like phrase(elements_string(DOM), "<code x=\"123\" b=\"cde\">\n(foo (+ 2 3))\n</code>\n")?
Sorry if it's a dumb question; I admittedly need to brush up on DCGs. I thought they were bidirectional because they're syntactic sugar over regular relations, but it just threw out a list of single-character strings in an infinite loop. I'm asking because I might need to port this to Tau Prolog for the frontend, and the sgml library uses Rust internals.

Unrelated: I really like your Power of Prolog series on your site and YouTube. I love Prolog, and the series inspires me to work with it a lot more.

triska commented 1 year ago

Thank you, thank you, I am glad you find the material useful!

Parsing HTML is harder than generating it from the DOM, so Tau Prolog may benefit from similar engine-powered facilities to easily parse HTML. An alternative may be to use the newly available WASM port of Scryer Prolog for the frontend; please see https://github.com/mthom/scryer-prolog/discussions/2005 for more information!
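
To illustrate why the parsing direction needs its own description: element_string//1 above relies on format_//2, which is meant for describing output from instantiated arguments, and its text-node clause uses seq//1, which accepts any nonempty run of characters, markup included. That is likely why the reverse query reports single-character "text nodes". The following is only a rough, parse-only sketch for a tiny XML subset (elements without attributes, plus text; no entities, comments, or whitespace handling). All nonterminal names are hypothetical, and this is not how library(sgml) works.

```prolog
:- use_module(library(dcgs)).
:- use_module(library(lists)).

% Parse-only sketch for a tiny XML subset: elements without attributes,
% and text. All names here are hypothetical, not part of library(sgml).

nodes([N|Ns]) --> node(N), nodes(Ns).
nodes([]) --> [].

% <name>children</name>; the closing tag must repeat the opening name.
node(element(Name, [], Children)) -->
        "<", name_chars(NameCs), ">",
        nodes(Children),
        "</", seq(NameCs), ">",
        { atom_chars(Name, NameCs) }.
% A text node: a maximal nonempty run of characters other than < and >.
node([C|Cs]) --> text_char(C), text_rest(Cs).

text_rest([C|Cs]) --> text_char(C), text_rest(Cs).
text_rest([]) --> text_end.

text_char(C) --> [C], { \+ member(C, "<>") }.

% Lookahead that consumes nothing: the remaining input does not start
% with a text character (end of input, or markup follows).
text_end(Cs, Cs) :- \+ phrase(text_char(_), Cs, _).

name_chars([C|Cs]) --> name_char(C), name_rest(Cs).

name_rest([C|Cs]) --> name_char(C), name_rest(Cs).
name_rest([]) --> [].

name_char(C) --> [C], { \+ member(C, "<>/ \n\t") }.

% Example query:
% ?- phrase(nodes(DOM), "<code>(foo (+ 2 3))</code>").
```

Under these assumptions, the example query describes a DOM with a single code element whose only child is the text "(foo (+ 2 3))". Attributes, whitespace, entities, and HTML's error recovery are all left out, which is where most of the difficulty of real parsing lies.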

mthom / scryer-prolog

Bidirectional parsing with sgml #2002