Open Risto-Stevcev opened 1 year ago
I think one way to avoid the unwanted tags is to use load_xml/3 instead of load_html/3. For example, we get:
?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []).
   DOM = [element(code,[x="123",b="cde"],["(foo (+ 2 3))"])].
We can use a DCG to relate such a DOM representation to a list of characters:
:- use_module(library(dcgs)).
:- use_module(library(format)).

% elements_string(Es)// describes the characters of a list of DOM nodes.
elements_string([]) --> [].
elements_string([E|Es]) --> element_string(E), elements_string(Es).

% A node is either non-empty text (a list of characters) or element/3.
element_string([C|Cs]) --> seq([C|Cs]).
element_string(element(Name, Attrs, Cs)) -->
        format_("<~w", [Name]),
        attributes(Attrs),
        ">\n",
        elements_string(Cs),
        format_("~n</~w>~n", [Name]).   % the closing tag needs the slash

% Every attribute is preceded by a single space.
attributes([]) --> [].
attributes([A|As]) --> " ", attributes_([A|As]).

attributes_([]) --> [].
attributes_([Name=Value|As]) -->
        format_("~w=\"~s\"", [Name,Value]),
        attributes(As).

Yielding:

?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []),
   phrase(elements_string(DOM), Cs).
   DOM = [element(code,[x="123",b="cde"],["(foo (+ 2 3))"])], Cs = "<code x=\"123\" b=\"cde\">\n(foo (+ 2 3))\n</code>\n".

Emitting it with format/2 yields:

?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []),
   phrase(elements_string(DOM), Cs),
   format("~s", [Cs]).
<code x="123" b="cde">
(foo (+ 2 3))
</code>
   ...
Does this help?
Note that load_html/3 and load_xml/3 support several different sources in addition to lists of characters, so converting a DOM to only a list of characters would be incomplete.
Yeah, thanks! That's really helpful.
Is there a way to run the DCG example in the reverse direction, something like phrase(elements_string(DOM), "<code x=\"123\" b=\"cde\">\n(foo (+ 2 3))\n</code>\n")?
Sorry if it's a dumb question; I admittedly need to brush up on DCGs. I thought they were bidirectional because they're syntactic sugar over regular relations, but the query just returned a list of single-character strings in an infinite loop. I'm asking because I might need to port it to Tau Prolog for the frontend, and the sgml library uses Rust internals.
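One likely source of the single-character strings is the text clause, element_string([C|Cs]) --> seq([C|Cs]). When the DOM is unknown, seq//1 may stop after any non-empty prefix, so every way of cutting the string into pieces becomes a candidate answer. The following is a minimal sketch for Scryer Prolog, using a hypothetical chunks//1 non-terminal that keeps only that clause. The names chunks//1, first_split/1 and all_splits/1 are assumptions made for illustration, and the sketch does not model what format_//2 does with unbound arguments:

```prolog
:- use_module(library(dcgs)).
:- use_module(library(lists)).

% chunks(Cs)// is a hypothetical stand-in for elements_string//1 that
% keeps only the text clause: every chunk is a non-empty list of chars.
chunks([]) --> [].
chunks([[C|Cs]|Chunks]) --> seq([C|Cs]), chunks(Chunks).

% Parsing "abc" with an unknown list of chunks: seq//1 tries the
% shortest prefix first, so the first answer has one-character chunks.
first_split(Split) :-
        once(phrase(chunks(Split), "abc")).

% All answers: one per way of cutting "abc" into non-empty pieces.
all_splits(Splits) :-
        findall(Split, phrase(chunks(Split), "abc"), Splits).
```

With these definitions, first_split(S) should give S = ["a","b","c"], and all_splits/1 should collect four splits. The number of splits doubles with every extra character, which may be part of why the reverse query appeared not to terminate on a longer string.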
Unrelated: I really like your Power of Prolog series on your site and YouTube. I love Prolog, and it inspires me to work with it a lot more.
Thank you, thank you, I am glad you find the material useful!
Parsing HTML is harder than generating it from the DOM, so Tau Prolog may benefit from similar engine-powered facilities to easily parse HTML. An alternative may be to use the newly available WASM port of Scryer Prolog for the frontend. Please see https://github.com/mthom/scryer-prolog/discussions/2005 for more information!
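To give a rough idea of what a DCG usable in both directions could look like, here is a sketch for Scryer Prolog that covers only a tiny, simplified subset: elements without attributes, plain text without escaping, and no whitespace handling. The representation (element(NameChars, Children) and text(Chars), with names as lists of characters) and all predicate names are assumptions chosen for this illustration; they are not the DOM format of library(sgml), and real HTML needs far more than this:

```prolog
:- use_module(library(dcgs)).
:- use_module(library(lists)).

% nodes(Ns)// relates a list of nodes to the characters that spell them.
% A node is text(Chars) or element(NameChars, Children).
% Adjacent text nodes are excluded so that parsing has a single answer.
nodes([]) --> [].
nodes([text([C|Cs])|Ns]) --> text_chars([C|Cs]), nodes_after_text(Ns).
nodes([element(Name, Children)|Ns]) --> element(Name, Children), nodes(Ns).

% After a text node, only an element (or nothing) may follow.
nodes_after_text([]) --> [].
nodes_after_text([element(Name, Children)|Ns]) -->
        element(Name, Children),
        nodes(Ns).

% The same name must appear in the opening and the closing tag.
element([C|Cs], Children) -->
        "<", name_chars([C|Cs]), ">",
        nodes(Children),
        "</", name_chars([C|Cs]), ">".

% Text is any run of characters other than '<' (no escaping here).
text_chars([C|Cs]) --> [C], { C \== '<' }, text_chars(Cs).
text_chars([]) --> [].

name_chars([C|Cs]) --> [C], { name_char(C) }, name_chars(Cs).
name_chars([]) --> [].

name_char(C) :- \+ member(C, "<>/ ").

% Parsing:    phrase(nodes(DOM), "<code>(foo (+ 2 3))</code>")
%             gives DOM = [element("code",[text("(foo (+ 2 3))")])].
% Generating: phrase(nodes([element("code",[text("(foo (+ 2 3))")])]), Cs)
%             gives Cs = "<code>(foo (+ 2 3))</code>".
```

The sketch avoids format_//2 and describes every character with plain terminals, so the same clauses can consume a known string or produce an unknown one. It is meant to be used with either the node list or the string given; the character tests use \==/2 and \+/1, so the most general query is not handled.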
I think it would be useful to have bidirectional parsing of HTML/XML, something like:
The use-case would be to have a way to build/transform HTML/XML before creating it as a string to use for a server or whatever else.
Currently the library also inserts tags that didn't exist in the string, so that might need to be addressed as part of it: