jessealama / laramie

HTML5 parser for Racket
MIT License

parsing HTML file #1

Open arademaker opened 2 years ago

arademaker commented 2 years ago

Hi @jessealama, thanks for the package. Can you provide a simple example of how to parse a local HTML file with Laramie? I didn't see anything in the tests. Documentation is not available yet, right?

My guess was:

(define doc
    (port->string (open-input-file "/Users/ar/work/propbank/up-t/UP_Portuguese-Bosque/verbs/votar.html")
                  #:close? #t))

(define state (parse doc))

jessealama commented 2 years ago

Hey there! Somehow I missed this message entirely.

Yes, documentation is sadly missing. Your question is a good one. It looks like you found a solution, though? Using parse is the right approach. You get a document? back (also defined in Laramie). Perhaps you can tell me a bit more about what you're trying to do? Extract links, check that something is missing (or present)?

arademaker commented 2 years ago

I was trying to convert the HTML documents into valid XML documents according to a predefined DTD/XML Schema. In particular, I want to convert the HTML files from [1] into XML files like [2].

[1] https://github.com/UniversalPropositions/UP_Portuguese-Bosque/tree/main/verbs
[2] https://github.com/propbank/propbank-frames/blob/main/frames/absorb.xml

arademaker commented 2 years ago

The problem is not especially difficult, but the HTML is invalid, so I need a library that is robust to invalid markup.

jessealama commented 2 years ago

Ah, I see. There used to be an HTML-to-XML converter in Laramie. The funny thing is, that old code is still there, but it is not provide'd. I will open a feature request issue to make that code accessible. It seems to me that once you've got an xexpr? value, you can massage it according to your needs. It sounds like you've got a schema you need to work with that requires some custom dropping/slicing/rearranging?
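
To illustrate the kind of massaging meant here, below is a minimal sketch that uses only Racket's standard xml library, not Laramie itself. The sample xexpr, the rename-tags helper, and the tag mapping are all hypothetical and stand in for whatever the real conversion and target schema would require.

```racket
#lang racket/base
(require xml)

;; Hypothetical input: a small xexpr standing in for converted HTML.
(define sample
  '(div ((class "frame"))
        (h1 "votar")
        (p "Roleset 1")
        (p "Roleset 2")))

;; Hypothetical mapping from HTML tag names to target XML tag names.
(define mapping
  (hash 'div 'frameset 'h1 'predicate 'p 'roleset))

;; An xexpr attribute list is a (possibly empty) list of (symbol string) pairs.
(define (attrs? v)
  (and (list? v)
       (andmap (lambda (a) (and (pair? a) (symbol? (car a)))) v)))

;; Hypothetical helper: recursively rename element tags according to
;; `mapping`. Attributes and text are left unchanged, and every element
;; in the result carries an explicit (possibly empty) attribute list.
(define (rename-tags x mapping)
  (cond
    [(and (pair? x) (symbol? (car x)))
     (define tag (hash-ref mapping (car x) (car x)))
     (define rest-parts (cdr x))
     (define-values (attrs kids)
       (if (and (pair? rest-parts) (attrs? (car rest-parts)))
           (values (car rest-parts) (cdr rest-parts))
           (values '() rest-parts)))
     `(,tag ,attrs ,@(for/list ([k (in-list kids)])
                       (rename-tags k mapping)))]
    [else x]))

;; Serialize the transformed xexpr as an XML string.
(define converted (rename-tags sample mapping))
(displayln (xexpr->string converted))
```

Dropping or rearranging elements would follow the same recursive shape, with a filter or reorder step applied to the children before rebuilding each element.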

jessealama commented 2 years ago

Now that I look more closely, I can see that there is indeed some code that might do what you want (at least, the conversion from an HTML file to some kind of HTML representation, and then to XML). It's a combination of Laramie's ->html function with html->xml. The return value of html->xml is an XML document?, in the sense of the Racket xml package. Does this help?

Perhaps I should make the html->xml function more convenient to use by allowing it to take a string (or byte string), or maybe even an input port, as an argument. Then I think you get what you want without much hassle.
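
As a rough sketch of what that convenience could look like, the helper below normalizes a string, byte string, or input port into a single string using only standard Racket. The name input->string is hypothetical and not part of Laramie; a more flexible html->xml could call something like it before parsing.

```racket
#lang racket/base
(require racket/port)

;; Hypothetical helper (not part of Laramie): accept a string, a byte
;; string, or an input port, and return the content as a string.
(define (input->string in)
  (cond
    [(string? in) in]
    ;; Decode as UTF-8, replacing invalid byte sequences with U+FFFD
    ;; instead of raising an exception.
    [(bytes? in) (bytes->string/utf-8 in #\uFFFD)]
    [(input-port? in) (port->string in)]
    [else (raise-argument-error 'input->string
                                "(or/c string? bytes? input-port?)"
                                in)]))
```

A local file could then be read with (call-with-input-file path input->string), which also closes the port when reading finishes.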