msva / lua-htmlparser

An HTML parser for lua.

Extract text node only #47

Closed stejacob closed 7 years ago

stejacob commented 7 years ago

Using your library, is it possible to extract only the text nodes from an HTML document?

For example: <p>This is <strong>a typical</strong> line of <em>text</em></p>

The result would be: This is a typical line of text

I was able to write a recursive function that loops through each element, but I'm not sure where to go from there to extract only the text.

Thank you.

msva commented 7 years ago

You can do that with something like:

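local htmlparser = require("htmlparser")
local root = htmlparser.parse("<p>This is <strong>a typical</strong> line of <em>text</em></p>")
-- gettext() returns the element's source text; gsub then strips every tag.
-- The extra parentheses drop gsub's second return value (the match count).
local textonly = (root:gettext():gsub("<[^>]*>", ""))
print(textonly)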

That said, there have been requests to implement this as a library function, and I'm still not sure whether we should.
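
For anyone who wants this as a reusable function in their own code, here is a minimal sketch that wraps the tag-stripping pattern above. `strip_tags` is a hypothetical name and not part of lua-htmlparser; it works on a plain string and additionally collapses whitespace, which is an assumption about what "text only" should mean. The pattern is naive: a `>` inside an attribute value, a comment, or a script block will confuse it.

```lua
-- Hypothetical helper, NOT part of lua-htmlparser: strip tags from an HTML string.
local function strip_tags(html)
  -- Remove everything that looks like a tag: '<', any non-'>' characters, '>'.
  local text = html:gsub("<[^>]*>", "")
  -- Collapse runs of whitespace into single spaces, then trim both ends.
  text = text:gsub("%s+", " "):gsub("^%s+", ""):gsub("%s+$", "")
  return text
end

print(strip_tags("<p>This is <strong>a typical</strong> line of <em>text</em></p>"))
-- prints: This is a typical line of text
```

With the parser, the input would be the string returned by `gettext()` on the parsed root or on any selected element.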

stejacob commented 7 years ago

Thanks for your answer.

It would be great if we could loop through each XML element and filter by node type, as in jQuery. Your solution does work for me, though. If you get that question often, it might be useful to provide a simple function in your library. Thanks for the great work.

Example in jQuery: var textList = root.contents().filter(function() { return this.nodeType == 3; });

Regards.
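
A rough Lua analogue of the jQuery filter quoted above, as a sketch only. Since the thread suggests the library does not expose text nodes directly, this works on the HTML string itself (for example, the result of `gettext()`) and collects the text found between tags in document order. `text_segments` is a hypothetical name, not a library function, and the tag pattern is the same naive one used earlier in the thread. Unlike the jQuery version, it returns text at every nesting depth rather than only direct children.

```lua
-- Hypothetical helper, NOT part of lua-htmlparser:
-- return a list of the text segments found between tags.
local function text_segments(html)
  local segments = {}
  local pos = 1
  while pos <= #html do
    -- Locate the next tag at or after the current position.
    local tag_start, tag_end = html:find("<[^>]*>", pos)
    -- Everything before that tag (or up to the end of input) is text.
    local chunk = html:sub(pos, (tag_start or #html + 1) - 1)
    -- Keep the chunk only if it contains a non-whitespace character.
    if chunk:find("%S") then
      segments[#segments + 1] = chunk
    end
    if not tag_start then break end
    pos = tag_end + 1
  end
  return segments
end

local segs = text_segments("<p>This is <strong>a typical</strong> line of <em>text</em></p>")
print(#segs)               -- prints: 4
print(table.concat(segs))  -- prints: This is a typical line of text
```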