Closed ghost closed 7 years ago
Hi! Sorry for ignoring this issue for about a year. Unfortunately, GitHub didn't send me notifications about new issues after the previous maintainer re-rooted the repo. It looks like he was the one who got all those emails.
I'll try to handle all the issues in the near future.
Actually, it is not that clear what this should look like:
How should it work with <html>outside1 <b>inside</b> outside2</html>? Which text should be in the node? outside1 outside2? 😎
Also, you can do something like:
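-- r is assumed to be the root node returned by the parser
local outside = r"html"
local inside = r"html b"
-- content of every <b> found inside <html>
for k,v in pairs(inside) do
  print(v:getcontent())
end
-- content of the <html> node itself (still includes the <b> markup)
for k,v in pairs(outside) do
  print(v:getcontent())
end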
And in your particular case from the initial question, you can just strip the child tags from the html node with something like :gsub("<(.-)>(.-)</%1>","") applied to getcontent() in the output loop.
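The tag-stripping idea can be tried on a plain string, without the parser. The following is only a minimal sketch: direct_text is a hypothetical helper name, the input is assumed to be the kind of inner-HTML string that getcontent() returns, and the pattern is a slightly stricter variant that captures only the tag name so that tags with attributes still match their closing tag.

```lua
-- Hypothetical helper (not part of the library): drop child elements and
-- their contents from an inner-HTML string, keeping only the text that
-- sits directly in the node.
local function direct_text(content)
  -- "<(%w+)[^>]*>" matches an opening tag and captures its name,
  -- ".-" lazily skips the element body, "</%1>" matches the closing tag.
  local stripped = content:gsub("<(%w+)[^>]*>.-</%1>", "")
  -- Trim leading and trailing whitespace; the outer parentheses drop
  -- gsub's second return value (the substitution count).
  return (stripped:gsub("^%s+", ""):gsub("%s+$", ""))
end

print(direct_text("<b>inside</b> outside"))  --> outside
```

Lua patterns are not a real HTML parser, so this breaks on nested elements with the same name (for example a div inside a div) and on void tags such as br.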
P.S. I'll close the issue for now, but don't hesitate to post your questions here.
First off, it's amazing, and whoever made this is amazing too.
Now, I encountered a small problem. I apologize because I am not knowledgeable about web technologies and my vocabulary is probably incorrect, so I hope you'll understand what I mean.
If you try to parse this:
You won't be able to extract most of the content.
If you take this line, <em class='bbc'>banana</em>, you will be able to extract the content (banana) with node:gettext(), but you won't be able to extract anything that isn't inside a tag at all (I believe we call that a text node?).
For example, in this html code:
<html><b>inside</b> outside</html>
You'll be able to extract the word inside, as it's inside the tag so it's going to be in a node, but not the word outside, despite the fact that both of these words will be displayed by modern browsers and thus are both important.
I believe that "outside" should also be put in a node, just a node with an empty "name" field. Or call it a text node, maybe.
Maybe I missed something, but I couldn't find how to extract most of the content in the html further above.
This makes it difficult to convert an HTML document to plain text, since all web browsers actually DO display this text.
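To make the "text node" idea concrete, here is a rough sketch in plain Lua that does not use the library at all. tokenize and plain_text are hypothetical names; text pieces get an empty name field as suggested above, and the tag pattern is a simplification that ignores comments, scripts and attribute values containing ">".

```lua
-- Hypothetical sketch: split markup into a flat list of tokens so that
-- bare text gets its own entry with an empty "name" field.
local function tokenize(html)
  local tokens = {}
  local pos = 1
  while pos <= #html do
    local s, e = html:find("<[^>]+>", pos)
    if not s then
      -- no more tags: the rest of the input is text
      tokens[#tokens + 1] = { name = "", text = html:sub(pos) }
      break
    end
    if s > pos then
      -- text sitting between the previous tag and this one
      tokens[#tokens + 1] = { name = "", text = html:sub(pos, s - 1) }
    end
    -- tag token: the name is taken from "<b ...>" or "</b>"
    local name = html:match("^</?(%w+)", s) or "#unknown"
    tokens[#tokens + 1] = { name = name, text = html:sub(s, e) }
    pos = e + 1
  end
  return tokens
end

-- Plain text is then just the concatenation of the text tokens.
local function plain_text(html)
  local parts = {}
  for _, tok in ipairs(tokenize(html)) do
    if tok.name == "" then
      parts[#parts + 1] = tok.text
    end
  end
  return table.concat(parts)
end

print(plain_text("<html><b>inside</b> outside</html>"))  --> inside outside
```

This keeps both words from the example, but it is a flat token list rather than a tree, so it says nothing about which element a piece of text belongs to.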