msva / lua-htmlparser

An HTML parser for lua.

Extracting the word "outside" in <html><b>inside</b> outside</html> #44

Closed ghost closed 7 years ago

ghost commented 8 years ago

First off, it's amazing, and whoever made this is amazing too.

Now, I encountered a small problem. I apologize because I am not knowledgeable about web technologies and my vocabulary is probably incorrect, so I hope you'll understand what I mean.

If you try to parse this:

<div class='post entry-content '>
<!-- google_ad_section_start -->
 <span style='font-size: 48px;'><span style='font-family: courier new,courier,monospace'>Krist</span></span><br/>
<br/>
Krist is a currency that operates across servers (and in singleplayer). The installer is on the bottom of this post.<br/>
<br/>
Users can send KST to eachother via Krist Addresses, a ten character string that is led by a lowercase <em class='bbc'>k</em>. This is an example of a Krist Address: kg5dc1lzo0<br/>
<br/>
To put KST into circulation, it has to be mined. This involves lots of work done by computers, and means that I can&#39;t just &quot;spawn in&quot; as much KST as I want. I have to mine it like everyone else.<br/>
Initially, KST was mined by in-game computers, but now requires external software.<br/>
<br/>
<strong class='bbc'>Wallet installer:</strong> <pre class='prettyprint'>pastebin run Yv0fChz5</pre>
<br/>
Please post your questions, feedback, insight and Krist Addresses&#33; <strong class='bbc'>There is documentation for every node API call in my profile.</strong>
<!-- google_ad_section_end -->
<br/>
<p class='edit'>
<strong>Edited by 3d6, 14 February 2016 - 04:46 PM.</strong>
</p>
</div>

You won't be able to extract most of the content.

If you take this line <em class='bbc'>banana</em>, you will be able to extract the content (banana) with node:gettext(), but you won't be able to extract anything that isn't inside a tag at all (I believe we call that a text node?).

For example, in this html code: <html><b>inside</b> outside</html>

You can extract the word "inside" because it sits inside the <b> tag and so ends up in a node, but not the word "outside", even though modern browsers display both words, so both matter.

I believe that "outside" should also be put in a node, just a node with an empty "name" field. Or call it a text node maybe.

Maybe I missed something, but I couldn't find how to extract most of the content in the html further above.

This makes it difficult when trying to convert an html document to plain text, as all web browsers actually DO display these.
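
For the plain-text conversion described above, one library-independent approach is to strip the markup with Lua string patterns. The following is only a minimal sketch: `html_to_text` and the small `entities` table are hypothetical names invented for this illustration, they are not part of lua-htmlparser, and the patterns ignore cases such as `<script>` or `<pre>` content.

```lua
-- Hypothetical, library-independent sketch (not lua-htmlparser API).
-- Only a handful of named entities are covered; extend as needed.
local entities = { amp = "&", lt = "<", gt = ">", quot = '"', nbsp = " " }

local function html_to_text(html)
  local text = html
  text = text:gsub("<!%-%-.-%-%->", "")   -- drop comments such as <!-- ... -->
  text = text:gsub("<br%s*/?>", "\n")     -- turn <br>, <br/> and <br /> into newlines
  text = text:gsub("<[^>]*>", "")         -- drop every remaining tag, keep the text around it
  -- decode numeric entities in the ASCII range; others are left untouched
  text = text:gsub("&#(%d+);", function(n)
    local code = tonumber(n)
    if code < 128 then return string.char(code) end
  end)
  -- decode the named entities listed above; unknown names are left untouched
  text = text:gsub("&(%a+);", entities)
  return text
end

print(html_to_text("<html><b>inside</b> outside</html>"))
print(html_to_text("I can&#39;t just &quot;spawn in&quot;<br/>"))
```

Because the tags are removed but the text between them is kept, both "inside" and "outside" survive, which is closer to what a browser displays.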

msva commented 7 years ago

Hi! Sorry for ignoring the issue for about a year. Unfortunately, GitHub didn't send me notifications about new issues after the previous maintainer rerooted the repo. It looks like he was the one who got all those emails.

I'll try to handle all the issues soon.

msva commented 7 years ago

Actually, it is not that clear what this should look like:

How should it work with <html>outside1 <b>inside</b> outside2</html>? Which text should be in the node? outside1 outside2? 😎
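
One possible answer, sketched here purely as an illustration and not as how the library behaves, is to treat each run of text between child elements as its own item, in document order. `direct_text_runs` is a hypothetical helper working on the inner content string; it assumes child elements are written as `<name ...>...</name>` and are not nested inside another element of the same name.

```lua
-- Hypothetical helper (not lua-htmlparser API): collect the text runs that sit
-- directly inside an element, skipping over paired child elements.
local function direct_text_runs(content)
  local runs = {}
  local pos = 1
  while true do
    -- find the next paired child element; %1 refers back to the captured tag name,
    -- and the frontier %f[^%w] makes sure the whole tag name was captured
    local s, e = content:find("<(%w+)%f[^%w][^>]*>.-</%1>", pos)
    local chunk = content:sub(pos, (s or #content + 1) - 1)
    local trimmed = chunk:match("^%s*(.-)%s*$")
    if trimmed ~= "" then
      runs[#runs + 1] = trimmed
    end
    if not s then break end
    pos = e + 1
  end
  return runs
end

local runs = direct_text_runs("outside1 <b>inside</b> outside2")
for i, text in ipairs(runs) do
  print(i, text)
end
```

With this reading, "outside1" and "outside2" stay separate items instead of being merged into one string.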

Also, you can do something like:

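local htmlparser = require("htmlparser")

-- r is the root node; the sample markup is the one from the question
local r = htmlparser.parse("<html><b>inside</b> outside</html>")

-- r"selector" is shorthand for r:select("selector")
local outside = r"html"
local inside = r"html b"

-- content of every <b> inside <html>
for k,v in pairs(inside) do
    print(v:getcontent())
end

-- content of <html> itself, markup included
for k,v in pairs(outside) do
    print(v:getcontent())
end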

And in your particular case from the initial question, you can just strip the child elements (tags and their content) from the html node with something like :gsub("<(.-)>(.-)</%1>","") applied to getcontent() in the output loop.
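
The gsub idea can be tried without the parser at all. This is a rough sketch under assumptions: `strip_elements` is a hypothetical helper, closing tags are assumed to be written as `</name>`, and only the tag name is captured so that opening tags with attributes (such as `<em class='bbc'>`) still pair up with their closing tag. Nested elements of the same name are not handled.

```lua
-- Hypothetical helper (not lua-htmlparser API): remove every paired element,
-- tags and content together, leaving only the text outside of them.
local function strip_elements(content)
  -- (%w+) captures the tag name, %f[^%w] ensures the full name was captured,
  -- [^>]* skips attributes, and %1 requires the same name in the closing tag
  local stripped = content:gsub("<(%w+)%f[^%w][^>]*>.-</%1>", "")
  return stripped
end

print(strip_elements("<b>inside</b> outside"))
print(strip_elements("led by a lowercase <em class='bbc'>k</em>."))
```

Note that this drops "inside" along with its tags, which is what extracting only the "outside" text requires; to keep all text, strip the tags alone instead.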

P.S. I'll close the issue for now, but don't hesitate to post your questions here.