Stripping parts of extracted html

huginn / huginn

Create agents that monitor and act on your behalf. Your agents are standing by!

MIT License


Stripping parts of extracted html #2527

Open animusastralis opened 5 years ago

animusastralis commented 5 years ago

I'm creating a scenario that would let me 1) generate an RSS feed (via a Website agent that extracts URLs), then 2) scrape the HTML of the corresponding articles (also via a Website agent), and finally 3) emit a full-text RSS feed (via a DataOutput agent).

I struggle with stripping unwanted parts from the extracted article. For example, I use XPath to extract the article body:

    "description": {
      "xpath": "//div[@class=\"gutter-left mobile-zero\"]",
      "value": "."
    }

It happens to contain a <div class="visuallyhidden no-print">some text</div> element at the end, which then shows up in the DataOutput agent's output.

Is there an option to completely strip this part? Maybe using some XPath function?

dsander commented 5 years ago

Do you want to strip all HTML tags? normalize-space(.) or string(.) do that.

If you just want to remove that one specific div, it's probably easiest to do it with a Liquid replace filter in either the template option of the WebsiteAgent or an EventFormattingAgent.
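To illustrate what string(.) and normalize-space(.) do to the extracted node, here is a minimal Python sketch. It is not Huginn code: it uses the standard library's ElementTree to imitate the two XPath functions, and the HTML fragment is a made-up stand-in for the article body.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the extracted article body.
html = (
    '<div class="gutter-left mobile-zero">'
    '<p>Article  text.</p>'
    '<div class="visuallyhidden no-print">some text</div>'
    '</div>'
)

root = ET.fromstring(html)

# Roughly what XPath string(.) returns: all descendant text nodes concatenated.
string_value = "".join(root.itertext())

# Roughly what normalize-space(.) returns: the same text with leading and
# trailing whitespace trimmed and inner runs collapsed to single spaces.
normalized = " ".join(string_value.split())

print(string_value)
print(normalized)
```

Both results drop the tags but keep the text of the hidden div, so these functions alone do not remove the unwanted part.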

animusastralis commented 5 years ago

Yes, I know that these functions strip all HTML tags. What I want is a way to strip specific HTML elements together with their content, in order to remove unwanted parts like ads, links to related articles, etc.

I suspected that the template option is what I need, but how to implement it is unclear to me. For instance, I have an event with a payload that looks like:

{
  "title": "ARTICLE TITLE",
  "date_published": "21.04.2019",
  "author": [
    "ARTICLE AUTHOR"
  ],
  "description": "<div class=\"article-body\">ARTICLE TEXT<div class=\"ads\">AD TEXT<\/div><\/div>",
  "url": "https://example.com/article"
}

How would you strip <div class=\"ads\">AD TEXT<\/div> from this payload?

dsander commented 5 years ago

You could use the Liquid regex_replace filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent: it has a few built-in rules to clean up HTML, and you can also specify a whitelist and a blacklist.
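As a rough illustration of the parser-based alternative to regular expressions, the sketch below removes every element that carries a given class, including anything nested inside it. It is plain Python using the standard library's html.parser, not a Huginn agent; the names ClassStripper and strip_class are hypothetical, and HTML comments are simply dropped to keep the example short.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect the depth counter.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML while dropping every element that has a given class."""

    def __init__(self, unwanted_class):
        super().__init__(convert_charrefs=False)
        self.unwanted_class = unwanted_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside an unwanted element

    def _unwanted(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return self.unwanted_class in classes

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1  # nested element inside the unwanted one
        elif self._unwanted(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1  # start skipping
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if not self.skip_depth and not self._unwanted(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_class(html, unwanted_class):
    parser = ClassStripper(unwanted_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


description = '<div class="article-body">ARTICLE TEXT<div class="ads">AD TEXT</div></div>'
print(strip_class(description, "ads"))
```

Counting nesting depth is what a single regular expression cannot do reliably, which is why the parser version handles an ad block that itself contains further divs.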

animusastralis commented 5 years ago

Thanks for your help, I think I've achieved what I was aiming for. Considering that ad blocks almost always have a unique class name, regex_replace plus the template option should work well enough.

So I've added a template option:

  "template": {
    "description": "{{ description | regex_replace: '<div class=\\x22ads\\x22>(.|\n)*?</div>', '' }}"
  }

It doesn't look very nice but it works. If there is a way to make a nicer expression I'll always be glad to see it!
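For anyone who wants to experiment with the pattern outside Huginn, here is a rough Python analogue. It is only a sketch: Python's re module stands in for the regex engine behind regex_replace, re.DOTALL plays the role of "(.|\n)", and the helper name strip_ads is hypothetical. Whether an equivalent flag can be used inside the Liquid filter is not confirmed in this thread.

```python
import re

# Python analogue of the pattern from the template option. re.DOTALL makes "."
# match newlines as well, replacing the "(.|\n)" alternation.
AD_BLOCK = re.compile(r'<div class="ads">.*?</div>', re.DOTALL)


def strip_ads(description):
    """Remove every non-nested <div class="ads">...</div> block."""
    return AD_BLOCK.sub("", description)


payload_description = (
    '<div class="article-body">ARTICLE TEXT'
    '<div class="ads">AD\nTEXT</div></div>'
)
cleaned = strip_ads(payload_description)
print(cleaned)

# Limitation: the non-greedy match stops at the first </div>, so an ad block
# that contains another div is only partly removed.
nested = '<div class="ads"><div>inner</div>tail</div>'
leftover = strip_ads(nested)
print(leftover)
```

The second case shows why this approach is only safe when the ad block contains no nested div elements.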