Open animusastralis opened 5 years ago
Do you want to strip all HTML tags? `normalize-space(.)` or `string(.)` do that.
If you just want to remove that one specific div, it's probably easiest to do it with a Liquid `replace` filter, in either the `template` option of the WebsiteAgent or an EventFormattingAgent.
Yes, I know that these functions strip all HTML tags. What I want is a way to strip specific HTML tags together with the content inside them, in order to remove unwanted parts like ads, links to related articles, etc.
I suspected that the `template` option is what I would need, but the particular implementation is unclear to me. For instance, I have an event with a payload that looks like:
{
  "title": "ARTICLE TITLE",
  "date_published": "21.04.2019",
  "author": [
    "ARTICLE AUTHOR"
  ],
  "description": "<div class=\"article-body\">ARTICLE TEXT<div class=\"ads\">AD TEXT<\/div><\/div>",
  "url": "https://example.com/article"
}
How would you strip `<div class=\"ads\">AD TEXT<\/div>` from this payload?
You could use the Liquid `regex_replace` filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent: it has a few built-in rules to clean up HTML, and you can also specify a whitelist and a blacklist.
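To illustrate why a parser is more robust than a regular expression here, the following is a minimal Python sketch, not Huginn or ReadabilityAgent code. It uses the standard library's `html.parser` to drop every element that carries a blocked class, including nested children. The names `ClassStripper` and `strip_classes` are hypothetical, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emits HTML while dropping elements that carry a blocked class."""

    def __init__(self, blocked_classes):
        super().__init__(convert_charrefs=False)
        self.blocked = set(blocked_classes)
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.out = []

    def _is_blocked(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(self.blocked.intersection(classes))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1  # nested element inside a dropped one
            return
        if self._is_blocked(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1  # start dropping until the matching close
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if not self.skip_depth and not self._is_blocked(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_classes(html, blocked_classes):
    """Return html without any element whose class list hits blocked_classes."""
    parser = ClassStripper(blocked_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the parser tracks nesting depth, an ad block that itself contains further `div` elements is removed as a whole, which is the case a non-greedy regular expression gets wrong. HTML comments are dropped as a side effect of this sketch.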
Thanks for your help, I think I've achieved what I was aiming for. Considering that ad blocks almost always have a unique class name, `regex_replace` plus the `template` option should work well enough.
So I've added a `template` option:
"template": {
"description": "{{ description | regex_replace: '<div class=\\x22ads\\x22>(.|\n)*?</div>', '' }}"
}
It doesn't look very nice but it works. If there is a way to make a nicer expression I'll always be glad to see it!
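For readers who want to see what this pattern does and where it stops working, here is a small Python analogue. It is only a sketch: Huginn evaluates the expression with Liquid and Ruby, not Python, and the helper name `strip_ads` is hypothetical. `re.DOTALL` plays the role of `(.|\n)`, and the payload is a trimmed copy of the one quoted earlier in the thread.

```python
import json
import re

# Non-greedy match from the opening ads div to the first closing </div>.
# re.DOTALL lets "." match newlines, like "(.|\n)" in the template above.
AD_BLOCK = re.compile(r'<div class="ads">.*?</div>', re.DOTALL)


def strip_ads(html):
    """Remove every non-nested <div class="ads">...</div> block."""
    return AD_BLOCK.sub("", html)


# Trimmed copy of the example event payload; JSON decoding turns <\/div> into </div>.
payload = json.loads(
    r'''{"description": "<div class=\"article-body\">ARTICLE TEXT<div class=\"ads\">AD TEXT<\/div><\/div>"}'''
)

cleaned = strip_ads(payload["description"])

# Limitation: with a nested div the match ends at the first </div>,
# so part of the ad block is left behind.
nested = '<div class="ads">AD <div>inner</div> TEXT</div>'
leftover = strip_ads(nested)
```

For the flat payload, `cleaned` is `<div class="article-body">ARTICLE TEXT</div>`. For the nested case, `leftover` is ` TEXT</div>`, so the expression is only safe when the ad block contains no inner `div`.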
I'm creating a scenario that would let me 1) generate RSS (via Website agent, by extracting URLs), then 2) scrape the HTML of the corresponding articles (also via Website agent), and finally 3) emit full-text RSS (via DataOutput agent).
I struggle with stripping unwanted parts from the extracted article. For example, I use XPath to extract the article body:
It happens to contain a `<div class=\"visuallyhidden no-print\">some text</div>` part at the end, which then shows up in the DataOutput agent. Is there an option to completely strip this part? Maybe using some XPath function?