[FR]: Parsers, extractors for easier life.

barolo commented 3 years ago

Brief description of the feature request.

There is Readability.js (with various implementations, some for the CLI), which powers Firefox's page readability feature by extracting only the meaningful content and discarding the rest. There is also Article-Parser, and Mercury, a parser/extractor with its own CLI, which is even more powerful and has wider coverage.

We have pre/post-processing ability per feed (which I love); what about per-URL processing with extractors such as the ones mentioned above?

martinrotter commented 3 years ago

Hello @barolo.

Can you write a more elaborate step-by-step description of how the feature should work? I am just not sure I understand what you actually want. Sorry. :( :)

As for processing abilities, you can for sure use all of the above JS libraries either from message filters or in message/web scraping.

I could relatively easily write, for example, a Python plugin which would (optionally) download a set of URLs (either scraped or passed as parameters), then process their contents via e.g. Readability and go from there.

I wrote some extra scrapers in the RSS Guard 4.x branch.

For example, this script downloads the "In the news" articles from the main page of Wikipedia. You could easily rewrite the script to download the full sub-pages of those articles and feed them to Readability or some other solution.

barolo commented 3 years ago

This might be what I wanted, I'm gonna fiddle with the examples for scraping. Are these only for 4.x for some reason?

martinrotter commented 3 years ago

@barolo No, they should work with 3.9.0+ just fine. They are in the 4.x branch just because I had the branch active when I was writing those scrapers. They can work even outside of RSS Guard.

Anyway, test it and please let me know whether it is OK or not.

There is a Python package

https://pypi.org/project/readability-lxml/

which can easily be used from a Python script/scraper.

Also: https://github.com/codelucas/newspaper

barolo commented 3 years ago

I'm a bit confused as to how I can feed an article URL from the feed to newspaper, for example. Is there some placeholder for the article URL that RSS Guard provides?

martinrotter commented 3 years ago

@barolo Note that this is not entirely basic stuff; you need some basic command-line (terminal) skills.

In general, "scrapers" are scripts which take some input and write valid RSS/Atom/JSON feed content to their output.

Some scrapers may also take input "parameters" which tweak the scraper's behavior. For example, you can test the "wiki-inthenews.py" scraper in your command line (cmd.exe, PowerShell, Bash) by running it with the Python interpreter: python .\wiki-inthenews.py. The script will work for a few seconds and then print out the raw JSON data of the desired feed.
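To make the idea concrete, here is a minimal sketch of a scraper of this type. It is not the actual wiki-inthenews.py: the function name `build_json_feed`, the tuple layout, and the sample article are assumptions made for illustration, and the downloading/parsing step is replaced by hard-coded data. It only shows the output side, a JSON Feed document printed to standard output.

```python
import json
import sys


def build_json_feed(title, articles):
    """Return a JSON Feed 1.1 document (as a dict) for the given articles.

    `articles` is a list of (url, headline, html) tuples; this layout is
    an assumption of the sketch, not something RSS Guard prescribes.
    """
    return {
        "version": "https://jsonfeed.org/version/1.1",
        "title": title,
        "items": [
            # The article URL doubles as the unique item id.
            {"id": url, "url": url, "title": headline, "content_html": html}
            for url, headline, html in articles
        ],
    }


if __name__ == "__main__":
    # Hard-coded sample data; a real scraper would download and parse a page here.
    sample = [("https://example.com/a", "First article", "<p>Hello</p>")]
    json.dump(build_json_feed("Example feed", sample), sys.stdout, indent=2)
```

Running the file with the Python interpreter prints the feed as JSON, which is the kind of output a "Script" source is expected to produce.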

This is what this type of scraper looks like when used from RSS Guard:

You see that "source" is "Script". In other words, the script does all the work. It downloads the needed files and produces the final JSON feed.

Some other scripts have to be fed with data from another source. I call these "post-process" scripts because they usually do not download the data themselves; they get some data, do some magic on it and produce a result, for example: curl 'https://phys.org/rss-feed/' | python ./translate-rss2.py "en" "pt_BR" "true". This command downloads a feed from the given URL and pipes the data to the scraper, which then translates it into another language.

This is how it looks in RSS Guard:

You see that the source is "URL". In other words, RSS Guard downloads the needed file, and the script is then used as the "Post-process script". RSS Guard automatically feeds the raw downloaded data to the script.
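A post-process script of this kind boils down to "read the feed from standard input, change it, write it to standard output". The sketch below assumes an RSS 2.0 feed and uses a deliberately trivial change (prefixing every item title); the names `tag_titles` and `main` and the default prefix are hypothetical and are not taken from translate-rss2.py.

```python
import sys
import xml.etree.ElementTree as ET


def tag_titles(rss_xml, prefix):
    """Return the RSS 2.0 document with `prefix` prepended to every item title."""
    root = ET.fromstring(rss_xml)
    for title in root.iterfind("./channel/item/title"):
        title.text = prefix + (title.text or "")
    return ET.tostring(root, encoding="unicode")


def main(argv=None, stdin=None, stdout=None):
    """Read a feed from stdin, transform it, and write it to stdout."""
    argv = sys.argv if argv is None else argv
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # The first script parameter, if any, overrides the default prefix.
    prefix = argv[1] if len(argv) > 1 else "[processed] "
    stdout.write(tag_titles(stdin.read(), prefix))


# Call main() at the end of the file to use this as a real post-process script.
```

With `main()` called at the bottom, the file can be tested outside RSS Guard by piping a downloaded feed into it, the same way as the curl example above. Note that ElementTree rewrites namespace prefixes on output unless they are registered with `ET.register_namespace`.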

martinrotter commented 3 years ago

As for your "newspaper" request. Here is the screenshot:

I pushed a new version of the scraper which uses the "newspaper" Python package to parse/scrape simplified HTML content for articles.

https://github.com/martinrotter/rssguard/blob/version-4/resources/scripts/scrapers/wiki-inthenews.py#L25

Here is the diff: https://github.com/martinrotter/rssguard/commit/ae0fa64318a0ef1af79af774ffd37bce839cb361#diff-5a72a8e5a1c41d88d6ae5a7b3c248b227acd54840e1ed05b983c2f813177d7a8

barolo commented 3 years ago

Yes, that's what I meant, more or less. Would it be possible to 'generalize' that script, i.e. to apply it to any "normal" feed? Because this Wikipedia scraper creates the feed first, doesn't it? Sorry, coding isn't my strong side ;)

martinrotter commented 3 years ago

@barolo So what should the script do, step by step:

Get some URL, for example "https://archlinux.org".
Download all articles from that URL with some magic (with newspaper in this case).
Produce a valid JSON feed.

Like this?

barolo commented 3 years ago

Yes, more or less. Let's focus on proper feeds, which already have article URLs, don't they?
But let's say they don't provide valuable content, or it is severely cut.
I'd want to wring them through newspaper before displaying them.

martinrotter commented 3 years ago

@barolo Yes, OK, I will make some scraper for RSS 2.0 feeds for you. Maybe tonight. Gimme some days, OK?

barolo commented 3 years ago

No biggie, take your time. I'd want to make it as simple as possible.

We have a valid feed.
Pass the articles to an extractor.
Get them back and display them.

I'd want to make it as generic as possible for reuse with other extractors, since some are really powerful and with them you pretty much never have to leave the reader.

martinrotter commented 3 years ago

@barolo I wrote the scraper for you.

https://github.com/martinrotter/rssguard/blob/version-4/resources/scripts/scrapers/scrape-rss2.py

What it does:

It decodes the provided XML RSS 2.0 data.
It reads the hyperlink of each article, downloads its source webpage, parses its contents and replaces the original short text in the feed with the full article text/HTML.
It returns the modified feed data to output.
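The three steps above can be sketched with the standard library alone. This is not the code of scrape-rss2.py: `expand_feed`, `fetch_html` and the `extract` parameter are hypothetical names, and `fetch_html` simply returns the raw page HTML as a stand-in for a real extractor such as newspaper. Articles are fetched in a thread pool, and an item keeps its original short text if its page cannot be processed.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its HTML as text (stand-in for an extractor)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def expand_feed(rss_xml, extract=fetch_html, threads=4):
    """Replace each item's <description> with the result of extract(link)."""
    root = ET.fromstring(rss_xml)
    # Only items that actually carry a link can be expanded.
    items = [i for i in root.iterfind("./channel/item") if i.findtext("link")]
    links = [i.findtext("link").strip() for i in items]

    def safe_extract(url):
        try:
            return extract(url)
        except Exception:
            return None  # signal "keep the original description"

    with ThreadPoolExecutor(max_workers=threads) as pool:
        bodies = list(pool.map(safe_extract, links))  # order matches `items`

    for item, body in zip(items, bodies):
        if body is None:
            continue
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        description.text = body
    return ET.tostring(root, encoding="unicode")
```

Passing a different callable as `extract` swaps the extractor without touching the feed handling, which is one way to keep such a script generic.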

I tested the script with the feed http://rss.cnn.com/rss/edition.rss, which works nicely. The screenshot below shows the feed with full articles scraped via the script. Note that the newspaper3k library used in this script does not seem to be very up to date, as it does not work well with some websites. You have to try it, see, and tweak the scripts to suit your needs exactly.

martinrotter commented 3 years ago

@barolo I added another variant of the scraper which uses article-parser:

https://github.com/martinrotter/rssguard/blob/version-4/resources/scripts/scrapers/scrape-as-rss2.py

Screen:

barolo commented 3 years ago

@martinrotter Would you be so kind and show me how the command should look in the Scripts input field? I have a hard time getting it right

martinrotter commented 3 years ago

@barolo First, read the full documentation page: https://github.com/martinrotter/rssguard/blob/version-4/resources/docs/Documentation.md#websites-scraping

It is all written there. For this particular scraper, you need to make sure you have "Source" set to "URL", for example http://feeds.bbci.co.uk/news/england/rss.xml

Then you need to set "post-process script" to:

python#c:\path\to\your\scraper\scrape-as-rss2.py#16

where 16 is the number of threads used to scrape articles; a number between 2 and 16 would be fine.
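On the script side, that trailing parameter arrives as an ordinary command-line argument. A minimal sketch of reading it follows; the function name `thread_count`, the fallback of 4 and the clamping to the 2 to 16 range are assumptions of this example, not the behavior of scrape-as-rss2.py.

```python
import sys


def thread_count(argv, default=4, low=2, high=16):
    """Return the thread count from argv[1], clamped to the range low..high."""
    try:
        requested = int(argv[1])
    except (IndexError, ValueError):
        # Parameter missing or not a number: fall back to the default.
        return default
    return max(low, min(high, requested))


if __name__ == "__main__":
    print(thread_count(sys.argv))
```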

martinrotter commented 3 years ago

You also have to have Python installed and added to your "PATH" environment variable.

barolo commented 3 years ago

I'm on Linux, so interpreters can be omitted altogether most of the time as they're picked automagically. Gonna try it a bit later. Thanks!

martinrotter / rssguard

[FR]: Parsers, extractors for easier life. #399