Closed barolo closed 3 years ago
Hello @barolo.
Can you write a more elaborate step-by-step description of how the feature should work? I am just not sure I understand what you actually want. Sorry. :( :)
As for processing abilities, you can for sure use all of the above JS libraries either from message filters or in message/web scraping.
I could relatively easily write, for example, a Python plugin which will (optionally) download a set of URLs (either scraped or obtained via parameters), then process their contents via e.g. readability and go from there.
I wrote some extra scrapers in the RSS Guard 4.x branch.
For example, this script downloads the "In the news" articles from the main page of Wikipedia. You could easily rewrite the script to download the full sub-pages of those articles and feed them to Readability or some other solution.
This might be what I wanted; I'm gonna fiddle with the scraping examples. Are these only for 4.x for some reason?
@barolo No, they should work with 3.9.0+ just fine. They are in the 4.x branch only because I had that branch active when I was writing those scrapers. They even work outside of RSS Guard.
Anyway, please test it and let me know if it is OK or not.
There is a Python package
https://pypi.org/project/readability-lxml/
which could easily be used from a Python script/scraper.
I'm a bit confused as to how I can feed the article URL from the feed to newspaper, for example. Is there some placeholder for the article URL that RSS Guard provides?
@barolo Note that this is not entirely basic stuff; you need some basic command-line (terminal) skills.
In general, "scrapers" are scripts which take some input and produce valid RSS/ATOM/JSON content on output.
Some scrapers may also take input "parameters" which tweak the scraper's behavior. For example, you can simply test the "wiki-inthenews.py" scraper from your command line (cmd.exe, PowerShell, Bash) by running it with the Python interpreter: python .\wiki-inthenews.py
The script will run for a few seconds and then print out the raw JSON data of the desired feed.
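To make the contract concrete: a scraper is just any program that writes a valid feed to standard output. Here is a minimal sketch (the feed title, item, and URLs are made-up placeholders, and JSON Feed 1.1 is used only as an example output format; a real scraper would build its items from downloaded pages):

```python
import json


def build_feed() -> str:
    """Build a minimal JSON Feed 1.1 document as a string.

    A scraper's only contract with RSS Guard is to print a valid
    RSS/ATOM/JSON feed to standard output; how the items are obtained
    is entirely up to the script.
    """
    feed = {
        "version": "https://jsonfeed.org/version/1.1",
        "title": "Example scraped feed",
        "items": [
            {
                "id": "https://example.com/article-1",
                "title": "First scraped article",
                "content_html": "<p>Article body goes here.</p>",
            }
        ],
    }
    return json.dumps(feed)


if __name__ == "__main__":
    # RSS Guard consumes whatever the script prints to stdout.
    print(build_feed())
```

Running it from a terminal prints the JSON feed, exactly as the "wiki-inthenews.py" example does.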
This is how this type of scraper looks when used from RSS Guard:
You see that "Source" is "Script". In other words, the script does all the work: it downloads the needed files and produces the final JSON feed.
Some other scripts have to be fed data from another source. I call these "post-process" scripts because they usually do not download the data themselves; they receive some data, do some magic on it, and produce a result. For example: curl 'https://phys.org/rss-feed/' | python ./translate-rss2.py "en" "pt_BR" "true"
This command downloads a feed from the given URL and then feeds the data to the scraper, which localizes it to another language.
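The shape of such a post-process script can be sketched as follows: read the raw feed from stdin, transform it, write the result to stdout. This is not the actual translate-rss2.py, just an illustration; the "[processed]" prefix stands in for real work such as translation:

```python
import sys
import xml.etree.ElementTree as ET


def postprocess(raw_xml: str) -> str:
    """Tag every item title in an RSS 2.0 document.

    A post-process script receives the raw downloaded feed on stdin,
    transforms it, and prints the modified feed, which RSS Guard then
    consumes in place of the original.
    """
    root = ET.fromstring(raw_xml)
    for item in root.iter("item"):
        title = item.find("title")
        if title is not None and title.text:
            title.text = "[processed] " + title.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sys.stdout.write(postprocess(sys.stdin.read()))
```

Piping any RSS 2.0 feed through it (e.g. via curl, as above) yields the same feed with tagged titles.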
This is how this looks in RSS Guard:
You see the source is "URL"; in other words, RSS Guard downloads the needed file, and the script is then used as a "Post-process script". RSS Guard automatically feeds the raw downloaded data to the script.
As for your "newspaper" request. Here is the screenshot:
I pushed a new version of the scraper which uses the "newspaper" Python package to parse/scrape simplified HTML content for articles.
Here is the diff: https://github.com/martinrotter/rssguard/commit/ae0fa64318a0ef1af79af774ffd37bce839cb361#diff-5a72a8e5a1c41d88d6ae5a7b3c248b227acd54840e1ed05b983c2f813177d7a8
Yes, that's what I meant, more or less. Would it be possible to "generalize" that script, i.e. to apply it to any "normal" feed? Because this Wikipedia scraper creates the feed first, doesn't it? Sorry, coding isn't my strong side ;)
@barolo So what should the script do, step-by-step: take an existing feed and run each article through the extractor (newspaper in this case)? Like this?
Yes, more or less.
Let's focus on proper feeds, which already have article URLs, don't they?
But let's say they don't provide valuable content, or it is severely cut.
I'd want to run them through newspaper before displaying them.
@barolo Yes, OK, I will make a scraper for RSS 2.0 feeds for you. Maybe tonight. Gimme some days, OK?
No biggie, take your time. I'd want to make it as simple as possible.
I'd want to make it as generic as possible for reuse with other extractors, since some are really powerful; with them you pretty much never have to leave the reader.
@barolo I wrote the scraper for you.
https://github.com/martinrotter/rssguard/blob/version-4/resources/scripts/scrapers/scrape-rss2.py
What it does:
I tested the script with the feed http://rss.cnn.com/rss/edition.rss
and it works nicely. The screenshot below shows a feed with full articles scraped via the script. Note that the newspaper3k library
used in this script does not seem very well maintained, as it simply does not work with some websites. You have to experiment and tweak the scripts to suit your exact needs.
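The general shape of such a feed-rewriting scraper can be sketched like this. This is not the actual scrape-rss2.py, only an illustration of the idea: walk the items of an RSS 2.0 feed, fetch each article link on a small thread pool, and replace each description with the extractor's full-text output. The `extract` callable is a stand-in for whatever library you plug in (newspaper3k, readability-lxml, article-parser):

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def rewrite_feed(raw_xml: str, extract, threads: int = 4) -> str:
    """Replace each item's <description> with the output of `extract`.

    `extract` is any callable mapping an article URL to article HTML.
    Articles are processed concurrently on a thread pool, mirroring
    the thread-count parameter the real scrapers accept.
    """
    root = ET.fromstring(raw_xml)
    items = root.findall("./channel/item")
    links = [item.findtext("link") for item in items]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        bodies = list(pool.map(extract, links))
    for item, body in zip(items, bodies):
        desc = item.find("description")
        if desc is None:
            desc = ET.SubElement(item, "description")
        desc.text = body
    return ET.tostring(root, encoding="unicode")
```

Swapping extractors then means swapping only the `extract` callable; the feed plumbing stays the same.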
@barolo I added another variant of the scraper which uses article-parser
https://github.com/martinrotter/rssguard/blob/version-4/resources/scripts/scrapers/scrape-as-rss2.py
Screen:
@martinrotter Would you be so kind as to show me how the command should look in the script's input field? I'm having a hard time getting it right.
@barolo First, read full documentation page: https://github.com/martinrotter/rssguard/blob/version-4/resources/docs/Documentation.md#websites-scraping
It is all described there. For this particular scraper, you need to make sure "Source" is set to "URL", for example http://feeds.bbci.co.uk/news/england/rss.xml
Then you need to set "Post-process script" to:
python#c:\path\to\your\scraper\scrape-as-rss2.py#16
where 16
is the number of threads used to scrape articles; any number between 2 and 16 is fine.
You also have to have Python installed and added to your "PATH" environment variable.
I'm on Linux, so interpreters can be omitted altogether most of the time as they're picked automagically. Gonna try it a bit later. Thanks!
Brief description of the feature request.
There is Readability.js and its various implementations, some for the CLI, which powers Firefox's reader mode (it extracts only the meaningful content and discards the junk); Article-Parser; and also Mercury, a parser/extractor with its own CLI, which is even more powerful, with wider coverage.
We have pre/post-processing ability per RSS feed (which I love); what about per URL, for extractors like the ones mentioned above?