Closed andyrozman closed 5 months ago
Hi @andyrozman
Can you share the URL to the RSS feed?
Here is an issue that was reported with response status code 403. It was resolved by setting the HTTP user agent.
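For reference, a minimal sketch of what "setting the HTTP user agent" can look like with the JDK's built-in `java.net.http` client (Java 11+). The class name, the user agent string and the `example.com` URL are placeholders I made up for illustration; this is not taken from the linked issue and it may not help for sites that also require cookies or JavaScript.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class UserAgentRequest {
    // Hypothetical browser-like user agent string; some servers answer 403
    // when the default Java client user agent is sent.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFeedReader/1.0)";

    // Builds a GET request for the feed URL with an explicit User-Agent header.
    static HttpRequest buildRequest(String feedUrl) {
        return HttpRequest.newBuilder(URI.create(feedUrl))
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(25))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL, not the feed discussed in this thread.
        HttpRequest request = buildRequest("https://example.com/feed.xml");
        // Show the header that would be sent; no network call is made here.
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofInputStream())`, and the resulting body stream handed to a parser.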
I don't want to post it on a public forum, but if you send me an email (see my profile), I can send it that way. Setting the User-Agent didn't help... When I was debugging my problem with the parser, I noticed that the site requires cookies and also JavaScript to be enabled.
Thanks for the URL. Unfortunately I was not able to get around the status code 403 problem.
I have had this problem when scraping web pages but not for RSS feeds.
The page source / rss feed extracted using Selenium can be passed as an InputStream to RssReader.
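// Page source obtained from Selenium (elided here)
String pageSource = "...";
// Wrap the string in a stream and let RssReader parse it
InputStream inputStream = new ByteArrayInputStream(pageSource.getBytes(StandardCharsets.UTF_8));
List<Item> items = new RssReader().read(inputStream).collect(Collectors.toList());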
That was my thought too, to try this... But now I am facing a different problem: I am using Firefox as the Selenium driver, and I can't configure it to open XML (which is what an RSS feed is) in the browser; it always downloads it. I tried view-source, which has its own problems... And even if we get this to work, it seems to show only the last 5 items in the feed... I don't know how, but when I use Feedly it shows me a lot more items... It seems to be doing some "magic" in the background...
As far as I know there is nothing in the RSS specification that the client can use to control how many items are published in the RSS feed.
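To check how many items the server actually delivered in the XML that was extracted, a small JDK-only sketch like the one below could be used. The class and method names are hypothetical and the sample feed is made up; it simply counts `<item>` elements and does not use RssReader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FeedItemCount {
    // Counts the <item> elements in an RSS document given as a string.
    static int countItems(String rssXml) {
        try (InputStream in = new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8))) {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since the XML comes from an external site.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(in);
            return doc.getElementsByTagName("item").getLength();
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers get a single unchecked exception.
            throw new IllegalArgumentException("Could not parse the feed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed with two items.
        String sample = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(countItems(sample));
    }
}
```

If the count matches what the parser returns, the limit is on the publisher's side and not in the client.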
JavaScript renders the RSS feed, which is why it can't be accessed with an HttpClient or JSoup.
When using the Firefox Selenium driver, add the prefix view-source: to the URL, or use the Chrome driver.
// Download page source
WebDriver driver = new FirefoxDriver();
driver.get("view-source:https://...");
String pageSource = driver.getPageSource();
driver.quit();
// Extract RSS feed from page source
Document doc = Jsoup.parse(pageSource);
String rssFeed = doc.select("pre").text();
// Parse RSS feed
InputStream inputStream = new ByteArrayInputStream(rssFeed.getBytes(StandardCharsets.UTF_8));
List<Item> items = new RssReader().read(inputStream).collect(Collectors.toList());
Thank you for your help. We can close this issue...
Hi! I am trying to read a specific RSS feed, but I get a 403 error back... When I try to access the same website with the JSoup parser, I get the same error (I had to use Selenium in the end to be able to parse the pages)... Have you run into a similar problem before? Andy