w3stling / rssreader

A simple Java library for reading RSS and Atom feeds
MIT License

Trying to access certain RSS feed, but get 403 #153

Closed andyrozman closed 5 months ago

andyrozman commented 5 months ago

Hi! I am trying to access a specific RSS feed, but I get a 403 error back. When I try to access the same website with the JSoup parser, I get the same error (in the end I had to use Selenium to parse the pages). Have you run into a similar problem before? Andy

w3stling commented 5 months ago

Hi @andyrozman

Can you share the URL to the RSS feed?

Here is an issue that was reported with response status code 403. It was resolved by setting the HTTP User-Agent header.
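The thread does not show how the User-Agent was set, so the following is only a minimal sketch of the idea using the JDK's `java.net.http` client. The class name, the feed URL, and the User-Agent string are placeholders made up for illustration; they are not taken from the issue or from rssreader itself.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class UserAgentExample {
    // Hypothetical values, for illustration only
    static final String FEED_URL = "https://example.com/feed.xml";
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFeedReader/1.0)";

    // Builds a GET request that carries an explicit User-Agent header
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildRequest(FEED_URL, USER_AGENT),
                HttpResponse.BodyHandlers.ofString());
        // A 403 here would suggest the server rejects more than just the default User-Agent
        System.out.println(response.statusCode());
    }
}
```

If the server answers with 200, the response body could then be handed to `RssReader` as an `InputStream`. If it still answers with 403, the block is probably based on something other than the User-Agent header.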

andyrozman commented 5 months ago

I don't want to post it on a public forum, but if you could send me an email (see my profile), I can send it that way. Setting the User-Agent didn't help. While debugging the problem with the parser, I noticed that the site requires cookies and also requires JavaScript to be enabled.

w3stling commented 5 months ago

Thanks for the URL. Unfortunately, I was not able to get around the status code 403 problem.

I have had this problem when scraping web pages, but not with RSS feeds.

The page source / RSS feed extracted using Selenium can be passed as an InputStream to RssReader.

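// pageSource holds the RSS XML extracted with Selenium
String pageSource = "...";
// Wrap the string in an InputStream so that RssReader can parse it
InputStream inputStream = new ByteArrayInputStream(pageSource.getBytes(StandardCharsets.UTF_8));
List<Item> items = new RssReader().read(inputStream).collect(Collectors.toList());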
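Before handing a Selenium-extracted string to RssReader, it may help to check that it is well-formed XML and not an HTML wrapper page. The sketch below does this with the JDK's built-in DOM parser and counts the `<item>` elements. The class name `FeedSanityCheck` and the sample feed are made up for illustration; this is a pre-check under those assumptions, not part of rssreader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class FeedSanityCheck {
    // Parses the string as XML and returns the number of <item> elements.
    // Throws an exception if the input is not well-formed XML (for example, sloppy HTML).
    static int countItems(String pageSource) throws Exception {
        InputStream in = new ByteArrayInputStream(pageSource.getBytes(StandardCharsets.UTF_8));
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations as a precaution against XXE
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(in);
        return doc.getElementsByTagName("item").getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-item feed used only to demonstrate the check
        String sample = "<rss version=\"2.0\"><channel><title>Demo</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(countItems(sample)); // prints 2
    }
}
```

An exception from `countItems` usually means the browser returned an HTML page instead of the raw feed, so the feed still has to be extracted from that page before it is parsed.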

andyrozman commented 5 months ago

That was my thought too, to try this. But now I am facing a different problem: I am using Firefox as the Selenium driver, and I can't configure it to open XML (which is what an RSS feed is) in the browser; it always downloads it. I tried view-source, which has its own problems. And even if we get this to work, it seems to show only the last 5 items in the feed. I don't know how, but when I use Feedly it shows me a lot more items. It seems to be doing some "magic" in the background.

w3stling commented 5 months ago

As far as I know, there is nothing in the RSS specification that the client can use to control how many items are published in the RSS feed.

w3stling commented 5 months ago

The RSS feed is rendered by JavaScript, which is why it can't be accessed with an HttpClient or JSoup.

When using the Firefox Selenium driver, add the prefix view-source: to the URL, or use the Chrome driver.

// Download page source
WebDriver driver = new FirefoxDriver();
String pageSource;
try {
    driver.get("view-source:https://...");
    pageSource = driver.getPageSource();
} finally {
    driver.quit(); // always close the browser, even if loading the page fails
}

// Extract RSS feed from page source
Document doc = Jsoup.parse(pageSource);
String rssFeed = doc.select("pre").text();

// Parse RSS feed
InputStream inputStream = new ByteArrayInputStream(rssFeed.getBytes(StandardCharsets.UTF_8));
List<Item> items = new RssReader().read(inputStream).collect(Collectors.toList());

andyrozman commented 5 months ago

Thank you for your help. We can close this issue.