Does not pull links from its own blog post

ModestMC commented 3 years ago

So, this is super awesome. I was actually going to email you to ask some questions about how the results get stored and handled (e.g., adding a count of how many times a particular page showed up), but I decided I would start by trying the Heroku app.

I tried feeding in the blog post about the project, and even though there were links on that specific page, it didn't pull them. I assume that for this to work you need to point it at the main home page, but I thought this might be a bug.

I'd love to fiddle with handling the results; this is a super sweet idea. If this thing searched pages to a certain depth of recursion (or maybe just used BFS) and then did a frequency count, you'd be able to see the shape of the overall network and get a lazy ranking system. Cheers!
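
To make the BFS-plus-frequency-count idea concrete, here is a rough sketch in Python. It is not how RSS Discovery Engine works; `count_links`, `get_links`, and the tiny `graph` are all hypothetical stand-ins, and a real version would fetch feeds instead of reading a dict.

```python
from collections import Counter, deque

def count_links(start, get_links, max_depth=2):
    """Breadth-first walk from `start`, counting how often each page is linked to."""
    counts = Counter()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are counted but not expanded
        for target in get_links(page):
            counts[target] += 1  # every inbound link counts, even to seen pages
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return counts

# Hypothetical link graph standing in for real blogs and their feeds.
graph = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": ["a"],
}
ranking = count_links("a", lambda page: graph.get(page, []), max_depth=2)
```

Calling `ranking.most_common()` then lists the most-linked pages first, which is the "lazy ranking" part.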

yasamnoya commented 3 years ago

Pretty awesome project. Just another potential bug: some <description>s in RSS feeds may not contain the full content of the blog post. Does this engine also parse links that appear only in the HTML?

quakkels commented 3 years ago

@ModestMC,

You are correct that it won't pull links from the web page itself. It was intended to work with RSS feeds found on the page. And, embarrassingly, it looks like my Hugo theme is broken and does not correctly fill in the href for the RSS feed on any page other than the homepage.

So this appears to be a bug, though not with RSS Discovery Engine itself. Thanks for pointing it out.
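
For anyone following along, feed discovery of this kind usually means looking for a `<link rel="alternate">` tag with a feed MIME type in the page. The sketch below is an illustration in Python using only the standard library, not the engine's actual code; `FeedLinkFinder` and `find_feeds` are made-up names. It also shows why an empty href yields no feed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        # An empty or missing href is skipped: there is nothing to follow.
        if "alternate" in rel and mime in FEED_TYPES and href:
            self.feeds.append(urljoin(self.base_url, href))

def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

A page whose theme leaves `href=""` on that tag gives an empty list here, so there is no feed to read links from.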

ModestMC commented 3 years ago

No worries! This actually raises a really interesting point. A while back, I was trying to scrape a page to get a bunch of images of Gwent cards. However, the XML tree for the page was... ugly, to say the least. It might be smart to try a few workarounds for scraping feeds before calling a page a wash. As for which ones, if you can get the thing to record which errors it's seeing, maybe some common mistakes will turn up.
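
Something like this is what I have in mind, as a rough Python sketch. The `WORKAROUNDS` list and `parse_feed` are hypothetical, and the single workaround shown (dropping junk before the first tag) is just one guess at a common mistake:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical workarounds, tried in order; the first is a plain strict parse.
WORKAROUNDS = [
    ("strict", lambda text: text),
    ("drop junk before first tag", lambda text: text[max(text.find("<"), 0):]),
]

def parse_feed(text, errors):
    """Return the parsed root element, or None if every attempt fails.

    Each failure is tallied in `errors` under (attempt name, parser message).
    """
    for name, fix in WORKAROUNDS:
        try:
            return ET.fromstring(fix(text))
        except ET.ParseError as exc:
            # Drop the ": line N, column M" suffix so similar errors group together.
            reason = str(exc).rsplit(": line", 1)[0]
            errors[(name, reason)] += 1
    return None
```

Running this over a batch of feeds with a shared `Counter()` and then looking at `errors.most_common()` would show which failures are worth writing a workaround for.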

quakkels / rssdiscoveryengine

Does not pull links from its own blog post #13