pictuga / morss

Get full text RSS feeds
https://morss.it/
GNU Affero General Public License v3.0

CLI version doesn't grab all articles, unlike web version #54

vosian closed 2 years ago

vosian commented 3 years ago

Hello, and sorry if this is a mistake on my part, but when I try to make an RSS feed for https://shonumi.github.io/articles.html, morss only grabs the first article. I'm using the following command: morss --items "//*[class=inner_text_large]" https://shonumi.github.io/articles.html

Using the website and selecting that element picks up 5 articles. For some reason the CLI version stops at the first one.

My version is current as I installed morss today.

vosian commented 3 years ago

I tried another site and got the same problem: morss only grabs one "article". Command used: morss --items "//*[class=bz_comment]" "https://bugzilla.kernel.org/show_bug.cgi?id=60824"

pictuga commented 3 years ago

Maybe try setting up caching (sqlite will probably do if you have a small installation) and/or increasing MAX_TIME/MAX_ITEM? See https://git.pictuga.com/pictuga/morss#environment-variables

vosian commented 3 years ago

Adding CACHE=sqlite MAX_ITEM=10 MAX_TIME=120 changed nothing, I'm still getting a single article.

I added DEBUG=1 (CACHE=sqlite MAX_ITEM=10 MAX_TIME=120 DEBUG=1 morss --items "//*[class=bz_comment]" "https://bugzilla.kernel.org/show_bug.cgi?id=60824") as well and got the following: error.txt

It caught my attention that there are 171 lines of "dropped". I checked the site, and running document.getElementsByClassName("bz_comment") in the browser shows 172 comments in total, so I figure each "dropped" line represents an "article" that's being ignored for some reason.
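For anyone who wants to repeat that comparison outside the browser, here is a rough sketch using only the Python standard library. It counts the elements carrying a given class in a saved copy of the page and the "dropped" lines in the debug output. The helper names are made up for illustration, and it assumes each dropped item produces exactly one log line containing the word "dropped"; the real log format may differ.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class_elements(html_text, class_name):
    """Hypothetical helper: number of elements with class_name in html_text."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count


def count_dropped(log_text):
    """Hypothetical helper: number of log lines mentioning "dropped".

    Assumption: one such line per dropped item.
    """
    return sum(1 for line in log_text.splitlines() if "dropped" in line)
```

If the element count minus the "dropped" count equals the number of items in the generated feed (here 172 - 171 = 1), that supports the guess that every "dropped" line is one ignored item.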

vosian commented 3 years ago

morss --items "/html/body/div[4]/div/main/div[2]/div/div[2]/div/div/div/div/h4/a" "https://github.com/pictuga/morss/tags"

This also gives a single result. I don't know if I'm messing something up on my side, but I don't understand why morss is dropping all but 1 entry for me. As far as I can see, there should be no setting forcing it to take only 1.

vosian commented 3 years ago

While I'm aware that the following site has an RSS feed, I tried morss directly on the site to test the issue I described. Here as well I'm getting a single article and several "dropped" notices.

morss --items "/html/body/div[1]/div[5]/div/div/div[1]/div/div[2]/div/div/div/div/div[1]/div/div[2]/div[1]/h2" https://pcsx2.net/

It might be worth noting that every single site I've tested morss on has returned a single feed, and I'm at my wits' end trying to find out what I'm doing wrong.

pictuga commented 3 years ago

Have you tried with SQLITE_PATH? Default path is in-memory and therefore cleared every time.

Also, have you checked what happens when adding --proxy?

vosian commented 3 years ago

SQLITE_PATH has no effect on entries being dropped. However, when using --proxy it seems no entries are dropped, so in the case of https://pcsx2.net/ it picks up 5 articles.

Maybe after a certain character limit it drops everything else you throw at it? Just a baseless guess.

vosian commented 2 years ago

Trying this again after a long time (I did update morss), the articles seem to be fetched correctly, so this can be considered fixed.