martinrotter / rssguard

Feed reader (and podcast player) which supports RSS/ATOM/JSON and many web-based feed services.
GNU General Public License v3.0
1.64k stars 125 forks

[BUG]: CSS2RSS doesn't work #1519

Closed hercyle closed 1 month ago

hercyle commented 1 month ago

Brief description of the issue

escape characters seem to disappear (?) in the current RSS Guard version, 4.7.4

How to reproduce the bug?

  1. add a new feed for a website that doesn't support RSS feeds
  2. fetch metadata

What was the expected result?

it should have fetched the metadata successfully and then fetched the articles successfully

What actually happened?

i couldn't fetch the metadata or the articles, even after giving the feed a custom title and saving it.

Debug log

time=" 63.583" type="critical" -> database: Cannot overwrite feed: 'script threw an error: 'Traceback (most recent call last):
  File "/home/user/.config/RSS Guard 4/css2rss.py", line 146, in <module>
    found_items = soup.select(sys.argv[1])
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/bs4/element.py", line 2116, in select
    return self.css.select(selector, namespaces, limit, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/bs4/css.py", line 162, in select
    self.api.select(
  File "/usr/lib/python3.12/site-packages/soupsieve/__init__.py", line 147, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/soupsieve/__init__.py", line 65, in compile
    return cp._cached_css_compile(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/soupsieve/css_parser.py", line 210, in _cached_css_compile
    ).process_selectors(),
      ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/soupsieve/css_parser.py", line 1138, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/soupsieve/css_parser.py", line 982, in parse_selectors
    has_selector, is_html = self.parse_pseudo_class(sel, m, has_selector, iselector, is_html)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/soupsieve/css_parser.py", line 658, in parse_pseudo_class
    raise SelectorSyntaxError(
soupsieve.util.SelectorSyntaxError: ':text-accent' was detected as a pseudo-class and is either unsupported or invalid. If the syntax was not intended to be recognized as a pseudo-class, please escape the colon.
  line 1:
div.scrollable-panel a.visited:text-accent
                              ^''.
time=" 128.940" type="warning" -> virtual void QtWaylandClient::QWaylandTextInputv3::zwp_text_input_v3_leave(wl_surface*) Got leave event for surface 0x0 focused surface 0x58872874da30
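
The error above asks for the colon in the class name to be escaped. As a minimal sketch (not part of css2rss; the helper name escape_css_class is hypothetical), a class name such as visited:text-accent can be turned into a selector-safe form by putting a backslash in front of every character that is not a plain identifier character. This is a simplification: it does not handle class names that start with a digit.

```python
import re

# Simplified rule: ASCII letters, digits, '-' and '_' are left alone,
# every other character gets a backslash in front of it.
_UNSAFE = re.compile(r"([^A-Za-z0-9_-])")


def escape_css_class(name: str) -> str:
    """Backslash-escape characters that would confuse a CSS selector parser."""
    return _UNSAFE.sub(r"\\\1", name)


# Build the selector from the log, with the class name escaped.
selector = "div.scrollable-panel a." + escape_css_class("visited:text-accent")
print(selector)  # div.scrollable-panel a.visited\:text-accent
```

The printed selector is what css2rss.py needs to receive as its argument, with the backslash still in place.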

Operating system and version

Version: 4.7.4 (built on Linux/x86_64)
Revision: 68c322710-lite
Build date: 9/26/24 7:52 PM
OS: Arch Linux
Qt: 6.7.3 (compiled against 6.7.2)

seventyiris83 commented 1 month ago

try python css2rss.py ".space-x-1 > a" "!One Piece"

hercyle commented 1 month ago

your suggestion didn't get past this error: network: Error when fetching feed: 'Feed::Status::ParsingError' message: 'XML problem: Start tag expected.' so i guess yours was wrong.

if i downgrade to RSS Guard 4.6.3, my command works just fine.

seventyiris83 commented 1 month ago

forgot to mention i'm on 4.7.4 lite, and it works fine for me

hercyle commented 1 month ago

is this what your css2rss installation looks like?

➜  RSS Guard 4
.
├── config
│  ├── config.ini
│  ├── key.private
│  └── QtProject.conf
├── database
│  ├── database.db
│  ├── database.db-v4.bak
│  ├── database.db-v5.bak
│  └── database.db-v7.bak
└── css2rss.py

seventyiris83 commented 1 month ago

yes

hercyle commented 1 month ago

there seemed to be something wrong with my css2rss installation; your command works now, but the original problem remains. i found out that if i try to get rss feeds from this source (NSFW) with:

python css2rss.py "@div.grid > div.flex-row" "h5" "img" ".pb-1 > a[href*=chapter]" "span.text-xxs" "span.flex"

it throws an error about maya:

Traceback (most recent call last):
  File "/home/user/.config/RSS Guard 4/css2rss.py", line 170, in <module>
    import maya
ModuleNotFoundError: No module named 'maya'

and if i remove the 6th argument (item date):

python css2rss.py "@div.grid > div.flex-row" "h5" "img" ".pb-1 > a[href*=chapter]" "span.text-xxs"

it works without issues. the problem is that i'd be leaving out the item dates, which i find rather important. apparently there is an issue with rssguard's own python (?).

i installed maya through pipx:

$ python --version
Python 3.12.6

$ pipx list                        
venvs are in /home/user/.local/pipx/venvs 
apps are exposed on your $PATH at /home/user/.local/bin 
manual pages are exposed at /home/user/.local/share/man 
  package maya 0.6.1, installed using Python 3.12.6 
   - dateparser-download
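
A possible explanation, not confirmed in this thread: pipx installs each package into its own isolated virtual environment (the venvs path in the output above), so a library installed that way is generally not importable from the interpreter that runs css2rss.py. The small check below, run with the same python command used to launch css2rss.py, shows which interpreter is in use and whether it can see maya. The helper name module_available is hypothetical.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the interpreter running this script can import `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Which Python is actually executing the script?
    print("interpreter:", sys.executable)
    # bs4 is used by css2rss.py itself (see the traceback above); maya is imported for item dates.
    for module in ("bs4", "maya"):
        print(module, "importable:", module_available(module))
```

If maya is reported as not importable, installing it into that interpreter's environment rather than through pipx would be the thing to try.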

martinrotter commented 1 month ago

Hi.

A few versions ago, the tokenization of CLI arguments was rewritten from scratch, and it now follows the same semantics as argument handling in Bash.

Some of this information is in the warning boxes in the documentation:

https://rssguard.readthedocs.io/en/stable/features/scraping.html

If you use double quotes, then you have to ESCAPE all special characters inside to make sure they are passed over literally.

If you use single quotes, all inside characters are passed over exactly as written.

In your example, you have "div.scrollable-panel a.visited\:text-accent". Here the escape character \ is swallowed by the tokenization mechanism and the next character is passed over literally, so css2rss receives a.visited:text-accent without the backslash.

If you want to pass \ to css2rss literally, you have to either double-escape it or use single quotes around the whole parameter. Both ways are shown in the screenshots below.
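
To make the quoting rules concrete, here is a toy tokenizer written only from the description in this comment. It is a sketch under that assumption, not RSS Guard's actual code, and the real tokenizer may differ in edge cases. The function name tokenize is hypothetical.

```python
def tokenize(command: str) -> list[str]:
    """Split a command line using the quoting rules described in this thread."""
    tokens: list[str] = []
    current: list[str] = []
    in_token = False
    quote = None  # None, "'" or '"'
    i = 0
    while i < len(command):
        ch = command[i]
        if quote == "'":
            # Single quotes: everything up to the closing quote is literal.
            if ch == "'":
                quote = None
            else:
                current.append(ch)
        elif ch == "\\" and i + 1 < len(command):
            # Unquoted or double-quoted: the backslash is dropped and
            # the next character is kept literally.
            i += 1
            current.append(command[i])
            in_token = True
        elif quote == '"':
            if ch == '"':
                quote = None
            else:
                current.append(ch)
        elif ch in "'\"":
            quote = ch
            in_token = True
        elif ch.isspace():
            # Unquoted whitespace ends the current token.
            if in_token:
                tokens.append("".join(current))
                current = []
                in_token = False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if quote is not None:
        raise ValueError("unterminated quote")
    if in_token:
        tokens.append("".join(current))
    return tokens


print(tokenize(r'"a.visited\:text-accent"'))   # backslash swallowed
print(tokenize(r'"a.visited\\:text-accent"'))  # double escape keeps one backslash
print(tokenize(r"'a.visited\:text-accent'"))   # single quotes keep it as written
```

Under this model, only the second and third forms deliver a.visited\:text-accent to css2rss, which matches the two fixes described above.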

@Owyn You can perhaps update your documentation to reflect this.

[Two screenshots: the double-escaped variant and the single-quoted variant of the parameter]

Owyn commented 1 month ago

this seems like a change that breaks backward compatibility...

You can perhaps update your documentation to reflect this.

👍🏻 done