wallabag / wallabag

wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
https://wallabag.org
MIT License

Article containing multiple h1 is not saved properly (missing said h1's) #5095

Closed Crocmagnon closed 1 year ago

Crocmagnon commented 3 years ago

Environment

My app/config/parameters.yml is:

```yaml
# This file is auto-generated during the composer install
parameters:
    database_driver: pdo_sqlite
    database_host: 127.0.0.1
    database_port: null
    database_name: symfony
    database_user: root
    database_password: null
    database_path: '%kernel.root_dir%/../data/db/wallabag.sqlite'
    database_table_prefix: wallabag_
    database_socket: null
    database_charset: utf8
    domain_name: 'https://wb.example.com'
    mailer_transport: smtp
    mailer_user: wallabag@example.com
    mailer_password: password
    mailer_host: smtp.example.com
    mailer_port: 587
    mailer_encryption: null
    mailer_auth_mode: null
    locale: fr
    secret: secret
    twofactor_auth: true
    twofactor_sender: no-reply@wallabag.org
    fosuser_registration: false
    fosuser_confirmation: true
    fos_oauth_server_access_token_lifetime: 3600
    fos_oauth_server_refresh_token_lifetime: 1209600
    from_email: wallabag@example.com
    rss_limit: 50
    rabbitmq_host: localhost
    rabbitmq_port: 5672
    rabbitmq_user: guest
    rabbitmq_password: guest
    rabbitmq_prefetch_count: 10
    redis_scheme: tcp
    redis_host: redis
    redis_port: 6379
    redis_path: null
    redis_password: null
    sentry_dsn: null
    server_name: 'Your wallabag instance'
```

What steps will reproduce the bug?

  1. Add the following article : https://dafoster.net/articles/2021/02/16/building-web-apps-with-vue-and-django-the-ultimate-guide/
  2. Notice that the <h1> tags inside the article are not saved in wallabag's DB, like "1 server or 2 servers?", or "1-server approach".
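
To check the reproduction steps above on any page, a small script can list the `<h1>` texts of the source HTML that no longer appear in the saved copy. This is only a sketch using Python's standard `html.parser`; the names `H1Collector`, `h1_texts` and `missing_h1` are made up for illustration, and how you obtain the saved content from wallabag is left open. The substring check is crude (it ignores entity encoding and whitespace changes).

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._depth = 0  # > 0 while inside an <h1>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._depth += 1
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.headings[-1] += data


def h1_texts(html):
    """Return the stripped text of each <h1> in document order."""
    parser = H1Collector()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headings]


def missing_h1(source_html, saved_html):
    """Return <h1> texts from the source that do not occur in the saved HTML."""
    return [h for h in h1_texts(source_html) if h not in saved_html]
```

Calling `missing_h1(original_page, saved_content)` with the two HTML strings returns the headings that were dropped, in document order.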

anarcat commented 2 years ago

i also see this, it's as if wallabag purposefully just deletes H1 headings, which is really confusing in most articles.

anarcat commented 2 years ago

in fact, from what i can tell, the problem happens specifically when the first heading in a document is a <h2> followed by a <h1>.

update: never mind that, i get the <h1> problem regardless of the structure of the document

amberin commented 1 year ago

Yes, I don't believe this should be tagged as a "site config". This happens consistently everywhere for me. It's very confusing and it's really preventing me from using Wallabag.

Crocmagnon commented 1 year ago

It may fix the issue on this site, but it was merely used as an example. I believe we should find a more durable fix and maybe consider all subsequent h1 as h2 for example?
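
The idea of treating every subsequent h1 as an h2 could look roughly like the sketch below. It is written in Python with a regex purely for illustration and is not how graby or php-readability work; a real fix would more likely operate on the parsed DOM. The function name `demote_extra_h1` is hypothetical.

```python
import re

# Matches opening and closing <h1> tags and captures any attributes.
H1_TAG = re.compile(r"<(/?)h1\b([^>]*)>", re.IGNORECASE)


def demote_extra_h1(html):
    """Keep the first <h1> element and rewrite every later one as <h2>."""
    opened = 0  # number of opening <h1> tags seen so far

    def swap(match):
        nonlocal opened
        closing, attrs = match.group(1), match.group(2)
        if not closing:
            opened += 1
        # The first <h1> ... </h1> pair is left untouched.
        if opened <= 1:
            return match.group(0)
        return f"<{closing}h2{attrs}>"

    return H1_TAG.sub(swap, html)
```

With this approach the section headings survive as `<h2>` instead of being dropped, at the cost of flattening whatever hierarchy the author intended between h1 and h2.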

HolgerAusB commented 1 year ago

I am with you, @Crocmagnon, but wallabag/graby is designed to get the most out of an article even without any config file. In this case, even if prune: no had been the default, the h1 were missing. It was only the combination with my body selector that brought them back, and I don't even know why.

But I agree that <h1> or <span> should not be removed automatically.

And it drives me crazy sometimes that some articles work without any configuration in @fivefilter's FulltextRSS (FTR), while @j0k3r's graby4wallabag/f43.me only gets an error message or fetches too much or too little from the source site, or vice versa.

Or that a site_config works with one product perfectly while for the other more tweaks or a complete rewrite is necessary.

It would be very helpful if graby and FTR used the same engine and presets, or even gave us some more options for the configs:

# valid for both
prune: no
include: different.example.com.txt
replace_regex: / '(search.*) term' / 'new $1 term' /

[FTR]
include: custom/cookie.example.com.txt  # for sending auth-cookies for paywall
body: //article
...

[graby]
body: //div[@id='main']
...

[wallabag]
include [graby]   # this should be standard without writing. 
include ![graby]  # to not use the graby section for wallabag
# credentialstuff
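
To make the proposal above concrete, here is a rough Python sketch of how a tool could resolve such a sectioned file: shared directives first, then the ones from its own section. Neither graby nor FTR supports this format today; the section syntax, the comment handling, and the names `parse_sections` and `effective_config` are all assumptions. The `include [graby]` / `include ![graby]` cross-section directives are left out.

```python
import re

# A section header is a line consisting only of "[name]".
SECTION = re.compile(r"^\[(\w+)\]$")


def parse_sections(text):
    """Split a config into {section_name: [(key, value), ...]}.

    Directives before the first [section] header go under the key None.
    """
    sections = {None: []}
    current = None
    for raw in text.splitlines():
        # Drop comments: a full-line '#' or a '#' preceded by whitespace.
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()
        if not line:
            continue
        header = SECTION.match(line)
        if header:
            current = header.group(1)
            sections.setdefault(current, [])
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore anything that is not a "key: value" directive
            sections[current].append((key.strip(), value.strip()))
    return sections


def effective_config(text, tool):
    """Shared directives first, then those from the tool's own section."""
    sections = parse_sections(text)
    return sections[None] + sections.get(tool, [])
```

A tool that has no section of its own simply gets the shared directives, so existing single-section site configs would keep working unchanged.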

j0k3r commented 1 year ago

Something was started by @Kdecherf: https://github.com/j0k3r/php-readability/pull/75