wallabag / wallabag

wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
https://wallabag.org
MIT License
10.54k stars 772 forks source link

Wrong display in wallabag (infoservice.heise.de) #7700

Open wallabag-bot opened 1 month ago

wallabag-bot commented 1 month ago

Sent by Strubbl and automatically created by email


https://infoservice.heise.de/html_mail.jsp?params=AqwgdYQb8Tsjf1tuACUicoAFCwpd6BATNwsMs0IucsqNpq1dSkNAX4NcGAT3rP5LK63JX1a3rG3XXOgLfxSi2Y8S6lsRHabLZ2bqsDEi5s4%3D

e.g. images are missing

HolgerAusB commented 1 month ago

@Strubbl the link is not valid, maybe a personal one? Maybe a Log-In needed?

I got: 'Sorry, there was an error while processing your request.'

Strubbl commented 1 month ago

Thank you, the link should work now

HolgerAusB commented 1 month ago

@Strubbl. That is ugly! Whole site organized in tables. Will see if this will work. Do you have more article-URLs to cross-check

<body>
  <table>
    <tbody>
      <tr>
        <td>
          <div>
            <table>
              <tbody>
                <tr>
                  <td> <table+tr+td> first paragraph </td>
                </tr>
                <tr>
                  <td> <table+tr+td> second paragraph </td>
                </tr>
Strubbl commented 1 month ago

No, unfortunately not. We'll have to wait until next week Friday

HolgerAusB commented 1 month ago

heise.de is the publisher with the best German computer magazine. But their newsletter software is terrible. You dont see images, because they are using illegal <img src="..."></img> instead of <img src="..." />. Hm, I think it's wrong, or?

When fetching with wallabag UI all images are missing. I think wallabag strips those empty img nodes. When using wallabagger the images are in, but with large spaces between paragraphs. At the moment I don't see an anchor, where I can strip those.

As this is just the browser-version of an email-newsletter, I will not upload a config to the repo. To strip some part from the 'header', you can start with this config:

body: //body/table[1]//table[1]
strip: //a[@class='_browserlink']/ancestor::td[1]

By the way, Fulltext-RSS shows a much nicer excerpt. And pushing this with wallabagger to wallabag the spaces are still to big but a bit less than the direct wallabagger fetch:

https://ftr.fivefilters.net/

Strubbl commented 1 month ago

I think there was an option like tidy to not strip empty nodes.

HolgerAusB commented 1 month ago

I tried all combinations of prune and tidy, nothing helped. That empty-node-strip is IMHO not configurable. And I think it is done before the config is parsed/used.

Strubbl commented 1 month ago

Did you also try to replace closing img tag with nothing?

HolgerAusB commented 1 month ago

yes I tried. But as I wrote: IMHO the strip-empty-node thing is done before anything from config is performed.

you may try to comment-out line 329-330 of vendor/j0k3r/graby/src/Graby.php just for testing:

// Remove empty nodes (except iframe)
  $re = '/<(?!iframe)([^>\s]+)[^>]*>(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;|\s)*<\/\1>/m';
  $html = preg_replace($re, '', (string) $htmlCleaned);
HolgerAusB commented 1 month ago

or changing just the first line to not stripping img besides iframe:

$re = '/<(?!(iframe|img))([^>\s]+)[^>]*>(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;|\s)*<\/\1>/m';
Strubbl commented 1 month ago

For me it says "Eintrag neugeladen, aber das Abrufen des Inhalts ist fehlgeschlagen". So not even content fetching works with a custom config.

edit: Ok, now i also have the img fetching problem.

Strubbl commented 1 month ago

or changing just the first line to not stripping img besides iframe:

$re = '/<(?!(iframe|img))([^>\s]+)[^>]*>(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;|\s)*<\/\1>/m';

Yes, this works. But i do not get why the img tag gets removed, because not all images have the closing tag Ok, then this is a graby bug that img tag get stripped?

Strubbl commented 1 month ago

The graby version in wallabg is also older than the current master, where also td and th are not stripped: https://github.com/j0k3r/graby/blob/1281bf3d7045d2f2682d1af6ba3715e492184e9a/src/Graby.php#L277

Strubbl commented 1 month ago

this is my current progress for infoservice.heise.de.txt along with the Graby.php change you proposed:

find_string: <table
replace_string: <span

find_string: </table>
replace_string: </span>

find_string: <tr>
replace_string: 

find_string: </tr>
replace_string: 

find_string: <td
replace_string: <p

find_string: </td>
replace_string: </p>

find_string: <tbody>
replace_string: 

find_string: </tbody>
replace_string: 

test_url: https://infoservice.heise.de/html_mail.jsp?params=AqwgdYQb8Tsjf1tuACUicoAFCwpd6BATNwsMs0IucsqNpq1dSkNAX4NcGAT3rP5LK63JX1a3rG3XXOgLfxSi2Y8S6lsRHabLZ2bqsDEi5s4%3D
HolgerAusB commented 1 month ago

Drastic! but universal. You should strip the test_url, if you want to PR this. Because that looks like, as it is personalized for you as an individual.

I had another idea, for self-hosters only, to 'remove' the gaps between paragraphs.

infoservice.heise.de.txt:

body: //body/table[1]//table[1]
strip: //a[@class='_browserlink']/ancestor::td[1]
replace_string(<td): <td class="heise_nl"

test_url: https://example.com/link-should-not-named-here/because-of-personalized-newsletter-links

web/custom.css (create, if not existing):

td.heise_nl {
    padding: 1px 0px 1px 0px !important;
}

close and restart browser

Yes, this works. But i do not get why the img tag gets removed, because not all images have the closing tag Ok, then this is a graby bug that img tag get stripped?

I wouldn't call it a bug on graby's side. img is not defined with a closing tag. And it does not make sense to use a pair of opening/closing tags with no content between. <div></div> or <td></td> is useless, so wallabag is stripping those, to save storage space and speed up following modifications via config.

But wallabag is proving emptiness on all nodes but iframe. And I agree that img should be excepted here too. Yes , </img> is not standard, but it is used by websites, which always causes problems with wallabag. And many sites don't want there content to be archived by users. So they might be using things like this and other things, ignored by browsers but messing up graby (e.g. <span> stripping).

So I think, that graby shouldn't be so greedy here. @j0k3r, what do you think?

Strubbl commented 1 month ago

You should strip the test_url, if you want to PR this. Because that looks like, as it is personalized for you as an individual.

Maybe, but it was mentioned in weeklyOSM: https://weeklyosm.eu/de/archives/17520#wn741_31473 and thus it's not personalized to me at least.