Open wallabag-bot opened 1 month ago
@Strubbl the link is not valid, maybe a personal one? Maybe a Log-In needed?
I got: 'Sorry, there was an error while processing your request.'
Thank you, the link should work now
@Strubbl. That is ugly! Whole site organized in tables. Will see if this will work. Do you have more article-URLs to cross-check
<body>
<table>
<tbody>
<tr>
<td>
<div>
<table>
<tbody>
<tr>
<td> <table+tr+td> first paragraph </td>
</tr>
<tr>
<td> <table+tr+td> second paragraph </td>
</tr>
No, unfortunately not. We'll have to wait until next week Friday
heise.de is the publisher with the best German computer magazine. But their newsletter software is terrible. You dont see images, because they are using illegal <img src="..."></img>
instead of <img src="..." />
. Hm, I think it's wrong, or?
When fetching with wallabag UI all images are missing. I think wallabag strips those empty img nodes. When using wallabagger the images are in, but with large spaces between paragraphs. At the moment I don't see an anchor, where I can strip those.
As this is just the browser-version of an email-newsletter, I will not upload a config to the repo. To strip some part from the 'header', you can start with this config:
body: //body/table[1]//table[1]
strip: //a[@class='_browserlink']/ancestor::td[1]
By the way, Fulltext-RSS shows a much nicer excerpt. And pushing this with wallabagger to wallabag the spaces are still to big but a bit less than the direct wallabagger fetch:
I think there was an option like tidy to not strip empty nodes.
I tried all combinations of prune and tidy, nothing helped. That empty-node-strip is IMHO not configurable. And I think it is done before the config is parsed/used.
Did you also try to replace closing img tag with nothing?
yes I tried. But as I wrote: IMHO the strip-empty-node thing is done before anything from config is performed.
you may try to comment-out line 329-330 of vendor/j0k3r/graby/src/Graby.php
just for testing:
// Remove empty nodes (except iframe)
$re = '/<(?!iframe)([^>\s]+)[^>]*>(?:<br \/>| | | | | | | |\s)*<\/\1>/m';
$html = preg_replace($re, '', (string) $htmlCleaned);
or changing just the first line to not stripping img besides iframe:
$re = '/<(?!(iframe|img))([^>\s]+)[^>]*>(?:<br \/>| | | | | | | |\s)*<\/\1>/m';
For me it says "Eintrag neugeladen, aber das Abrufen des Inhalts ist fehlgeschlagen". So not even content fetching works with a custom config.
edit: Ok, now i also have the img fetching problem.
or changing just the first line to not stripping img besides iframe:
$re = '/<(?!(iframe|img))([^>\s]+)[^>]*>(?:<br \/>| | | | | | | |\s)*<\/\1>/m';
Yes, this works. But i do not get why the img tag gets removed, because not all images have the closing tag
Ok, then this is a graby bug that img tag get stripped?
The graby version in wallabg is also older than the current master, where also td and th are not stripped: https://github.com/j0k3r/graby/blob/1281bf3d7045d2f2682d1af6ba3715e492184e9a/src/Graby.php#L277
this is my current progress for infoservice.heise.de.txt
along with the Graby.php change you proposed:
find_string: <table
replace_string: <span
find_string: </table>
replace_string: </span>
find_string: <tr>
replace_string:
find_string: </tr>
replace_string:
find_string: <td
replace_string: <p
find_string: </td>
replace_string: </p>
find_string: <tbody>
replace_string:
find_string: </tbody>
replace_string:
test_url: https://infoservice.heise.de/html_mail.jsp?params=AqwgdYQb8Tsjf1tuACUicoAFCwpd6BATNwsMs0IucsqNpq1dSkNAX4NcGAT3rP5LK63JX1a3rG3XXOgLfxSi2Y8S6lsRHabLZ2bqsDEi5s4%3D
Drastic! but universal. You should strip the test_url, if you want to PR this. Because that looks like, as it is personalized for you as an individual.
I had another idea, for self-hosters only, to 'remove' the gaps between paragraphs.
infoservice.heise.de.txt
:
body: //body/table[1]//table[1]
strip: //a[@class='_browserlink']/ancestor::td[1]
replace_string(<td): <td class="heise_nl"
test_url: https://example.com/link-should-not-named-here/because-of-personalized-newsletter-links
web/custom.css
(create, if not existing):
td.heise_nl {
padding: 1px 0px 1px 0px !important;
}
close and restart browser
Yes, this works.
But i do not get why the img tag gets removed, because not all images have the closing tagOk, then this is a graby bug that img tag get stripped?
I wouldn't call it a bug on graby's side. img is not defined with a closing tag. And it does not make sense to use a pair of opening/closing tags with no content between. <div></div>
or <td></td>
is useless, so wallabag is stripping those, to save storage space and speed up following modifications via config.
But wallabag is proving emptiness on all nodes but iframe. And I agree that img should be excepted here too.
Yes , </img>
is not standard, but it is used by websites, which always causes problems with wallabag. And many sites don't want there content to be archived by users. So they might be using things like this and other things, ignored by browsers but messing up graby (e.g. <span>
stripping).
So I think, that graby shouldn't be so greedy here. @j0k3r, what do you think?
You should strip the test_url, if you want to PR this. Because that looks like, as it is personalized for you as an individual.
Maybe, but it was mentioned in weeklyOSM: https://weeklyosm.eu/de/archives/17520#wn741_31473 and thus it's not personalized to me at least.
Sent by Strubbl and automatically created by email
https://infoservice.heise.de/html_mail.jsp?params=AqwgdYQb8Tsjf1tuACUicoAFCwpd6BATNwsMs0IucsqNpq1dSkNAX4NcGAT3rP5LK63JX1a3rG3XXOgLfxSi2Y8S6lsRHabLZ2bqsDEi5s4%3D
e.g. images are missing