Open Kdecherf opened 2 years ago
Request for comment @j0k3r @jtojnar
I think this sort of clean-up has some merit. Maybe we could only decide to clean out the h1
if there is only a single one, like we do with h2
. And only clean up h2
if there is no h1
?
@jtojnar done
Totals | |
---|---|
Change from base Build 2486672970: | 0.2% |
Covered Lines: | 602 |
Relevant Lines: | 711 |
I'm currently having a second though about this cleanup.
Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor The article only contains one h2 entity and is less than 100 characters (« China's plans to curb its carbon emissions »), thus it is removed by the routine. However this heading has its importance in this context.
What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"?
Similarity would make sense but then we would need to decide on the precise metric.
Another possible heuristic would be checking if the heading is the first element in the content. Then it would spuriously preserve the heading in the case of <summary> <heading> <actual content>
but that would resolve itself if we drop the summary (which is orthogonal to this). And under-removal is probably better than over-removal anyway.
Another possible heuristic would be checking if the heading is the first element in the content
Just checking if the first child of the content is h*, right?
Right, that is what I meant.
I'm trying to resume work on this PR.
I've made a small check on entries hosted on my instance, using a query like select id, substring(content FOR 250) from wallabag_entry where substring(content FOR 15) ilike '<h_>%' limit 5;
.
It seems that there are several cases where the content begins with a legitimate heading entity, for example:
What should we do?
It seems that there are several cases where the content begins with a legitimate heading entity, for example:
Would not those articles still begin with h1
pre-cleanup? In that case, the second heuristic would only remove the h1
since it is the first heading in the content.
The third one would not be matched by the first heuristic.
Ideally, we would combine both.
We should add a test suite for all these cases so that we can better discuss what is happening.
Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side.
Fixes https://github.com/wallabag/wallabag/issues/5805