Keep h1 and other headings

j0k3r / php-readability

A fork of https://bitbucket.org/fivefilters/php-readability

Apache License 2.0

168 stars 36 forks source link

Keep h1 and other headings #75

Open Kdecherf opened 2 years ago

Kdecherf commented 2 years ago

Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side.

Fixes https://github.com/wallabag/wallabag/issues/5805

Kdecherf commented 2 years ago

Request for comment @j0k3r @jtojnar

jtojnar commented 2 years ago

I think this sort of clean-up has some merit. Maybe we could only decide to clean out the h1 if there is only a single one, like we do with h2. And only clean up h2 if there is no h1?

Kdecherf commented 2 years ago

@jtojnar done

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 2583510249

4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.2%) to 84.669%

Totals
Change from base Build 2486672970:	0.2%
Covered Lines:	602
Relevant Lines:	711

💛 - Coveralls

Kdecherf commented 2 years ago

I'm currently having a second though about this cleanup.

Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor The article only contains one h2 entity and is less than 100 characters (« China's plans to curb its carbon emissions »), thus it is removed by the routine. However this heading has its importance in this context.

What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"?

jtojnar commented 2 years ago

Similarity would make sense but then we would need to decide on the precise metric.

Another possible heuristic would be checking if the heading is the first element in the content. Then it would spuriously preserve the heading in the case of <summary> <heading> <actual content> but that would resolve itself if we drop the summary (which is orthogonal to this). And under-removal is probably better than over-removal anyway.

j0k3r commented 2 years ago

Another possible heuristic would be checking if the heading is the first element in the content

Just checking if the first child of the content is h*, right?

jtojnar commented 2 years ago

Right, that is what I meant.

Kdecherf commented 1 year ago

I'm trying to resume work on this PR.

I've made a small check on entries hosted on my instance, using a query like select id, substring(content FOR 250) from wallabag_entry where substring(content FOR 15) ilike '<h_>%' limit 5;.

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

https://futurism.com/the-byte/china-ai-prosecutor-crimes (h2 as a subtitle)
https://www.theregister.com/2019/10/23/ai_dataset_imagenet_consent/ (h2 as a subtitle)
https://affordance.framasoft.org/2022/03/par-dela-like-colere/ (h2)

What should we do?

jtojnar commented 1 year ago

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

Would not those articles still begin with h1 pre-cleanup? In that case, the second heuristic would only remove the h1 since it is the first heading in the content.

The third one would not be matched by the first heuristic.

Ideally, we would combine both.

We should add a test suite for all these cases so that we can better discuss what is happening.