j0k3r / php-readability

A fork of https://bitbucket.org/fivefilters/php-readability
Apache License 2.0

Unexpected title cleaning #63

Open Simounet opened 3 years ago

Simounet commented 3 years ago

Hi there, I don't understand why we strip the part of the title that comes before the : character. It could be legitimate content. https://github.com/j0k3r/php-readability/blob/9a490fac078b0f773c9848af1c6d76336a073a8d/src/Readability.php#L850

j0k3r commented 3 years ago

I agree it could be legitimate content, but I guess that in most cases the text before : is the website name.

It has been there since the beginning: https://bitbucket.org/fivefilters/php-readability/src/5112edb387b53931ab9324b890fa581c0e951d2d/Readability.php#lines-260

Simounet commented 3 years ago

I understand, but do you know many sites using this pattern? I don't. If we follow this rule, we should do the same with - and |. Sometimes the site name is at the beginning, sometimes at the end. It's hard to tell. I think we should remove this, or at least make it possible to bypass this condition.
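
To make the bypass idea concrete, here is a minimal sketch in PHP. It is not the library's code: cleanTitle() and the $stripSitePrefix flag are hypothetical names, and the regex only approximates the rule discussed above (drop everything up to the last colon). Falling back to the full title when fewer than three words remain is also an assumption made for this example.

```php
<?php

/**
 * Hypothetical helper, NOT part of php-readability.
 *
 * Strips a leading "Site name: " prefix from a page title unless the
 * caller opts out through $stripSitePrefix.
 */
function cleanTitle(string $title, bool $stripSitePrefix = true): string
{
    $title = trim($title);

    // Bypass: the caller disabled the rule, or there is no ": " to act on.
    if (!$stripSitePrefix || false === strpos($title, ': ')) {
        return $title;
    }

    // Keep only what follows the last colon (the leading .* is greedy).
    $candidate = trim((string) preg_replace('/.*:(.*)/', '$1', $title));

    // Assumption for this sketch: if fewer than three words remain,
    // the prefix was probably real content, so keep the full title.
    return count(explode(' ', $candidate)) < 3 ? $title : $candidate;
}

// Usage
echo cleanTitle('Example Site: How to parse HTML safely'), PHP_EOL;        // How to parse HTML safely
echo cleanTitle('Example Site: How to parse HTML safely', false), PHP_EOL; // title left unchanged
```

With a flag like this the current behaviour could stay the default, while callers who know their titles contain meaningful colons could turn it off.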