j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Fingerprint match could double the output #339

Open HolgerAusB opened 12 months ago

HolgerAusB commented 12 months ago

@j0k3r (@fivefilters),

While fixing https://github.com/wallabag/wallabag/issues/7013 I got some weird output from wallabag.

When there is a .googleblog.com.txt (with the same content as .blogspot.com.txt), wallabag extracts the same article twice into the same wallabag item. So I needed to remove the *.googleblog.com.txt files from the repo. But that could be bad for projects that do not use fingerprinting.

The following links are recognized by wallabag/graby via fingerprint as blogspot.com:

https://android-developers.googleblog.com/2017/08/introducing-android-8-oreo.html
https://security.googleblog.com/2023/09/scaling-rust-adoption-through-training.html
https://webmasters.googleblog.com/2016/08/helping-users-easily-access-content-on.html
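To make the suspected mechanism concrete, here is a minimal Python sketch. It is not Graby's actual code. The rule string, the `SITE_CONFIGS` table, and the helpers `host_config_name` and `collect_rules` are all hypothetical, and the file-naming scheme is only assumed from the file names mentioned above. It models one way the symptom could arise: if the host lookup and the fingerprint lookup each load a config with the same rules, and the rules are simply concatenated, every body rule runs twice.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for two site config files with identical content.
# "BODY_RULE" is a placeholder, not the real rule from either file.
SITE_CONFIGS = {
    ".blogspot.com.txt": ["BODY_RULE"],
    ".googleblog.com.txt": ["BODY_RULE"],
}


def host_config_name(url):
    """Return the wildcard config file name for the URL's parent domain.

    Assumed naming scheme: "<sub>.googleblog.com" -> ".googleblog.com.txt".
    """
    host = urlparse(url).hostname
    parent = host.split(".", 1)[1]
    return "." + parent + ".txt"


def collect_rules(url, fingerprint_host, dedupe):
    """Gather body rules from the host config and the fingerprint config."""
    names = [host_config_name(url), "." + fingerprint_host + ".txt"]
    rules = []
    for name in names:
        for rule in SITE_CONFIGS.get(name, []):
            if dedupe and rule in rules:
                continue  # skip a rule that was already collected
            rules.append(rule)
    return rules


url = "https://android-developers.googleblog.com/2017/08/introducing-android-8-oreo.html"
naive = collect_rules(url, "blogspot.com", dedupe=False)
deduped = collect_rules(url, "blogspot.com", dedupe=True)
```

In this toy model `naive` holds the same rule twice, which would yield the article body twice, while `deduped` holds it once. If something like this is what happens, de-duplicating the merged rules would avoid the double output without removing the *.googleblog.com.txt files.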

EDIT: No double output with FTR, by the way