j0k3r / php-readability

A fork of https://bitbucket.org/fivefilters/php-readability
Apache License 2.0

Issue with one URL and his content #78

Closed hstanleycrow closed 1 year ago

hstanleycrow commented 2 years ago

Hi, I have been using this code for many days to extract content from a lot of websites, but today I found one that causes a problem. The URL is https://www.searchmetrics.com/glossary/ranking-factor/ Readability extracts the text starting from the middle of the page and ignores all the text above it. The extraction starts at this text: "In the graph, four example correlations and the respective curves are shown." This specific line gets the highest score, so I am not sure what I can do. Any ideas? Thank you very much.

Kdecherf commented 2 years ago

Hi @hstanleycrow,

You may want to take a look at https://github.com/j0k3r/graby, which uses php-readability under the hood but adds what we call "siteconfig" files [1] to change the way the tool extracts the content.
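
For illustration, a siteconfig file for this site might look roughly like the sketch below. The directive names (`title`, `body`, `strip_id_or_class`, `test_url`) come from the ftr-site-config format, but the file name and every XPath expression and class name here are assumptions, not taken from the actual page markup. They would need to be checked against the page's real HTML before use.

```text
# Hypothetical file: searchmetrics.com.txt
# All XPath values and class names below are placeholders, not verified against the page.

# Take the title from the main heading instead of guessing it.
title: //h1

# Force the extractor to use this container as the article body,
# rather than the highest-scoring block picked by Readability.
body: //div[contains(@class, 'entry-content')]

# Drop elements whose id or class contains this string.
strip_id_or_class: sidebar

# URL used to check that the config still works.
test_url: https://www.searchmetrics.com/glossary/ranking-factor/
```

With a `body` rule in place, the content comes from the selected container as a whole, so the text above the highest-scoring paragraph should no longer be dropped.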

hstanleycrow commented 2 years ago

Thank you very much for your help.

Footnotes

1. https://github.com/fivefilters/ftr-site-config