medialab / minet

A webmining CLI tool & library for python.
GNU General Public License v3.0

Errors to investigate #724

Open 16arpi opened 1 year ago

16arpi commented 1 year ago

The following URLs triggered errors during webmining.

URLs extracted from error

max-redirects

infinite-redirects

read-timeout

unknown-host

invalid-redirect

connection-aborted

invalid-url

connection-refused

self-redirect

connect-timeout

ssl

no-route-to-host

connection-error

invalid-gzip

URLs extracted from extract_error

invalid-status

no-result

errored

invalid-mimetype

file-not-found

trafilatura-error
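
To reproduce this kind of grouping from a report file, here is a minimal sketch. It assumes the report is a CSV with `error` and `extract_error` columns (the two names used in the headings above) plus a `url` column, which is an assumption. The helper `group_urls_by_error` and the sample rows are hypothetical and are not part of minet.

```python
import csv
import io
from collections import defaultdict


def group_urls_by_error(rows, error_column="error", url_column="url"):
    """Group URLs by the error code found in `error_column`.

    Hypothetical helper: the column names are assumptions about the report.
    Rows with an empty error cell are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        code = (row.get(error_column) or "").strip()
        if code:
            groups[code].append(row[url_column])
    return dict(groups)


# Made-up sample data standing in for a real report.
sample = io.StringIO(
    "url,error,extract_error\n"
    "http://a.example/,read-timeout,\n"
    "http://b.example/,,no-result\n"
    "http://c.example/,read-timeout,\n"
)
rows = list(csv.DictReader(sample))

by_fetch_error = group_urls_by_error(rows, "error")
by_extract_error = group_urls_by_error(rows, "extract_error")
```

Keeping the two columns separate mirrors the split above between fetch errors and extraction errors.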

Yomguithereal commented 1 year ago

Some invalid redirects are now handled through stateful redirection.

Yomguithereal commented 1 year ago

https://www.ippmedia.com:/en/news/govt-takes-steps-ensure-availability-fertiliser-farmers redirects to a somewhat illegal URL with an empty port (a `:` after the host with no port number).
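
For reference, RFC 3986 allows an empty port syntactically and recommends omitting the dangling `:` when normalizing. Below is a minimal sketch of that normalization using only the standard library. `strip_empty_port` is a hypothetical helper, not a minet function, and this is not necessarily how minet handles the case.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_empty_port(url):
    """Drop a trailing ':' from the authority when no port number follows it.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    # An empty port shows up as an authority ending with a bare colon,
    # e.g. "www.ippmedia.com:" (or "[::1]:" for a bracketed IPv6 host).
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit(parts._replace(netloc=netloc))


fixed = strip_empty_port(
    "https://www.ippmedia.com:/en/news/"
    "govt-takes-steps-ensure-availability-fertiliser-farmers"
)
```

URLs with an explicit port, such as `https://example.com:8080/a`, pass through unchanged.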