internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

-50 NOTCRAWLED https://podcrto.si/ #258

Open mkragelj opened 5 years ago

mkragelj commented 5 years ago

Hi,

I am trying to harvest this site: https://podcrto.si. As a national library, we harvest several domains to preserve their content. I tried with Heritrix 1.14.4 and 3.4, but without success.

I'm getting this:

[code] [status] [seed] [redirect]
-50 NOTCRAWLED https://podcrto.si/
200 CRAWLED https://e-uprava.gov.si/

and

LONGEST#2:
Queue si,podcrto, (p3)
  2 items
  wakes in: 13m45s77ms
  last enqueued: https://podcrto.si/robots.txt
  last peeked: https://podcrto.si/robots.txt
  total expended: 15 (total budget: -1)
  active balance: 2985
  last(avg) cost: 1(1)
  totalScheduled fetchSuccesses fetchFailures fetchDisregards fetchResponses robotsDenials successBytes totalBytes fetchNonResponses lastSuccessTime
  3 1 0 0 1 0 54 54 16 2019-05-23T07:14:29.825Z
  SimplePrecedenceProvider
  3
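
The counter names and values in the queue report above are hard to match up by eye, so here is a minimal Python sketch that pairs them. The two strings are copied from the report; the helper name `pair_counters` is just illustrative and not part of Heritrix.

```python
# Counter names and values copied verbatim from the queue report above.
header = (
    "totalScheduled fetchSuccesses fetchFailures fetchDisregards "
    "fetchResponses robotsDenials successBytes totalBytes "
    "fetchNonResponses lastSuccessTime"
)
values = "3 1 0 0 1 0 54 54 16 2019-05-23T07:14:29.825Z"


def pair_counters(header_line, value_line):
    """Pair whitespace-separated counter names with their values.

    Hypothetical helper for reading the report; not a Heritrix API.
    """
    names = header_line.split()
    vals = value_line.split()
    if len(names) != len(vals):
        raise ValueError("header and value counts differ")
    # Numeric counters become ints; the timestamp stays a string.
    return {n: (int(v) if v.isdigit() else v) for n, v in zip(names, vals)}


stats = pair_counters(header, values)
for name, value in stats.items():
    print(f"{name:>18}: {value}")
```

Paired this way, the report lists fetchResponses as 1 and fetchNonResponses as 16 for this queue, with robotsDenials at 0.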

Can anyone help or explain what could be the reason for this? Thank you in advance.

Best, Matjaž

mkragelj commented 5 years ago

It works with WCT 2.01.

Matjaž