internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

HTTP response only results in garbage bytes #206

Closed bitsgalore closed 6 years ago

bitsgalore commented 6 years ago

I'm trying to run the latest Heritrix build (build heritrix-3.3.0-20180529.100446-105-dist.tar.gz which I downloaded here) for some tests.

I try to start Heritrix with the command below:

~/heritrix-3.3.0-SNAPSHOT/bin/heritrix -a foo

This works, but when I open http://localhost:8443/ in my browser (Firefox), it only shows 6 garbled characters (Chromium returns an ERR_INVALID_HTTP_RESPONSE error). Saving the page and opening it in a hex editor shows these 7 bytes:

15 03 03 00 02 02 0A

Some info on Java on my system:

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

Here's the Heritrix log file:

Thu Jun 21 13:15:54 CEST 2018 Starting heritrix
Linux johan-HP-ProBook-640-G1 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
JAVA_OPTS= -Xmx256m
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Oracle Corporation OpenJDK Runtime Environment 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore adhoc.keystore -destkeystore adhoc.keystore -deststoretype pkcs12".
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:55:BA:62:92:98:5A:DB:26:1B:08:70:D8:90:5D:9C:F3:A4:E7:BF:81
Verify in browser before accepting exception.
2018-06-21 11:15:55.239 WARNING thread-1 org.archive.crawler.framework.Engine.findJobConfigs() invalid job directory: ./jobs/.gitignore where job expected from: ./jobs/.gitignore
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 8443
operator login set per command-line
NOTE: We recommend a longer, stronger password, especially if your web 
interface will be internet-accessible.
Heritrix version: 3.3.0-SNAPSHOT-2018-05-29T09:43:19Z

The log contains a number of warnings, but I have no idea if they are related to this.

Perhaps I'm doing something wrong myself (this is my first attempt at installing and running Heritrix). Anyway, if anyone could give me a hint on how to make this work, that would be really helpful. (Side note: I initially tried the "stable" 3.2 release, but gave up on that because of the dependency on Java 7.)

anjackson commented 6 years ago

You need to go to https://localhost:8443/ because it's only accessible over SSL. Not sure if there's an elegant way to handle this and bounce users to HTTPS automatically?
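
For anyone who lands here with the same symptom: the seven bytes quoted in the report are consistent with a TLS alert record, which is what a TLS listener typically sends back when it receives a plain-HTTP request. Below is a minimal Python sketch that decodes them using the record layout from RFC 5246. The function name `decode_tls_record` and the lookup tables are illustrative only (they are not part of Heritrix), and the tables list just a few common values.

```python
# Illustrative decoder for a single TLS record header plus alert payload.
# Names here are hypothetical helpers, not Heritrix code.

RAW = bytes.fromhex("15 03 03 00 02 02 0A")  # bytes quoted in the report

# Partial lookup tables (values from RFC 5246); not exhaustive.
CONTENT_TYPES = {20: "change_cipher_spec", 21: "alert",
                 22: "handshake", 23: "application_data"}
VERSIONS = {(3, 1): "TLS 1.0", (3, 2): "TLS 1.1", (3, 3): "TLS 1.2"}
ALERT_LEVELS = {1: "warning", 2: "fatal"}
ALERT_DESCRIPTIONS = {0: "close_notify", 10: "unexpected_message",
                      40: "handshake_failure", 70: "protocol_version"}


def decode_tls_record(raw: bytes) -> dict:
    """Decode the 5-byte record header and, for alerts, the 2-byte body."""
    if len(raw) < 5:
        raise ValueError("a TLS record header is 5 bytes long")
    content_type = raw[0]
    version = (raw[1], raw[2])
    length = int.from_bytes(raw[3:5], "big")
    payload = raw[5:5 + length]
    record = {
        "content_type": CONTENT_TYPES.get(content_type, f"unknown({content_type})"),
        "version": VERSIONS.get(version, f"unknown{version}"),
        "length": length,
    }
    # An alert body is exactly two bytes: level, then description.
    if content_type == 21 and len(payload) == 2:
        level, description = payload
        record["alert_level"] = ALERT_LEVELS.get(level, f"unknown({level})")
        record["alert_description"] = ALERT_DESCRIPTIONS.get(
            description, f"unknown({description})")
    return record


if __name__ == "__main__":
    for key, value in decode_tls_record(RAW).items():
        print(f"{key}: {value}")
```

Read this way, the response is a fatal `unexpected_message` alert in a TLS 1.2 record: the server was expecting a TLS handshake and got an HTTP request instead, which fits the fix of switching the URL to `https://`.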

anjackson commented 6 years ago

We need to improve the docs because 3.2 is buggy. See https://github.com/internetarchive/heritrix3/wiki#latest-releases and https://trello.com/c/Inb8MW5w/29-establish-offical-heritrix-releases

bitsgalore commented 6 years ago

@anjackson Thanks Andy. Turns out the docs actually mention this but I had overlooked it. Works now!

I'll close this issue now.

guitarscape commented 5 years ago

It seems that 3.3 does not allow HTTP access? Is there a way to enforce HTTP (not HTTPS) so that we can use Heritrix behind a proxy?

xdpirate commented 1 year ago

I just ran into this same problem. An HTTP-to-HTTPS redirect, or a note in the docs saying you must access the web UI over HTTPS, would be appreciated.