internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Command-line install trouble #332

Closed by mlforcada 1 year ago

mlforcada commented 4 years ago

Dear devs, I have tried to install heritrix3 on the command line on my system and have found a problem. The output of uname -a is:

Linux lenovomlf 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Here's what I've done (both with a fresh clone from your master and with the latest release, heritrix3-3.4.0-20200304). The example below uses the release unpacked at /home/mlf/tmp/heritrix3-3.4.0-20200304:

mvn install

which runs successfully

[INFO] BUILD SUCCESS

and then, following the install guide:

export HERITRIX_HOME=/home/mlf/tmp/heritrix3-3.4.0-20200304/dist/src/main/
chmod u+x $HERITRIX_HOME/bin/heritrix
$HERITRIX_HOME/bin/heritrix --help

When I run the last command I get the error:

ls: cannot access '/home/mlf/tmp/heritrix3-3.4.0-20200304/dist/src/main//lib/*.jar': No such file or directory
Error: Could not find or load main class org.archive.crawler.Heritrix

The problem is that there is no lib directory in dist/src/main/. This also happens with a fresh clone from master. Maybe I am doing something wrong; can you please help?

ato commented 4 years ago

Hi Mikel,

The install guide you linked assumes that the user is installing the precompiled application, not building it from source. The step you seem to be missing when building from source is that the Maven build produces the binary distribution tar file at dist/target/heritrix-3.4.0-SNAPSHOT-dist.tar.gz. As mentioned in the install guide, you will need to unpack the dist tar somewhere to install Heritrix. The bin/heritrix script assumes HERITRIX_HOME points at the unpacked binary distribution and, as you found, will not work when pointed at the source code.

For example:

cd ~/tmp
tar -zxvf heritrix3/dist/target/heritrix-3.4.0-SNAPSHOT-dist.tar.gz
export HERITRIX_HOME=$PWD/heritrix-3.4.0-SNAPSHOT
$HERITRIX_HOME/bin/heritrix --help
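
To catch a mispointed HERITRIX_HOME before launching, a small check along these lines may help. This is only a sketch: `check_heritrix_home` is a hypothetical helper, not part of Heritrix, and it assumes (based on the error output above) that the launcher needs `bin/heritrix` and at least one jar under `lib/`.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Heritrix.
# Returns 0 if the given directory (default: $HERITRIX_HOME) looks like an
# unpacked binary distribution, i.e. it has bin/heritrix and lib/*.jar.
check_heritrix_home() {
    home=${1:-${HERITRIX_HOME:-}}
    if [ -z "$home" ]; then
        echo "HERITRIX_HOME is not set" >&2
        return 1
    fi
    if [ ! -f "$home/bin/heritrix" ]; then
        echo "missing launcher: $home/bin/heritrix" >&2
        return 1
    fi
    # The error reported above came from listing $HERITRIX_HOME/lib/*.jar,
    # so require at least one jar file there.
    for jar in "$home"/lib/*.jar; do
        if [ -f "$jar" ]; then
            return 0
        fi
    done
    echo "no jars under $home/lib (source tree instead of unpacked dist?)" >&2
    return 1
}
```

Once the function is defined, something like `check_heritrix_home && "$HERITRIX_HOME/bin/heritrix" --help` runs the launcher only when the layout looks right. Pointed at dist/src/main/ in the source tree, the check should fail with the "no jars" message.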

I should also warn that the guide you linked to is quite old and was written for Heritrix 1.x. I think that part of the process is still the same, but beware that other information on the old SourceForge site is likely very out of date. Heritrix 3 documentation is in the GitHub wiki (although there are unfortunately a lot of out-of-date pages there too).

Hope that helps,

Alex

mlforcada commented 4 years ago

Thanks a million, ato!