This is meltmedia's fork of Apache Nutch to support our internal search efforts.
Nutch uses Ant to build. Clone the repository and run ant.
Nutch is a little difficult to get running at first. Here are the basic instructions for performing a crawl and getting the crawl data into Elasticsearch.
Nutch needs a fair amount of non-obvious configuration before it will run. I have started adding a configuration
to this repository for crawling a locally mounted AFP volume. After building, change your directory to runtime/afp
to use this configuration. See the readme in that directory for more instructions.
To perform a crawl, you feed Nutch seed URLs from a file, and it populates a local crawl database. To get started, create a urls directory and append your URL to the seed file (one URL per line):
mkdir urls
echo 'URL' >> urls/seed.txt
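Because the echo above appends, running it twice leaves the same URL in the seed file twice. A minimal sketch of a helper that skips duplicates follows. The add_seed function, the SEED_DIR variable, and the example.com URLs are hypothetical and are not part of Nutch or this repository.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: keeps urls/seed.txt free of duplicates.
SEED_DIR="${SEED_DIR:-urls}"
SEED_FILE="$SEED_DIR/seed.txt"

add_seed() {
    # Create the seed directory and file if they do not exist yet.
    mkdir -p "$SEED_DIR"
    touch "$SEED_FILE"
    # -x matches the whole line, -F treats the URL as a fixed string, -q is silent.
    if ! grep -qxF -- "$1" "$SEED_FILE"; then
        printf '%s\n' "$1" >> "$SEED_FILE"
    fi
}

# Placeholder URLs; replace them with the sites you want to crawl.
add_seed 'http://example.com/'
add_seed 'http://example.com/docs/'
add_seed 'http://example.com/'    # already present, so it is skipped
```

The seed file stays in the one-URL-per-line form that the crawl command reads.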
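Then start the crawler. Here -dir names the output directory, -depth limits how many link levels to follow from the seeds, and -topN caps the number of pages fetched at each level: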
./bin/nutch crawl urls -dir crawl -depth 10 -topN 5
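Once the crawl is complete, feed the data into Elasticsearch using the ElasticsearchIndexer. Replace ELASTICSEARCH_DOMAIN with your Elasticsearch host; 9300 is Elasticsearch's default transport port: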
./bin/nutch org.apache.nutch.indexer.elasticsearch.ElasticsearchIndexer ELASTICSEARCH_DOMAIN 9300 crawl/crawldb crawl/linkdb crawl/segments/*
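The crawl and index steps can be chained in a small wrapper script. This is only a sketch: the run and crawl_and_index functions and the SEED_DIR, CRAWL_DIR, ES_HOST, ES_PORT, and DRY_RUN variables are hypothetical names, and the two nutch commands are copied from the steps above. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands shown above.
SEED_DIR="${SEED_DIR:-urls}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
ES_HOST="${ES_HOST:-ELASTICSEARCH_DOMAIN}"   # placeholder, set to your host
ES_PORT="${ES_PORT:-9300}"
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands only

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Print the command line instead of running it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

crawl_and_index() {
    # The && ensures indexing only starts if the crawl succeeded.
    run ./bin/nutch crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth 10 -topN 5 &&
    run ./bin/nutch org.apache.nutch.indexer.elasticsearch.ElasticsearchIndexer \
        "$ES_HOST" "$ES_PORT" "$CRAWL_DIR/crawldb" "$CRAWL_DIR/linkdb" \
        "$CRAWL_DIR"/segments/*
}

crawl_and_index
```

Run it with DRY_RUN=0 and ES_HOST set to your Elasticsearch host to execute the commands for real. In a dry run before any crawl exists, the segments glob has nothing to match and is printed literally.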