infinilabs / crawler

🕷️ An easy-to-use spider written in Golang (previously named GOPA).

# GOPA, A Spider Written in Go.

Join the chat at https://gitter.im/infinitbyte/gopa

## Goal

## Screenshot

What a Spider! GOPA Spider!

## How to use

### Requirements

### Setup

First, get Gopa. There are two options: download the pre-built package or compile it yourself.

#### Download Pre-Built Package

Go to the Releases page and download the right package for your platform.

Note: the Darwin package is for Mac.

#### Compile The Package Manually

Requirements

Supported platform

For example:

    # Install Go first if needed, for example:
    #   apt install golang-go    (Debian/Ubuntu)
    #   brew install golang      (Mac)
    mkdir -p ~/go/src/github.com/infinitbyte/
    cd ~/go/src/github.com/infinitbyte/
    git clone https://github.com/infinitbyte/gopa.git
    cd gopa
    make

After a few minutes, you should have:

* `gopa`, the main program, a single binary.
* `gopa.yml`, the main configuration file for Gopa.

### Required Config

Note: the Elasticsearch version must be >= v5.3.

</details></p>
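The Elasticsearch version requirement noted above can be checked before starting Gopa. The following is a minimal sketch, not part of Gopa itself: `es_version_ok` is a hypothetical helper name, it assumes a version string of the form `major.minor[.patch]`, and the node address in the usage comment (`http://localhost:9200`) is an assumption that should be adjusted to your setup.

```shell
# Hypothetical helper: succeed when an Elasticsearch version string is >= 5.3.
# Assumes the form major.minor[.patch], e.g. "5.6.3".
es_version_ok() {
  major="${1%%.*}"      # text before the first dot
  rest="${1#*.}"        # text after the first dot
  minor="${rest%%.*}"   # text before the next dot
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 3 ]; }
}

# Usage against a live node (the address is an assumption):
#   v=$(curl -s http://localhost:9200 | sed -n 's/.*"number" *: *"\([^"]*\)".*/\1/p')
#   es_version_ok "$v" && echo "Elasticsearch $v is supported"
```

The comparison is done on the major and minor numbers only, so any patch release of a supported minor version passes.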

### Start

Besides Elasticsearch, Gopa has no other dependencies. Simply run `./gopa` to start the program.

Gopa can also be run as a daemon (_Note: only available on Linux and Mac_):
<p><details>
  <summary>Example</summary>
  <pre>
➜  gopa git:(master) ✗ ./bin/gopa --daemon
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 99616a2, Fri Oct 20 14:04:54 2017 +0200, medcl, update version to 0.10.0 ///

[10-21 16:01:09] [INF] [instance.go:23] workspace: data/gopa/nodes/0
[gopa] started.</pre>
</details></p>

Run `./gopa -h` to get the full list of command-line options.
<p><details>
  <summary>Example</summary>
  <pre>
➜  gopa git:(master) ✗ ./bin/gopa -h
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 99616a2, Fri Oct 20 14:04:54 2017 +0200, medcl, update version to 0.10.0 ///

Usage of ./bin/gopa:
  -config string
        the location of config file (default "gopa.yml")
  -cpuprofile string
        write cpu profile to this file
  -daemon
        run in background as daemon
  -debug
        run in debug mode, gopa will quit with panic error
  -log string
        the log level,options:trace,debug,info,warn,error (default "info")
  -log_path string
        the log path (default "log")
  -memprofile string
        write memory profile to this file
  -pidfile string
        pidfile path (only for daemon)
  -pprof string
        enable and setup pprof/expvar service, eg: localhost:6060 , the endpoint will be: http://localhost:6060/debug/pprof/ and http://localhost:6060/debug/vars</pre>
</details></p>
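Several of the options listed in the help output can be combined in one invocation. The sketch below wraps such a call in a shell function. The function name `start_gopa_debug` and the variables `GOPA_BIN` and `PPROF_ADDR` are illustrative names that are not part of Gopa; only the flags themselves come from the help text.

```shell
# Hypothetical wrapper around flags listed by `./gopa -h`.
# GOPA_BIN and PPROF_ADDR are illustrative variables, not Gopa settings.
start_gopa_debug() {
  "${GOPA_BIN:-./gopa}" \
    -config gopa.yml \
    -log debug \
    -log_path log \
    -pprof "${PPROF_ADDR:-localhost:6060}" \
    "$@"   # extra flags, e.g. -daemon -pidfile gopa.pid
}

# Example: start_gopa_debug -daemon -pidfile gopa.pid
# Per the help text, the pprof/expvar endpoints are then served at:
#   http://localhost:6060/debug/pprof/
#   http://localhost:6060/debug/vars
```

Passing `"$@"` through keeps the wrapper open to any other option from the list without having to edit the function.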

### Stop

It is safe to press `ctrl+c` to stop a running Gopa. Gopa will handle the rest, saving a checkpoint so that
you can restore the job later. The world is still in your hands.

If you are running `Gopa` as a daemon, you may stop it like this:

    kill -QUIT $(pgrep gopa)
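
When several Gopa processes run on one machine, matching by process name may signal more than one of them. As an alternative sketch, the `-pidfile` option from the help output can be used to target a single daemon. The helper name `stop_by_pidfile` is hypothetical, and the code assumes that the daemon was started with `./gopa -daemon -pidfile gopa.pid` and that the pidfile contains a single numeric PID.

```shell
# Hypothetical helper: signal a daemonized Gopa through its pidfile.
# Assumes the pidfile holds one numeric PID (an assumption about Gopa's format).
stop_by_pidfile() {
  pidfile="$1"
  sig="${2:-QUIT}"   # QUIT is the signal used in the kill command above
  if [ ! -r "$pidfile" ]; then
    echo "pidfile not readable: $pidfile" >&2
    return 1
  fi
  pid="$(cat "$pidfile")"
  kill -s "$sig" "$pid"
}

# Example: stop_by_pidfile gopa.pid
```

The optional second argument exists only so that another signal name can be supplied; by default the helper sends `QUIT`.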



## Configuration

## UI

* Search Console `http://127.0.0.1:9000/`
* Admin Console  `http://127.0.0.1:9000/admin/`

## API

## Architecture

<img width="800" alt="What a Spider! GOPA Spider!" src="https://raw.githubusercontent.com/infinitbyte/gopa/master/docs/assets/img/architecture-v1.png">

## Who uses it?

Do you use GOPA and want to be listed here? [Contact me](https://medcl.com).

License
=======
Released under the [Apache License, Version 2.0](https://github.com/infinitbyte/gopa/blob/master/LICENSE).