infinilabs / crawler

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

gopa against localhost #24

Closed: vpatel-code closed this issue 6 years ago

vpatel-code commented 6 years ago

Can I run this project against a local Elasticsearch instance (localhost)? What changes are needed in gopa.yml?

I tried it and was able to follow all the steps; see the screenshot below: [image]

The search results are empty, so I am not sure whether the data is getting indexed.

medcl commented 6 years ago
  1. First, edit the configuration (gopa.yml):
- name: index
  enabled: true
  ui:
    enabled: true
#    site_name: Elasticsearch
#    logo: https://static-www.elastic.co/cn/assets/blt6050efb80ceabd47/elastic-logo (2).svg?q=294
#    favicon: https://www.elastic.co/favicon.ico
  elasticsearch:
    endpoint: http://localhost:9200
    index_prefix: gopa-
    username: elastic
    password: changeme
  2. Create the mapping:
curl --user elastic:changeme -XPUT "http://localhost:9200/gopa-index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "doc": {
      "properties": {
        "host": {
          "type": "keyword",
          "ignore_above": 256
        },
        "snapshot": {
          "properties": {
            "bold": {
              "type": "text"
            },
            "url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "content_type": {
              "type": "keyword",
              "ignore_above": 256
            },
            "file": {
              "type": "keyword",
              "ignore_above": 256
            },
            "ext": {
              "type": "keyword",
              "ignore_above": 256
            },
            "h1": {
              "type": "text"
            },
            "h2": {
              "type": "text"
            },
            "h3": {
              "type": "text"
            },
            "h4": {
              "type": "text"
            },
            "hash": {
              "type": "keyword",
              "ignore_above": 256
            },
            "id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "images": {
              "properties": {
                "external": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "internal": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "italic": {
              "type": "text"
            },
            "links": {
              "properties": {
                "external": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "internal": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "path": {
              "type": "keyword",
              "ignore_above": 256
            },
            "sim_hash": {
              "type": "keyword",
              "ignore_above": 256
            },
            "lang": {
              "type": "keyword",
              "ignore_above": 256
            },
            "screenshot_id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "size": {
              "type": "long"
            },
            "text": {
              "type": "text"
            },
            "title": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              }
            },
            "version": {
              "type": "long"
            }
          }
        },
        "task": {
          "properties": {
            "breadth": {
              "type": "long"
            },
            "created": {
              "type": "date"
            },
            "depth": {
              "type": "long"
            },
            "id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "original_url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "reference_url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "schema": {
              "type": "keyword",
              "ignore_above": 256
            },
            "status": {
              "type": "integer"
            },
            "updated": {
              "type": "date"
            },
            "url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "last_screenshot_id": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}'
  3. Restart Gopa.

Please let me know whether that works.
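
One way to check whether documents are actually reaching Elasticsearch after the restart is to ask the index for its document count. The following is only a sketch, not part of Gopa: it uses the Python standard library against Elasticsearch's `_count` endpoint, reuses the endpoint, index name, and credentials from the steps above, and the helper names `build_count_request` and `fetch_count` are hypothetical.

```python
import base64
import json
import urllib.request


def build_count_request(endpoint, index, username, password):
    """Build a GET request for the index's _count endpoint with HTTP basic auth."""
    url = f"{endpoint.rstrip('/')}/{index}/_count"
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_count(request, timeout=5):
    """Send the request and return the "count" field of the JSON response.

    Needs a reachable Elasticsearch node; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["count"]


# Values taken from the configuration and curl command above.
request = build_count_request(
    "http://localhost:9200", "gopa-index", "elastic", "changeme"
)
print(request.full_url)
```

With Elasticsearch running, `fetch_count(request)` returns the number of documents in `gopa-index`. If it stays at 0 while the crawler is running, nothing is being indexed yet.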

vpatel-code commented 6 years ago

Thank you medcl. That worked :).

1) What types of pages does this crawl out of the box? Does it crawl PDFs, docs, etc. and read their metadata? If not, can you please give some pointers for the configuration?
2) If I want to read a custom property from the page source while crawling, say metaxyz, where is that configured? Is it in the index mapping file?

Thanks.

medcl commented 6 years ago
  1. There is no limit. Gopa uses a filter joint to decide which URLs you want and which you do not; it is set in the pipeline config:

    file_ext_match_rule:
      should:
        prefix: []
        contain: []
        suffix: []
      must:
        prefix: []
        contain: []
        suffix: []
      must_not:
        contain: [zip, exe, jar, js, css, rar, gz, bmp, jpeg, gif, png, jpg, apk]
        prefix: []
        suffix: []
  2. Sure, you can use the extract joint to extract any DOM element with a CSS selector, also in the pipeline config:

    - joint: extract
      enabled: false
      parameters:
        html_block:
          your_tag_name1: ".tag_class"
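
To make the `file_ext_match_rule` from item 1 more concrete, here is a minimal Python sketch of how such a rule could be evaluated against a file extension. The semantics are an assumption modelled on Elasticsearch-style boolean clauses: every `must` condition has to match, no `must_not` condition may match, and at least one `should` condition has to match when any are listed. Gopa's actual implementation may differ, and the names `_conditions`, `extension_allowed`, and `RULE` are hypothetical.

```python
def _conditions(clause):
    """Flatten one clause (should / must / must_not) into a list of predicates."""
    checks = []
    for value in clause.get("prefix", []):
        checks.append(lambda ext, v=value: ext.startswith(v))
    for value in clause.get("contain", []):
        checks.append(lambda ext, v=value: v in ext)
    for value in clause.get("suffix", []):
        checks.append(lambda ext, v=value: ext.endswith(v))
    return checks


def extension_allowed(ext, rule):
    """Return True if the extension passes the rule (assumed semantics only)."""
    must = _conditions(rule.get("must", {}))
    must_not = _conditions(rule.get("must_not", {}))
    should = _conditions(rule.get("should", {}))

    if any(check(ext) for check in must_not):
        return False
    if not all(check(ext) for check in must):
        return False
    # An empty "should" clause is treated as "no preference".
    return not should or any(check(ext) for check in should)


# The must_not list from the config above; the other clauses are empty there.
RULE = {
    "must_not": {
        "contain": ["zip", "exe", "jar", "js", "css", "rar", "gz",
                    "bmp", "jpeg", "gif", "png", "jpg", "apk"],
    },
}

print(extension_allowed("pdf", RULE))   # True: no must_not entry matches
print(extension_allowed("png", RULE))   # False: "png" is listed in must_not
print(extension_allowed("json", RULE))  # False: "js" is a substring of "json"
```

The last line shows a side effect of reading `contain` as substring matching. If that reading is right, an entry such as `js` would also exclude `json`, so a `suffix` or exact-match condition may be the safer choice for such cases.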
medcl commented 6 years ago

Closing. For other questions, please open another issue.

vpatel-code commented 6 years ago

Thank you, medcl.