vpatel-code closed this issue 6 years ago
- name: index
  enabled: true
  ui:
    enabled: true
    # site_name: Elasticsearch
    # logo: https://static-www.elastic.co/cn/assets/blt6050efb80ceabd47/elastic-logo (2).svg?q=294
    # favicon: https://www.elastic.co/favicon.ico
  elasticsearch:
    endpoint: http://localhost:9200
    index_prefix: gopa-
    username: elastic
    password: changeme
curl --user elastic:changeme -XPUT "http://localhost:9200/gopa-index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "doc": {
      "properties": {
        "host": {
          "type": "keyword",
          "ignore_above": 256
        },
        "snapshot": {
          "properties": {
            "bold": {
              "type": "text"
            },
            "url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "content_type": {
              "type": "keyword",
              "ignore_above": 256
            },
            "file": {
              "type": "keyword",
              "ignore_above": 256
            },
            "ext": {
              "type": "keyword",
              "ignore_above": 256
            },
            "h1": {
              "type": "text"
            },
            "h2": {
              "type": "text"
            },
            "h3": {
              "type": "text"
            },
            "h4": {
              "type": "text"
            },
            "hash": {
              "type": "keyword",
              "ignore_above": 256
            },
            "id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "images": {
              "properties": {
                "external": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "internal": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "italic": {
              "type": "text"
            },
            "links": {
              "properties": {
                "external": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "internal": {
                  "properties": {
                    "label": {
                      "type": "text"
                    },
                    "url": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "path": {
              "type": "keyword",
              "ignore_above": 256
            },
            "sim_hash": {
              "type": "keyword",
              "ignore_above": 256
            },
            "lang": {
              "type": "keyword",
              "ignore_above": 256
            },
            "screenshot_id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "size": {
              "type": "long"
            },
            "text": {
              "type": "text"
            },
            "title": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              }
            },
            "version": {
              "type": "long"
            }
          }
        },
        "task": {
          "properties": {
            "breadth": {
              "type": "long"
            },
            "created": {
              "type": "date"
            },
            "depth": {
              "type": "long"
            },
            "id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "original_url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "reference_url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "schema": {
              "type": "keyword",
              "ignore_above": 256
            },
            "status": {
              "type": "integer"
            },
            "updated": {
              "type": "date"
            },
            "url": {
              "type": "keyword",
              "ignore_above": 256
            },
            "last_screenshot_id": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}'
Please let me know if that works.
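One way to check that the index created above exists and whether documents are arriving is to ask Elasticsearch for its document count. The following is a minimal sketch, assuming the endpoint, index name, and credentials from the config above; the helper names (`build_count_request`, `parse_count`, `fetch_count`) are hypothetical and not part of Gopa.

```python
import base64
import json
import urllib.request


def build_count_request(endpoint, index, username, password):
    """Build a GET request for Elasticsearch's _count API with basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    url = f"{endpoint.rstrip('/')}/{index}/_count"
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def parse_count(body):
    """Extract the document count from a _count response body."""
    return json.loads(body)["count"]


def fetch_count(endpoint, index, username, password):
    """Send the request and return the number of documents in the index."""
    req = build_count_request(endpoint, index, username, password)
    with urllib.request.urlopen(req) as resp:
        return parse_count(resp.read().decode())
```

Calling `fetch_count("http://localhost:9200", "gopa-index", "elastic", "changeme")` while the crawler is running should return a growing number if documents are being indexed; a count that stays at 0 would suggest the crawler is not writing to that index.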
Thank you medcl. That worked :).
1) What types of pages does this crawl out of the box? Does it crawl PDFs, docs, etc. and read their metadata? If not, can you please give some pointers on the configuration?
2) If I want to read a custom property from the page source while crawling, say metaxyz, where is that configured? Is it in the index mapping file?
Thanks.
There is no limit. Gopa uses filter joints to decide which URLs you want and which you do not; this is set in the pipeline config:
file_ext_match_rule:
  should:
    prefix: []
    contain: []
    suffix: []
  must:
    prefix: []
    contain: []
    suffix: []
  must_not:
    contain: [zip, exe, jar, js, css, rar, gz, zip, bmp, jpeg, gif, png, jpg, apk]
    prefix: []
    suffix: []
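To make the rule layout above easier to reason about, here is a small sketch of how such a must / should / must_not rule could be evaluated against a file extension. The semantics are an assumption, not taken from Gopa's source: any `must_not` hit rejects, every listed `must` pattern has to match, and at least one `should` pattern has to match if any are listed. The names `is_allowed` and `RULES` are hypothetical.

```python
def _hits(value, rule):
    """Yield one boolean per pattern listed under prefix/contain/suffix."""
    for pattern in rule.get("prefix", []):
        yield value.startswith(pattern)
    for pattern in rule.get("contain", []):
        yield pattern in value
    for pattern in rule.get("suffix", []):
        yield value.endswith(pattern)


def is_allowed(value, rules):
    """Apply the assumed must / should / must_not semantics to one value."""
    must_not = list(_hits(value, rules.get("must_not", {})))
    must = list(_hits(value, rules.get("must", {})))
    should = list(_hits(value, rules.get("should", {})))
    if any(must_not):
        return False          # an excluded pattern matched
    if not all(must):
        return False          # a required pattern did not match
    if should and not any(should):
        return False          # optional patterns listed, none matched
    return True


# Mirrors the must_not list from the config above; other sections are empty.
RULES = {
    "must_not": {
        "contain": ["zip", "exe", "jar", "js", "css", "rar", "gz",
                    "bmp", "jpeg", "gif", "png", "jpg", "apk"],
    },
}
```

Under these assumed semantics, `is_allowed("html", RULES)` and `is_allowed("pdf", RULES)` pass while `is_allowed("png", RULES)` is rejected, which matches the statement that nothing is excluded unless a rule says so.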
Sure, you can use the extract joint to extract any DOM object using a CSS selector, also in the pipeline config:
- joint: extract
  enabled: false
  parameters:
    html_block:
      your_tag_name1: ".tag_class"
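As an illustration of what a selector such as `.tag_class` picks out of a page, the sketch below collects the text of every element carrying a given class, keyed by the tag name from `html_block`. It uses only Python's standard library, supports only simple `.class` selectors, and is not how Gopa implements the extract joint; `ClassTextExtractor` and `extract_blocks` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text content of every element with a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested element inside a match
        elif self.css_class in classes:
            self.depth = 1              # entering a new matching element
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.blocks[-1] += data


def extract_blocks(html, html_block):
    """Map each configured name to the texts matched by its '.class' selector."""
    result = {}
    for name, selector in html_block.items():
        parser = ClassTextExtractor(selector.lstrip("."))
        parser.feed(html)
        result[name] = parser.blocks
    return result
```

For example, `extract_blocks(page_html, {"your_tag_name1": ".tag_class"})` returns a dictionary whose `your_tag_name1` entry lists the text of each element with class `tag_class`.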
Closing; for other questions, please open another issue.
Thank you, medcl.
Can I run this project against a local Elasticsearch instance (localhost)? What changes are needed in gopa.yml?
I tried it and was able to follow all the steps; see below:
The search results are empty, so not sure if the data is getting indexed.