Open RickyLau opened 6 years ago
working on it.
After setting up GOPA, it's stuck in the Start stage and there is no way to stop it. I'm also not sure whether it's indexing anything.
Hi Medcl, I am so excited to find GOPA, as it seems promising for the internal site search I am trying to build. However, I can't get it to work. Can you please help?
Are you building from source, or did you download the latest release? @Jasmi77
@Jasmi77 the master branch is under heavy development. I suggest you download the v0.10 release package (https://github.com/infinitbyte/gopa/releases/tag/v0.10.0) and read this README: https://github.com/infinitbyte/gopa/tree/v0.10.0. Note that this version only supports SQLite as the persistent database (for tasks).
Hi - I'm new to ElasticSearch and have been experimenting with Gopa. I'm also having trouble understanding how to 'stop' the crawler. I've pointed it at a dev version of our site, and it seems to find just over 100 documents, but continues to generate a lot of tasks. It seems to be continuously crawling the site over and over.
The site is fairly static, so what I would like to do is have Gopa crawl the site once, and then we can re-index as content is updated. Is it possible to configure Gopa to do that? Or to know when it has finished its initial crawl?
Hi @daveX99, I have been a little busy recently (the best excuse I've got :) ). Regarding your questions:
About it generating a lot of tasks: can you check the task API at http://localhost:8001/tasks/ to see what's inside? The crawler automatically follows all the links on your site; you can filter them in this config section: https://github.com/infinitbyte/gopa/blob/master/gopa.yml#L52
About crawling the site only once: this is easy. By default, Gopa periodically checks the site for updates, and you can configure this as well: https://github.com/infinitbyte/gopa/blob/master/gopa.yml#L183
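For anyone who prefers to inspect the task API from a script rather than a browser, here is a minimal Go sketch. It assumes only what is stated above, namely that Gopa serves the task API at http://localhost:8001/tasks/. The helper name `fetchTasks` is made up for illustration, and the response format is not assumed, so the body is printed as-is.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchTasks is a hypothetical helper (not part of Gopa). It sends a GET
// request to the given URL and returns the HTTP status code and raw body.
func fetchTasks(url string) (int, []byte, error) {
	// A timeout keeps the call from hanging if Gopa is not running.
	client := &http.Client{Timeout: 5 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		return 0, nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, nil, err
	}
	return resp.StatusCode, body, nil
}

func main() {
	// URL taken from the comment above; adjust host and port to your setup.
	status, body, err := fetchTasks("http://localhost:8001/tasks/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("HTTP status:", status)
	// The response format is not assumed here, so print it unmodified.
	fmt.Println(string(body))
}
```

Running this repeatedly while the crawler is active should make it easier to see which URLs keep producing new tasks.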
@medcl :
I'm sure you are busy, so I appreciate your quick response.
I played a bit with the parameters to limit the URLs and that fixed my problem.
There are some oddities in the links on the site I am indexing, and that was causing a weird recursion in gopa. Once I set the parameters under url_match_rule, must_not, contain to exclude this link, the indexing ran to completion, and all succeeded.
I will probably need to play with the configuration in gopa.yml some more to fine tune the indexing. Is there any documentation on how those keys/values work?
Thanks again, -dave.
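To illustrate why an exclusion rule like the one dave describes stops the runaway crawl, here is a small Go sketch of a "must not contain" URL filter. This is not Gopa's actual implementation; the function name `shouldCrawl`, the example URLs, and the excluded substring are all hypothetical. It only shows the general idea: a URL containing any excluded substring is never turned into a task, so links that keep regenerating themselves are dropped.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldCrawl is a hypothetical helper (not Gopa code). It reports whether
// a URL passes a "must_not contain" style rule: the URL is rejected if it
// contains any of the listed substrings.
func shouldCrawl(url string, mustNotContain []string) bool {
	for _, s := range mustNotContain {
		// Skip empty patterns, which would otherwise match every URL.
		if s != "" && strings.Contains(url, s) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical exclusion list and URLs, for illustration only.
	exclude := []string{"/calendar/"}
	urls := []string{
		"http://dev.example.com/about",
		"http://dev.example.com/calendar/2019/01",
		"http://dev.example.com/calendar/2019/01/calendar/2019/02",
	}
	for _, u := range urls {
		fmt.Printf("%s crawl=%v\n", u, shouldCrawl(u, exclude))
	}
}
```

With the filter in place, the set of crawlable URLs becomes finite, which is consistent with the indexing running to completion once the rule was added.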
The documentation is an issue. A few tips on the configuration:
checker, another one is used to fetch resources, parse the content, and save it to Elasticsearch.
joint means a processing step of the pipeline flow.
max_go_routine is a parameter that sets how many concurrent tasks will run.
@medcl :
I will keep that in mind. I am still learning the basics of how all this fits together. At this point, I am able to index the site with gopa and get the data into elasticsearch. Indexing does not take more than a few minutes now.
If I have further questions, I will create a new question to the issue queue so that this one is not filled with off-topic issues.
Thanks again for your quick replies! -dave.
How to stop the crawler from running?