heycarsten / lcbo-api

A crawler and API server for Liquor Control Board of Ontario retail data
https://lcboapi.com
GNU General Public License v3.0
184 stars 44 forks

No Crawler information on running crawler #24

Open alagori opened 5 years ago

alagori commented 5 years ago

There is only a brief mention of the crawler, but no instructions on how to run it. If you could post the commands to run the crawler, I'd be more than happy to update the README with the information and a guide on how to use it.

craftdelivery commented 5 years ago

You can log into the container, then run the cron rake task:

docker exec -it lcbo-api_app_1 /bin/bash
rake cron

Or with docker-compose:

docker-compose exec app rake cron

chimemeh commented 5 years ago

how can one know when the crawl has completed?

craftdelivery commented 5 years ago

It takes a long time to complete. I got several errors near the end related to saving JSON to S3, but the crawl was a success. Open a rails console and check the counts.

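
The count check described above could look roughly like the sketch below. The model names Product, Store and Inventory are assumptions inferred from the total_products, total_stores and total_inventories columns on Crawl, and table_counts is a made-up helper name, not something from the repository.

```ruby
# Sketch: paste into a rails console.
# Model names passed in are assumptions; anything that is not defined
# (or has no .count) is reported as nil instead of raising.
def table_counts(model_names)
  model_names.each_with_object({}) do |name, counts|
    counts[name] = begin
      Object.const_get(name).count
    rescue NameError
      nil
    end
  end
end

table_counts(%w[Product Store Inventory Crawl]).each do |name, count|
  puts "#{name}: #{count.inspect}"
end
```

Running it before and after a crawl and comparing the numbers is one way to see whether anything was actually refreshed.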

chimemeh commented 5 years ago

I followed the instructions in the README, so my database already has the data from the January pull (i.e. count() returns values). The database does not appear to be refreshed with the latest data, which is why I'm not sure the crawl is actually active.

FYI, I am also new to Rails and Docker.

craftdelivery commented 5 years ago

I didn't pre-populate the data as specified in the README, but you should be able to run the crawler in any case. He called the task cron because that's how it was set up (to run at an interval).

In this case it was triggered by the Linux OS in the Docker container. See: config/crontab.txt

It's overkill for everybody who clones the repo to do this on a daily basis, so just run it manually once in a while:

docker-compose exec app rake cron

You will notice if it's running, as there is terminal output and it's very intensive on your machine.
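
Besides watching the terminal, the Crawl record itself carries state, total_jobs and total_finished_jobs columns (they appear in the console output later in this thread). A minimal sketch of summarizing them follows; whether those counters are updated while a crawl runs is an assumption, and crawl_summary and DemoCrawl are hypothetical names used only for illustration.

```ruby
# States that the thread identifies as "crawl still active".
ACTIVE_STATES = %w[init running paused].freeze

# Works on any object responding to state, total_jobs and
# total_finished_jobs, e.g. Crawl.order(:created_at).last in a
# rails console (assumed usage).
def crawl_summary(crawl)
  total = crawl.total_jobs.to_i
  done = crawl.total_finished_jobs.to_i
  percent = total.zero? ? 0.0 : (100.0 * done / total).round(1)
  active = ACTIVE_STATES.include?(crawl.state.to_s)
  "state=#{crawl.state} active=#{active} jobs=#{done}/#{total} (#{percent}%)"
end

# Stand-in record so the sketch also runs outside Rails.
DemoCrawl = Struct.new(:state, :total_jobs, :total_finished_jobs)
puts crawl_summary(DemoCrawl.new("running", 200, 50))
```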

If you look in lib/tasks/cron.rake you will see:

desc 'Run scheduled tasks'
task cron: :environment do
  Crawler.run
end

chimemeh commented 5 years ago

I'm guessing the Crawler is run automatically when you execute the command "docker-compose up"? I tried the command "docker-compose exec app rake cron" and got:

rake aborted!
Crawl is already running
/lcboapi/app/models/crawl.rb:47:in `init'
/lcboapi/lib/crawler.rb:5:in `init'
/lcboapi/lib/boticus/bot.rb:40:in `run'
/lcboapi/lib/tasks/cron.rake:3:in `block in <top (required)>'
Tasks: TOP => cron
(See full trace by running task with --trace)

craftdelivery commented 5 years ago

I'm getting that as well when trying to run it a second time. I think it's got something to do with Crawler state. Give me a minute...

craftdelivery commented 5 years ago

Run this in the rails console:

Crawl.where(state: [:init, :running, :paused])

The is_active check in app/models/crawl.rb looks for these states and will exit with "Crawl is already running".

Run this in the rails console, then run the cron task:

Crawl.where(state: [:init, :running, :paused]).destroy_all

chimemeh commented 5 years ago

The second command generated some error messages - not sure if it's normal. Then running the cron task showed the same "Crawl is already running" message. By the way, really appreciate you helping out!

Below is the output from executing the commands in rails.

Loading development environment (Rails 5.2.2)
[1] pry(main)> Crawl.where(state: [:init, :running, :paused])
=> Crawl Load (2.7ms) SELECT "crawls".* FROM "crawls" WHERE "crawls"."state" IN ($1, $2, $3) [["state", "init"], ["state", "running"], ["state", "paused"]]
[#<Crawl:0x000055decdfbe9e0
  id: 2810,
  crawl_event_id: nil,
  state: "init",
  task: nil,
  total_products: 0,
  total_stores: 0,
  total_inventories: 0,
  total_product_inventory_count: 0,
  total_product_inventory_volume_in_milliliters: 0,
  total_product_inventory_price_in_cents: 0,
  total_jobs: 0,
  total_finished_jobs: 0,
  store_ids: [],
  product_ids: [],
  added_product_ids: [],
  added_store_ids: [],
  removed_product_ids: [],
  removed_store_ids: [],
  created_at: Sun, 07 Apr 2019 01:28:49 UTC +00:00,
  updated_at: Sun, 07 Apr 2019 01:28:49 UTC +00:00>]
[2] pry(main)> Crawl.where(state: [:init, :running, :paused]).destroy_all
Crawl Load (2.4ms) SELECT "crawls".* FROM "crawls" WHERE "crawls"."state" IN ($1, $2, $3) [["state", "init"], ["state", "running"], ["state", "paused"]]
(0.5ms) BEGIN
Crawl Destroy (2.0ms) DELETE FROM "crawls" WHERE "crawls"."" = $1 [["", 2810]]
(0.4ms) ROLLBACK
ActiveRecord::StatementInvalid: PG::SyntaxError: ERROR: zero-length delimited identifier at or near """"
LINE 1: DELETE FROM "crawls" WHERE "crawls"."" = $1
                                            ^
: DELETE FROM "crawls" WHERE "crawls"."" = $1
from /usr/local/bundle/gems/activerecord-5.2.2/lib/active_record/connection_adapters/postgresql_adapter.rb:611:in `async_exec_params'
Caused by PG::SyntaxError: ERROR: zero-length delimited identifier at or near """"
LINE 1: DELETE FROM "crawls" WHERE "crawls"."" = $1
                                            ^
from /usr/local/bundle/gems/activerecord-5.2.2/lib/active_record/connection_adapters/postgresql_adapter.rb:611:in `async_exec_params'
[3] pry(main)>

craftdelivery commented 5 years ago

Try Crawl.find(2810).destroy, using any id returned by Crawl.where(state: [:init, :running, :paused]).

Or try reinstalling everything without importing the old data...
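
The failed statement in the log above is DELETE FROM "crawls" WHERE "crawls"."" = $1, which means ActiveRecord had no primary key column name to use. One guess is that the crawls table has no primary key, possibly as a side effect of the data import. Since destroy deletes each record by primary key, Crawl.find(2810).destroy is likely to fail as well. delete_all instead issues a single DELETE built from the where conditions and skips callbacks, so it may get past this. A minimal sketch, untested against this repository; clear_stuck_crawls is a hypothetical helper name:

```ruby
# States the thread says block a new crawl.
STUCK_STATES = %w[init running paused].freeze

# `model` is expected to be the Crawl class in a rails console.
# delete_all runs one DELETE ... WHERE state IN (...) without loading
# records or running callbacks, and returns the number of rows removed.
def clear_stuck_crawls(model)
  model.where(state: STUCK_STATES).delete_all
end

# In the rails console (assumed usage):
#   clear_stuck_crawls(Crawl)
```

Because callbacks are skipped, any cleanup that Crawl normally does on destroy will not run, so this is best treated as a last resort before rebuilding the database.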

chimemeh commented 5 years ago

Thanks for the suggestion, I'm not sure why it didn't work. I finally just deleted the db image:

docker rm lcbo-api-master_app_1

then restarted:

docker-compose up -d

then executed cron:

docker-compose exec app rake cron

And it's finally crawling! Yay! Thanks again for all your help.

joMclellan commented 4 years ago

Where did you find the db image? I'm having the same issue with my crawler @chimemeh

craftdelivery commented 4 years ago

I believe it will be created on initialization of the Rails app or on the first crawl. What do you have so far?