Dockerize Harvester for development purposes

mkrcah commented 7 years ago

This PR is WIP and is not complete.

The goal is to start the Harvester and all its dependencies with one command: docker-compose up. This command would start up Postgres & Redis, initialize the db and start the server.

This approach has the following advantages:

- implicit documentation of required system dependencies
- faster onboarding of new contributors (they only need to have Docker installed)
- identical environments across developers (and potentially across environments), if they opt to use Docker
- non-invasive, backwards-compatible improvement: current developer workflows and deployment pipelines can stay as they are, without the need to use Docker

To complete the PR, there are however a few open questions:

Web/REST API?

Does the application serve a website or a REST API? If not, what is the purpose of web: bundle exec puma -C config/puma.rb in the Procfile?
There is a lot of empty web-based boilerplate code. Removing it might simplify onboarding of new contributors, especially those not familiar with Ruby and the Rails framework. Is it common for Rails projects to keep the same boilerplate regardless of what the project does? Another option would be to strip out Rails and use plain Ruby scripting with selected libraries.

Required system dependencies

It seems that the application requires:
- Postgres to store scraped data
- Redis for job management (only for prod)
Is there any other required system dependency?
Currently, Redis is required only for Production. It might be interesting to spin up Redis for development as well with Docker.

How to run?

How to start the server, incl. the job scheduler (defined in clock.rb)? Is it bin/setup?
Is there a way to selectively run only part of the pipeline, e.g. only itms:all:sync jobs? If yes, how?
How do you usually run it?

Gems location

What is a common location for Ruby apps to install gems locally? When using Docker, it is common practice to install the dependencies locally so they persist between runs, in order to reduce the container startup time. Currently, I set the target directory to /.bundle.
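
To illustrate the idea, here is a minimal sketch of how gems could be kept in a named volume so they are not reinstalled on every container start. Only the /.bundle path comes from this PR; the service name app, the /app mount point and the volume name bundle_cache are hypothetical. BUNDLE_PATH is the standard Bundler setting for the install location.

```yaml
# Sketch only: persists installed gems across container restarts.
# Service name, mount point and volume name are assumptions for illustration.
version: "3"
services:
  app:
    build: .
    environment:
      BUNDLE_PATH: /.bundle      # tell Bundler where to install gems
    volumes:
      - .:/app                   # project sources from the host
      - bundle_cache:/.bundle    # gems survive container re-creation
volumes:
  bundle_cache:
```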

jsuchal commented 7 years ago

Harvester app is just a backend worker / scheduler. The web app process there... hm, not sure why it is there. We are using dokku for deployment, maybe it needs some web worker, will check that.

We had an internal discussion about using only plain old Ruby for this, but I am not a fan for a simple reason: from my experience, we need to enforce strict rules on the project. To simplify onboarding, it's just best to pick a de-facto standard for conventions (Rails). Yes, Rails has its quirks, but you have gems, jobs, testing, autoloading & all the nice tools every ruby/rails developer knows in place. I've seen too many projects that were not using Rails, and every time the onboarding and development was a lot harder.

If someone really wants to start pushing data to ecosystem/datahub from a non-standard env, it's doable. We just create a schema for you and a user that can write to our master database on that schema. We only need to agree on a table naming standard. You can use any language you want, but we can't guarantee any maintenance outside of our standard ruby/rails stack.

deps: redis/postgres is right. I would stick to compose.yml only for external dependencies, excluding Ruby.
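
For illustration, a compose file limited to the external dependencies might look roughly like the sketch below. It assumes the official postgres and redis images; the image tags, the password and the port mappings are placeholders and are not taken from this repository.

```yaml
# Sketch only: external dependencies, with Ruby running on the host.
# Image tags, password and ports are assumptions for illustration.
version: "3"
services:
  postgres:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: postgres  # placeholder, local development only
    ports:
      - "5432:5432"                # expose Postgres to the host
  redis:
    image: redis:3.2
    ports:
      - "6379:6379"                # expose Redis to the host
```

With a file like this, docker-compose up would start only Postgres and Redis, and the app itself would still be run with the locally installed Ruby.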

If anyone wants to do ruby/rails development, I assume he is familiar with rbenv/rvm and has ruby/bundler installed. We use .ruby-version and a Gemfile, so the dependency versions are locked anyway. I don't see a point in dockerizing this part, unless someone wants to run this without any ruby/rails knowledge. Not sure why anyone would want to do that on this particular project.

mkrcah commented 7 years ago

Thanks for the update, I have updated the code accordingly.

- compose only for external deps, not Ruby
- updated Readme with how to start development - is it correct?

I'm still struggling to get the job running. I submit a new job to the queue with rake itms:all:sync and see the job being queued in log/development.log. However, there is no worker picking up the job. How do I start and monitor the worker?

- script/rails: I understand. I saw a similar challenge in Python ETL apps.
- pushing from non-ruby: I will try to get my hands dirty with Ruby first, if that's ok :)
- ruby-in-docker: one advantage is for polyglot engineers who work on different stacks on one machine. There is no need to "pollute" the host machine with different installations of Python, Ruby, etc.; everything is encapsulated in Docker, incl. the interpreter. Recent IntelliJ IDEs can even hook into a remote interpreter running in Docker. I'll stick with on-host Ruby for now for Harvester.

jsuchal commented 7 years ago

If you want to start the app/harvester, you need to start the worker process. Look at https://github.com/slovensko-digital/harvester.ecosystem/blob/master/Procfile and https://ddollar.github.io/foreman/; this is a good way to run it. Just run foreman start and you are done.

mkrcah commented 7 years ago

Great, I got the Harvester fully up and running :) I have updated the README according to your remarks.

jsuchal commented 7 years ago

Thanks!

slovensko-digital / harvester.ecosystem

Dockerize Harvester for development purposes #20