nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0
36 stars 23 forks

code review Sep 23 #41

Closed nudles closed 5 years ago

nudles commented 5 years ago

Since we have a running version now, we can stop for a while on the functionality and try to make the code clean/tidy including the naming, folder structure, documentation. I am reviewing the code and listing some issues. Pls let me know your comments. Once we reach an agreement, we can proceed to update the code. The goal is to make the code easy to read and the system easy to use (which is more important than extensibility to me).

  1. shall we separate the db, admin and advisor into 3 dockers, or launch them in a single docker? if we put them together, it will simplify the launching process. the drawback is that if one of them fails, we have to restart the whole docker container. any other related issues?

  2. cache.Dockerfile and db.Dockerfile are not necessary? just pull from docker hub for the redis and postgres

  3. rename the workdir as /root/rafiki? app looks like the rafiki application. rename model.Dockerfile to worker.Dockerfile? any better name (a single word) for query_frontend?

  4. shall we rearrange the variables in .env.sh and src/config.py? e.g., put the environment variables (for running other tools: the db, redis, etc.) like postgres host and port in env.sh, and put rafiki variables (introduced by rafiki) like the super user in config.py. another solution is to put variables read by python programs into config.py and variables read by bash into env.sh (is any variable read by both python and bash?). the good thing about using config.py is that we copy the files into the docker container instead of exporting the variables explicitly. later, when we want to change or add a variable, we do not need to change the code.

  5. should the self._db.create_inference_job_worker be called inside self._create_service? In other words, the former should always be called inside the latter.

  6. stop_train_job_worker is not called by stop_train_job_services?

  7. rename stop_train_job_services to stop_train_services?

  8. Can we merge ServiceManager.py into Admin.py? The code in both files is similar; most of it is interacting with the databases. ServiceManager does not have many attributes/properties, and Admin has only one ServiceManager instance.

  9. Let's unify the notations for hyper-parameter tuning: one knob is one hyper-parameter; one trial is one assignment of all knobs of a model; one study tunes the hyper-parameters until the given budget is used up.

  10. I do not quite understand the advisor folder. From my view, I think the advisor service can be implemented in this way: keep a dict of (job_id -> advisor instance) in app.py; a new advisor instance is inserted; the next trial is generated by calling the corresponding advisor; each trial and its result are appended to a database; for failure recovery (low priority), we can recover the advisor by feeding the pairs from the database back into the advisor.

  11. Move the VGG and SingleHiddenLayerTensorflowModel into a top-level example folder. Only keep the base model in src.

  12. Try to use a flat folder structure as much as possible. The advantage is that the import statements would be simpler. It is not common to see a requirements.txt in every subfolder. If a project has multiple sub-packages that can be installed independently, then we had better have multiple requirements.txt files (one for each sub-package). For other cases, we usually put the requirements.txt at the top folder.

  13. Use consistent naming style. Name all files like xxx_yyy.py and all classes like XxxYyy.

  14. Move the json files at the root folder somewhere else?

nginyc commented 5 years ago

Since we have a running version now, we can stop for a while on the functionality and try to make the code clean/tidy including the naming, folder structure, documentation. I am reviewing the code and listing some issues. Pls let me know your comments. Once we reach an agreement, we can proceed to update the code. The goal is to make the code easy to read and the system easy to use (which is more important than extensibility to me).

Understood. When I wrote the code from scratch, I had extensibility in mind while trying to keep the code as neat as possible. However, that can make the code a bit verbose at times. I will take note that the focus should be on making it easy to read, as well as keeping it well documented.

  1. shall we separate the db, admin and advisor into 3 dockers, or launch them in a single docker? if we put them together, it will simplify the launching process. the drawback is that if one of them fails, we have to restart the whole docker container. any other related issues?

From what I understand about Docker containers, I highly recommend against running multiple apps/processes in the same Docker container. I think it is easier to manage and deploy components in distinct Docker containers. Check out the following articles:

If you're worried about the complexity of launching, I feel the scripts/start.sh and scripts/stop.sh I wrote handle the configuration, deployment and stopping of Rafiki with 1 command.

  2. cache.Dockerfile and db.Dockerfile are not necessary? just pull from docker hub for the redis and postgres

Ok, good point. I'll remove them and directly pull the default redis & postgres images.

What I was thinking:

  3. rename the workdir as /root/rafiki? app looks like the rafiki application. rename model.Dockerfile to worker.Dockerfile? any better name (a single word) for query_frontend?

Will rename the work directory.

Will rename rafiki_model to rafiki_worker, and model.Dockerfile to worker.Dockerfile.

For query_frontend: qfe (but I find it confusing and uninformative)? query?

  4. shall we rearrange the variables in .env.sh and src/config.py? e.g., put the environment variables (for running other tools: the db, redis, etc.) like postgres host and port in env.sh, and put rafiki variables (introduced by rafiki) like the super user in config.py. another solution is to put variables read by python programs into config.py and variables read by bash into env.sh (is any variable read by both python and bash?). the good thing about using config.py is that we copy the files into the docker container instead of exporting the variables explicitly. later, when we want to change or add a variable, we do not need to change the code.

Right now, most of the configuration of each component is done through environment variables, and the Python programs read from environment variables. I think some of these configuration variables are required by both Bash & Python:

Only Python: APP_SECRET (for common authentication in Python across all components), SUPERADMIN_EMAIL/SUPERADMIN_PASSWORD (for creating the superuser & train workers to authenticate against admin to tell admin when to stop train jobs)

Only Bash: DOCKER_NETWORK (for launching all components in the same network), RAFIKI_IMAGE_ADMIN/RAFIKI_IMAGE_ADVISOR (for building their images & launching with correct images)

Both Python & Bash: POSTGRES_HOST/POSTGRES_PORT/POSTGRES_USER/POSTGRES_DB/POSTGRES_PASSWORD/ADMIN_HOST/ADMIN_PORT/ADVISOR_HOST/ADVISOR_PORT/REDIS_HOST/REDIS_PORT (to launch static components and have components connect to one another successfully), LOGS_FOLDER_PATH (to mount the logs folder from host to each component while launching them in Docker containers & allowing the admin to know the value of the logs folder to forward it to dynamically deployed services e.g. workers), RAFIKI_IMAGE_WORKER/RAFIKI_IMAGE_QUERY_FRONTEND (for building their images & allowing the admin to know their values for dynamic deployment), RAFIKI_IP_ADDRESS (for advertising the right Docker swarm master address and returning the correct query frontend public IP address)

I will be removing some unnecessary configuration, moving Python-only configuration into config.py, and committing a default .env.sh to remove the need to create the env file during setup.

The way I see it, environment variables are the most portable/reliable way of configuring a program and of reading configuration as a program. For example, if I deploy a Python program with Docker's Python SDK that requires dynamic configuration, I can be sure I can pass in environment variables easily, whereas it becomes more complicated if I have to dynamically edit a Python configuration file and pass it to the Docker container. That's why I relied on them as much as possible.
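To make the split from point 4 concrete, here is a minimal sketch of what a config.py under this scheme could look like. The variable names are the ones discussed in this thread; the default values and the exact split are illustrative, not Rafiki's actual configuration.

```python
import os

class Config:
    """Sketch: Rafiki-introduced, Python-only settings are defined here,
    while deployment settings shared with Bash are read from environment
    variables (with illustrative fallback values)."""

    # Python-only configuration, defined directly in code
    APP_SECRET = 'rafiki'
    SUPERADMIN_EMAIL = 'superadmin@rafiki'

    # Shared with Bash: read from the environment, with fallbacks
    POSTGRES_HOST = os.environ.get('POSTGRES_HOST', 'rafiki_db')
    POSTGRES_PORT = int(os.environ.get('POSTGRES_PORT', '5432'))
```

This keeps the "change or add a variable without changing launch scripts" property for Python-only settings, while the shared ones still flow in as environment variables at container launch.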

  5. should the self._db.create_inference_job_worker be called inside self._create_service? In other words, the former should always be called inside the latter.

Don't quite understand this. Service is a generic abstraction of a Docker Swarm (or Kubernetes) service - a program with a set number of running replicas. An inference worker is an instance of a service with more metadata e.g. model trial to load. When an inference job is created:

  6. stop_train_job_worker is not called by stop_train_job_services?

It is? Line 75 of ServicesManager.py

  7. rename stop_train_job_services to stop_train_services?

Will do

  8. Can we merge ServiceManager.py into Admin.py? The code in both files is similar; most of it is interacting with the databases. ServiceManager does not have many attributes/properties, and Admin has only one ServiceManager instance.

Originally they were in the same Admin.py file, but I broke them up into 2 files mainly for readability (I didn't want Admin.py to be too long, and I wanted to group all the logic for service deployment in ServiceManager.py). Tell me if I should still combine them, but I feel it's better to keep them separate.

  9. Let's unify the notations for hyper-parameter tuning: one knob is one hyper-parameter; one trial is one assignment of all knobs of a model; one study tunes the hyper-parameters until the given budget is used up.

Okay. I believe I have renamed all hyperparameter terms to knob. Is there any unification to do in the codebase?
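For reference, the agreed terminology could be pinned down with a couple of plain types. This is an illustrative sketch only, not the actual rafiki classes:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Trial:
    """One trial = one assignment of all knobs of a model."""
    knobs: Dict[str, Any]            # knob name -> value; one knob = one hyper-parameter
    score: Optional[float] = None    # filled in once the trial completes

@dataclass
class Study:
    """One study = tuning the knobs until the given budget is used up."""
    budget: int                      # here: maximum number of trials
    trials: List[Trial] = field(default_factory=list)

    def is_budget_used_up(self) -> bool:
        return len(self.trials) >= self.budget
```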

  10. I do not quite understand the advisor folder. From my view, I think the advisor service can be implemented in this way: keep a dict of (job_id -> advisor instance) in app.py; a new advisor instance is inserted; the next trial is generated by calling the corresponding advisor; each trial and its result are appended to a database; for failure recovery (low priority), we can recover the advisor by feeding the pairs from the database back into the advisor.

When I coded advisor, my idea was to make it generic, extensible and informative, since we might be exposing its service to Rafiki's users. Therefore, I had the abstraction of Proposal (a set of knob values), manipulated Advisor and Proposal as objects with properties & relationships (see rafiki/advisor/store/Advisor.py and rafiki/advisor/store/Proposal.py), kept an in-memory store of {advisor_id -> advisor} and {proposal_id -> proposal} (see rafiki/advisor/store/AdvisorStore.py), and defined an abstract class rafiki/advisor/advisors/BaseAdvisor.py that concrete advisor classes should extend from. This way, we can swap out (or have the user configure) the implementation of advisor (e.g. use hyperopt instead of GP).
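A stripped-down sketch of that shape follows. The class and method names here are illustrative, not the actual rafiki API:

```python
import random
import uuid
from abc import ABC, abstractmethod

class BaseAdvisor(ABC):
    """Concrete advisors (e.g. GP- or hyperopt-based) extend this."""

    @abstractmethod
    def propose(self) -> dict:
        """Return the next set of knob values to try."""

    @abstractmethod
    def feedback(self, knobs: dict, score: float):
        """Inform the advisor of a completed trial's result."""

class RandomAdvisor(BaseAdvisor):
    """Trivial implementation: samples each knob uniformly."""

    def __init__(self, knob_ranges):
        self._knob_ranges = knob_ranges  # e.g. {'learning_rate': (1e-4, 1e-1)}

    def propose(self):
        return {name: random.uniform(lo, hi)
                for name, (lo, hi) in self._knob_ranges.items()}

    def feedback(self, knobs, score):
        pass  # a smarter advisor would update its internal model here

class AdvisorStore:
    """In-memory {advisor_id -> advisor}; the implementation is swappable."""

    def __init__(self):
        self._advisors = {}

    def create(self, advisor: BaseAdvisor) -> str:
        advisor_id = str(uuid.uuid4())
        self._advisors[advisor_id] = advisor
        return advisor_id

    def get(self, advisor_id: str) -> BaseAdvisor:
        return self._advisors[advisor_id]
```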

If you think that we should still "dumb-down" the code here, let me know.
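For comparison, the simpler flow described in point 10, including the replay-based failure recovery, might look like this. It is a sketch; the stub advisor and the function names are hypothetical:

```python
class CountingAdvisor:
    """Stub advisor used only to illustrate the flow below."""
    def __init__(self):
        self.trials_seen = 0
    def propose(self):
        return {'learning_rate': 0.01}
    def feedback(self, knobs, score):
        self.trials_seen += 1

advisors = {}   # job_id -> advisor instance, kept in app.py
history = []    # (job_id, knobs, score) rows; normally a database table

def next_trial(job_id):
    # next trial is generated by calling the corresponding advisor
    return advisors[job_id].propose()

def report_result(job_id, knobs, score):
    advisors[job_id].feedback(knobs, score)
    history.append((job_id, knobs, score))  # persist for recovery

def recover(job_id, make_advisor):
    # failure recovery: rebuild the advisor by replaying past pairs
    advisor = make_advisor()
    for jid, knobs, score in history:
        if jid == job_id:
            advisor.feedback(knobs, score)
    advisors[job_id] = advisor
    return advisor
```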

  11. Move the VGG and SingleHiddenLayerTensorflowModel into a top-level example folder. Only keep the base model in src.

Okay, will do so

  12. Try to use a flat folder structure as much as possible. The advantage is that the import statements would be simpler. It is not common to see a requirements.txt in every subfolder. If a project has multiple sub-packages that can be installed independently, then we had better have multiple requirements.txt files (one for each sub-package). For other cases, we usually put the requirements.txt at the top folder.

My logic for having multiple requirements.txt for most rafiki.XXX modules is that:

Is it better if we combine everything into 1 requirements.txt that installs all dependencies required by all components of Rafiki, for all images?

Where else should flattening of folder structure happen?

  13. Use consistent naming style. Name all files like xxx_yyy.py and all classes like XxxYyy.

Aside from rafiki/cache/cache.py (will rename), I don't see any file naming inconsistency...? Can you point out if there are any others?

  14. Move the json files at the root folder somewhere else?

Ok. Those are Postman collections that belong more in docs/ (I think). I'll move them there.