nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0
36 stars 23 forks

Random uuid name cause potential "No module named xxx" error during load parameters #159

Open vivansxu opened 5 years ago

vivansxu commented 5 years ago

I encountered a "No module named xxx" error when my model's load_parameters() was called while launching an inference job. Here is the error trace:

    2019-07-10 02:21:07,256 rafiki.utils.service INFO Starting worker "75be99ec25a6" for service of ID "614d740e-9791-4c64-aafe-dc17cf7e7866"...
    2019-07-10 02:21:07,511 rafiki.worker.inference INFO Starting inference worker for service of id 614d740e-9791-4c64-aafe-dc17cf7e7866...
    2019-07-10 02:21:07,519 rafiki.cache.cache INFO add_worker_of_inference_job:INFERENCE_WORKERS_b6592484-deb4-4df2-bce3-ffc82d9a125a=614d740e-9791-4c64-aafe-dc17cf7e7866
    2019-07-10 02:21:09,131 rafiki.utils.service ERROR Error while running worker:
    2019-07-10 02:21:09,131 rafiki.utils.service ERROR Traceback (most recent call last):
      File "/root/rafiki/utils/service.py", line 31, in run_worker
        start_worker(service_id, service_type, container_id)
      File "scripts/start_worker.py", line 24, in start_worker
        worker.start()
      File "/root/rafiki/worker/inference.py", line 41, in start
        self._model = self._load_model(trial_id)
      File "/root/rafiki/worker/inference.py", line 91, in _load_model
        model_inst.load_parameters(parameters)
      File "/root/e4568ce2-9d44-47b8-ac7f-1e8143168140.py", line 235, in load_parameters
    ModuleNotFoundError: No module named '797342b4-9d38-432f-91f6-727eac25db71'

After debugging, I figured out that this is a bug in how Rafiki interacts with pickle. It is triggered by pickling objects of self-defined classes (classes defined in the model's source code). Pickle requires the pickled object's class to be importable during pickle.loads(), via the same import path recorded during pickle.dumps(). However, each time a training trial or inference job is launched, a random UUID is used as the file name for the model's source code, so the import path at load time differs from the one at dump time. This bug has gone unnoticed so far because the existing Rafiki models only pickle objects of imported classes or Python "primitives", whose import paths stay consistent. Potential fixes could be:

  1. Change the randomly generated file name to a hash of something stable (e.g. model name + trial id), and use the same hashing scheme for both the train job and the inference job.
  2. Remember the generated name during the train job and reuse it during the inference job. (Model.load_model_class does take a third parameter, "temp_mod_name", but it is never passed except in "test_model_class".)
  3. Change how the model source file is imported. (Not sure how.)
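For context, the failure mode can be reproduced in a few lines without Rafiki at all, by loading a class under a throwaway module name and then dropping that module before unpickling (the module name below is a hypothetical stand-in for Rafiki's UUID-based file names):

```python
import pickle
import sys
import types

# Create a module under a throwaway name, mimicking Rafiki loading model
# source code under a random UUID-derived module name.
mod_name = "mod_797342b4_uuid_like"  # hypothetical stand-in for the UUID
mod = types.ModuleType(mod_name)
exec("class Network:\n    def __init__(self, n):\n        self.n = n", mod.__dict__)
sys.modules[mod_name] = mod

net = mod.Network(3)
# The pickle stream records 'mod_name.Network' as the class's import path.
data = pickle.dumps(net)

# Simulate the inference job: the source is re-loaded under a *different*
# random name, so the module recorded in the pickle no longer exists.
del sys.modules[mod_name]
try:
    pickle.loads(data)
except ModuleNotFoundError as e:
    print("reproduced:", e)
```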

Thank you!

nginyc commented 5 years ago

Hi @vivansxu, thanks for the bug report. I got the gist of the bug. To better understand, can you provide the implementation (or description of the implementation) of your model, or specifically for the load_parameters method?

vivansxu commented 5 years ago

Hi @nginyc, thanks for your reply. The following are my dump_parameters() and load_parameters():

def dump_parameters(self):
    params = {}
    with tempfile.NamedTemporaryFile() as tmp:
        # Pickle the three Network objects into the temp file
        pickle.dump((self.G, self.D, self.Gs), tmp, protocol=pickle.HIGHEST_PROTOCOL)
        tmp.flush()  # ensure buffered bytes reach disk before re-reading by name
        with open(tmp.name, 'rb') as f:
            h5_model_bytes = f.read()
        params['h5_model_base64'] = base64.b64encode(h5_model_bytes).decode('utf-8')
    return params

def load_parameters(self, params):
    h5_model_base64 = params.get('h5_model_base64')
    with tempfile.NamedTemporaryFile() as tmp:
        h5_model_bytes = base64.b64decode(h5_model_base64.encode('utf-8'))
        with open(tmp.name, 'wb') as f:
            f.write(h5_model_bytes)
        # Unpickling the Network objects is where ModuleNotFoundError is raised
        unpickler = pickle.Unpickler(tmp)
        self.G, self.D, self.Gs = unpickler.load()

self.G, self.D and self.Gs are all Network objects, where Network is a class I defined in my model file.
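One workaround on the model side (independent of any Rafiki fix) would be to pickle only plain-data state, rather than the Network instances themselves, so that no custom class's import path ends up in the pickle stream. A minimal sketch, using a hypothetical stand-in class since the real Network implementation isn't shown here:

```python
import base64
import pickle

class DemoNet:
    """Hypothetical stand-in for a self-defined Network class."""
    def __init__(self):
        # Plain dict of weights; built-in types unpickle without any
        # custom class being importable.
        self.weights = {'w': [1.0, 2.0], 'b': [0.5]}

def dump_parameters(nets):
    # Pickle only the weight dicts, not the network objects, so the
    # pickle stream contains no reference to a UUID-named module.
    states = tuple(net.weights for net in nets)
    raw = pickle.dumps(states, protocol=pickle.HIGHEST_PROTOCOL)
    return {'model_base64': base64.b64encode(raw).decode('utf-8')}

def load_parameters(params, nets):
    raw = base64.b64decode(params['model_base64'].encode('utf-8'))
    # Rebuild/keep the network objects locally and restore their state
    for net, state in zip(nets, pickle.loads(raw)):
        net.weights = state
```

The networks are reconstructed from the local (current) copy of the model source, so the module name at load time no longer matters.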

Thank you!

nginyc commented 5 years ago

Ok, I see. Do you want to try making a PR to fix this? It seems like you're already onto a fix; I would go with option 1.
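A sketch of option 1, assuming a helper along these lines replaces the uuid.uuid4() call wherever Rafiki names the model source file (the function name is hypothetical, not part of Rafiki's current API):

```python
import hashlib

def deterministic_mod_name(model_name, trial_id):
    # Derive the temp module name from stable identifiers instead of a
    # random UUID, so train and inference jobs compute the same name.
    digest = hashlib.sha1('{}:{}'.format(model_name, trial_id).encode('utf-8')).hexdigest()
    # Prefix with letters: module names must be valid Python identifiers.
    return 'rafiki_model_' + digest
```

Since both jobs hash the same (model name, trial id) pair, the import path recorded at pickle.dumps() time matches the one used at pickle.loads() time.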

Thanks for the help!