keras-team / autokeras

AutoML library for deep learning
http://autokeras.com/
Apache License 2.0
9.14k stars 1.4k forks

Current model size is too big. Discontinuing training this model to search for other models. #236

Closed dhlee-jubilee closed 6 years ago

dhlee-jubilee commented 6 years ago

Hi. I'm trying to apply AutoKeras to my own dataset. The dataset shape is (1380, 299, 299, 3), but training fails with the error "Current model size is too big. Discontinuing training this model to search for other models."

So I resized the images to roughly half size, (1380, 128, 128, 3), and tried again. It worked at first, but at the 9th model the same error appeared again.

What's the problem? What is the limit on model size?

rafaelmarconiramos commented 6 years ago

I had the same problem. This error is an out-of-memory problem: the searcher creates a very complex model that doesn't fit on the GPU. I haven't found a solution. If you find one, please share it with us.
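For reference, this condition surfaces as a RuntimeError whose message mentions "out of memory" (the PyTorch convention, which the excerpt from search.py later in this thread also relies on). A minimal sketch of that pattern, with a hypothetical train_fn standing in for the actual training call:

```python
import re

def train_safely(train_fn):
    """Run a hypothetical training function, swallowing only OOM errors."""
    try:
        return train_fn()
    except RuntimeError as e:
        # PyTorch reports GPU memory exhaustion as a RuntimeError whose
        # message contains "out of memory"; any other RuntimeError is
        # a real bug and is re-raised.
        if not re.search('out of memory', str(e)):
            raise
        print('Current model size is too big. '
              'Discontinuing training this model to search for other models.')
        return None
```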

haifeng-jin commented 6 years ago

@dhlee-jubilee @rafaelmarconiramos This is expected behavior. It should not affect the search. Thanks.

exowanderer commented 6 years ago

@jhfjhfj1 Could you elaborate, please?

Your statement implies "this error is generated by autokeras," which is obvious, but it does not say whether the error is good or bad.

Does it mean that the search is not finding a good model and that we should stop the algorithm or modify our data?

Or does it mean that the algorithm is working as expected and we should let it continue?

Thank you.

alessandropadrinofficial commented 6 years ago

@jhfjhfj1 I have the same problem as @exowanderer!

malizheng commented 6 years ago

@jhfjhfj1 I get:

Training model 60
Using TensorFlow backend.
Current model size is too big. Discontinuing training this model to search for other models.

The search is stuck on one model (model 60) and always hits this problem (dataset: CIFAR-10).

alpha358 commented 5 years ago

It seems to me that this problem would be solved by adding a model-size constraint to the A* search, i.e., not expanding nodes that represent models that are too large.

I think it could be implemented somewhere around here: autokeras/search.py, line 187:

    if not self.training_queue:
        searched = True

        while new_father_id is None:
            remaining_time = timeout - (time.time() - start_time)
            new_graph, new_father_id = self.bo.optimize_acq(self.search_tree.adj_list.keys(),
                                                            self.descriptors,
                                                            remaining_time)
        new_model_id = self.model_count
        self.model_count += 1
        self.training_queue.append((new_graph, new_father_id, new_model_id))
        self.descriptors.append(new_graph.extract_descriptor())

Is this a reasonable idea?
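The constraint described above could be sketched as a filter applied to candidate graphs before they enter the training queue. This is only an illustration of the idea, not the real autokeras API: candidates are represented here as hypothetical (graph, size) pairs, and the cap is an arbitrary parameter count.

```python
MAX_MODEL_SIZE = 1_000_000  # illustrative cap on parameter count (an assumption)

def prune_oversized(candidates, max_size=MAX_MODEL_SIZE):
    """Drop candidates whose size exceeds the cap, so the search never
    queues (i.e. never "expands") a node representing a too-big model.

    `candidates` is a list of hypothetical (graph, size) pairs standing
    in for the searcher's internal representation."""
    return [(graph, size) for graph, size in candidates if size <= max_size]
```

With such a filter, oversized graphs would never reach training, so the out-of-memory path would not be hit in the first place.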

mlaradji commented 5 years ago

I am facing the same problem, and it seems to be an infinite loop. In search.py (line 219), we have the following:

        except RuntimeError as e:
            if not re.search('out of memory', str(e)):
                raise e
            if self.verbose:
                print('\nCurrent model size is too big. Discontinuing training this model to search for other models.')
            Constant.MAX_MODEL_SIZE = graph.size() - 1
            print(Constant.MAX_MODEL_SIZE) # Not in original file.
            return

As shown above, I added a print(Constant.MAX_MODEL_SIZE), which showed that Constant.MAX_MODEL_SIZE stays the same (because graph.size() does not change) after each occurrence of "Current model size is too big. Discontinuing training this model to search for other models."

I'm unsure whether it's a mistake in the code or a misunderstanding on my part, but it seems that Constant.MAX_MODEL_SIZE should be reduced every time there is not enough memory to load a model, for example by using Constant.MAX_MODEL_SIZE = max(1, round(Constant.MAX_MODEL_SIZE/2)) instead of Constant.MAX_MODEL_SIZE = graph.size() - 1.
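The proposed back-off can be sketched as follows. The propose and fits callables are hypothetical stand-ins for the searcher and the GPU memory check; the point is only that halving the cap, unlike graph.size() - 1, strictly shrinks it even when the searcher keeps proposing a graph of the same size.

```python
def shrink_cap(cap):
    """Halve the model-size cap, never going below 1 (the proposed fix)."""
    return max(1, round(cap / 2))

def search_with_backoff(propose, fits, cap, max_tries=50):
    """Repeatedly propose a model; on a (simulated) OOM, halve the cap.

    `propose(cap)` returns the size of the next candidate under the cap;
    `fits(size)` reports whether that model fits in GPU memory. Both are
    hypothetical stand-ins, not real autokeras functions.
    """
    for _ in range(max_tries):
        size = propose(cap)
        if fits(size):
            return size, cap
        cap = shrink_cap(cap)  # unlike graph.size() - 1, this always shrinks
    return None, cap
```

As the Update below the sketch's anchor comment shows, a shrinking cap alone did not resolve the hang for the commenter, so this addresses the loop but not the underlying stall.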

Update

Using Constant.MAX_MODEL_SIZE = max(1, round(Constant.MAX_MODEL_SIZE/2)) does not work for me. Though the model is (seemingly) loaded into memory successfully, it is not being trained (GPU usage: Mem: 1908/2002 MB; Util: 0%). The output from autokeras shows that it is stuck:

Initializing search.
Initialization finished.

+----------------------------------------------+
|               Training model 0               |
+----------------------------------------------+

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           0            |   24.642928409576417   |         0.2808         |
+--------------------------------------------------------------------------+

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 16777216.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 8388608.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 4194304.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 2097152.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 1048576.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 524288.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+

Current model size is too big. Discontinuing training this model to search for other models.

Reduced Constant.MAX_MODEL_SIZE down to 262144.

+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+
(The search is stuck at this point.)

qmpzzpmq commented 5 years ago

Constant.MAX_MODEL_SIZE = max(1, round(Constant.MAX_MODEL_SIZE/2))

I had this same problem. It seems graph.size() should be reduced to below Constant.MAX_MODEL_SIZE, but it wasn't.

JaeDukSeo commented 5 years ago

very interesting

JaeDukSeo commented 5 years ago

https://www.simonwenkel.com/2018/09/05/autokeras-german-traffic-sign-recognition-benchmark.html maybe this can help