keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

KERAS stops executing randomly while adding first layer inside docker container #14811

Closed rushnaulaziz closed 2 years ago

rushnaulaziz commented 3 years ago

Please go to Stack Overflow for help and support:

https://stackoverflow.com/questions/tagged/keras

If you open a GitHub issue, here is our policy:

  1. It must be a bug, a feature request, or a significant problem with the documentation (for small docs fixes please send a PR instead).
  2. The form below must be filled out.

Here's why we have that policy: Keras developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.


System information

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with:

python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the problem

I have created a classification model using Keras 2.4.3, tensorflow-cpu 2.5.0, and Python 3.9.5. The model works fine on my Windows 10 development environment.

However, when I deploy my code in a Docker container, the code gets stuck. Specifically, it hangs when I add an LSTM (Long Short-Term Memory) layer to the Sequential model.

log.info("Adding LSTM as input layer ")
model.add(LSTM(100,  input_shape=(
        train_x.shape[1:]), return_sequences=False))

The LSTM is the first layer I add, so the code gets stuck right at the start. To be clear, this works fine when I do not use a container and run directly on my Windows 10 laptop.

This behaviour is random (it has happened on the 4th run, on the 20th run, and anywhere in between). The training runs in a separate process created with the standard Python multiprocessing module. Even when the model is stuck, I do not see anything out of the ordinary when I run docker ps.
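
For reference, a minimal, self-contained sketch of the setup being described: the model is built inside a child process created with the standard multiprocessing module. The placeholder data and the build_model helper name are illustrative only, not the real application code.

import multiprocessing

import numpy as np
from tensorflow.keras.layers import LSTM
from tensorflow.keras.models import Sequential

def build_model():
    # Placeholder data with shape (samples, timesteps, features)
    train_x = np.zeros((10, 5, 8))
    model = Sequential()
    # The hang described above happens while adding this first layer
    model.add(LSTM(100, input_shape=train_x.shape[1:], return_sequences=False))
    print("LSTM layer added")

if __name__ == "__main__":
    process = multiprocessing.Process(target=build_model, name="training_process")
    process.start()
    process.join()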

Source code / logs

Model Structure

try:
        log.info("Initializing Sequential Model")
        model = Sequential()

        log.info("Initializing GlorotNormal")
        initializer = initializers.GlorotNormal()

        log.info("Adding LSTM as input layer ")
        model.add(LSTM(100,  input_shape=(
            train_x.shape[1:]), return_sequences=False))

        log.info("Adding hidden dense layer")
        model.add(Dense(64, activation='selu', name="layer2", 
            kernel_initializer=initializer))

        log.info("Adding Dropout")
        model.add(Dropout(rate=0.5))

        log.info("Adding Output layer")
        model.add(Dense(len(intent_tags), activation='softmax', name="layer3"))

        log.info("Generating model Summary")
        model.summary()

        log.info("Compiling model")
        model.compile(loss='categorical_crossentropy', optimizer=
           tf.keras.optimizers.Adamax(learning_rate=0.005), metrics=['accuracy'])

        log.info("Model Compiled succesfully")

Model fit:

model: Sequential = create_training_model(train_x, train_y, intent_tags)
log.info("Model Created")

add_into_queue: LambdaCallback = LambdaCallback(on_epoch_end=lambda epoch,_: queue.put({"type": "progress", "sub_type": "training_progress", "progress": f'EPOCHS: {epoch+1}/{configuration_epochs}'}))
es: EarlyStopping = EarlyStopping(monitor='loss', mode='min',verbose=1, patience=30, min_delta=1)
log.info("fitting Training")

history: object = model.fit(train_x, train_y, epochs=200, batch_size=5,
                  verbose=1, validation_data=(test_x, test_y), 
                  callbacks=[es, add_into_queue])

if es.stopped_epoch:
      training_completed_message: str = f"Training completed {es.stopped_epoch}/{configuration_epochs} Epoch, Early Stopping applied"
      log.info(training_completed_message)

      progress_data: dict = {"type": "progress", "sub_type":"training_completed"  , "progress": str(training_completed_message)}
      queue.put(progress_data)

else:
      progress_data: dict = {"type": "progress", "sub_type": "training_completed", "progress": str(configuration_epochs)}
      queue.put(progress_data)

Fastapi websocket code snippet for training model:

try:
    configuration["TRAINING_COUNT"] +=1
    log.info(f"Training Count: {configuration['TRAINING_COUNT']}")
    log.info("Starts training on seprate procces")
    multi_process = Process(target=chatbot_training, args=(qestions_answers, training_type, client_id, saved_file_path, queue), name=f"training_process_{client_id}")
    multi_process.start()

   log.info("Initializing thread to send training progress")
   data_progress_thread = threading.Thread(target = send_data_progress_call, args=[websocket, queue] , name="data_progress_thread")
   data_progress_thread.daemon = True
   data_progress_thread.start()
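
One detail worth noting (an observation added here, not something from the original report): inside the Linux-based container, multiprocessing creates child processes with the default "fork" start method, whereas Windows always uses "spawn", and forking a process that has already initialized TensorFlow can lead to intermittent hangs. A possible workaround sketch is to request a "spawn" context explicitly; start_training_process below is a hypothetical helper, not the project's actual code.

import multiprocessing

def start_training_process(chatbot_training, args, client_id):
    # Use an explicit "spawn" context so the child does not inherit
    # TensorFlow state from the parent process via fork
    ctx = multiprocessing.get_context("spawn")
    process = ctx.Process(target=chatbot_training, args=args,
                          name=f"training_process_{client_id}")
    process.start()
    return process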

Dockerfile

FROM python:3.9.5-slim-buster
COPY ./ /app
WORKDIR /app
RUN pip install -r requirements.txt && \
    python -m nltk.downloader punkt && \
    python -m nltk.downloader wordnet && \
    python -m nltk.downloader averaged_perceptron_tagger && \
    python -m pip cache purge
ENV PYTHONHASHSEED=100
CMD ["python", "./starfighter/app.py"]

Docker container logs: (attached)

Result of docker stats [container_name]: (attached)

Result of docker top [container_name]: (attached)

Logs on development environment: (attached image)

Steps to reproduce:

Train the model 20-40 times to reproduce the error; to save time, use a small dataset. A sketch of such a reproduction loop follows below.
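
A hedged sketch of how that reproduction loop might look (run_training stands in for the actual training entry point and is illustrative only); it flags any attempt that does not finish within a timeout:

import multiprocessing

def reproduce(run_training, attempts=40, timeout_s=600):
    for attempt in range(attempts):
        process = multiprocessing.Process(target=run_training, name=f"repro_{attempt}")
        process.start()
        process.join(timeout_s)
        if process.is_alive():
            # Training never returned within the timeout: likely the hang described above
            print(f"attempt {attempt}: appears hung, terminating")
            process.terminate()
            process.join()
        else:
            print(f"attempt {attempt}: finished with exit code {process.exitcode}")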

Environment information

Server OS = CentOS 7
docker base image = python:3.9.5-slim-buster
Python Version = 3.9.5
tensorflow-cpu==2.5.0
keras==2.4.3
nltk==3.5
pyspellchecker==0.6.2
pandas==1.2.4
fastapi==0.65.1
aiofiles==0.7.0
openpyxl==3.0.7
websockets==9.0.2
numpy==1.19.5
strictyaml
uvicorn==0.13.4
PyYAML==5.4.1
sushreebarsa commented 2 years ago

@rushnaulaziz Sorry for the late response! In order to expedite the troubleshooting process, please provide a standalone code snippet that reproduces the issue reported here. Could you also try executing your code with the latest TF v2.6.0, refer to the similar issue1 and issue2, and let us know if that helps? Thanks!
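
(Not part of the maintainer's comment, just a small sketch added for illustration: logging the exact versions inside the container before training makes it easy to compare runs on 2.5.0 and 2.6.0.)

import tensorflow as tf

# Print the exact TensorFlow/Keras versions running inside the container
print("TensorFlow:", tf.version.VERSION, "git:", tf.version.GIT_VERSION)
print("Keras bundled with TF:", tf.keras.__version__)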

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 2 years ago

Closing as stale. Please reopen if you'd like to work on this further.