ourresearch / openalex-topic-classification


Corrupted or malformed citation_part_only.keras file #2

Closed: DavidGeorge528 closed this issue 4 months ago

DavidGeorge528 commented 4 months ago

I'm trying to run the topic classifier myself using the provided code and downloaded model files. However, while everything else loads and runs properly, the load-weights call (below) crashes with an error that suggests the file being loaded is corrupted. https://github.com/ourresearch/openalex-topic-classification/blob/e91c2f45ef66611f438447ea29ae6b5f03b7d2f6/v1/003_Deployment/model_to_api/container/topic_classifier/predictor.py#L538

[2024-05-01 09:25:07 +0100] [16753] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/Users/user/.venv/lib/python3.11/site-packages/gunicorn/arbiter.py", line 609, in spawn_worker
    worker.init_process()
  File "/Users/user.venv/lib/python3.11/site-packages/gunicorn/workers/ggevent.py", line 147, in init_process
    super().init_process()
  File "/Users/user/.venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.load_wsgi()
  File "/Users/user.venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
    self.wsgi = self.app.wsgi()
                ^^^^^^^^^^^^^^^
  File "/Users/user/.venv/lib/python3.11/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
                    ^^^^^^^^^^^
  File "/Users/user.venv/lib/python3.11/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.venv/lib/python3.11/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.venv/lib/python3.11/site-packages/gunicorn/util.py", line 371, in import_app
    mod = importlib.import_module(module)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/georged/.pyenv/versions/3.11.2/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/user/oa_model/wsgi.py", line 4, in <module>
    myapp.start_api(app)
  File "/Users/user/oa_model/predictor.py", line 799, in start_api
    pred_model = create_model(len(target_vocab), 
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/oa_model/predictor.py", line 538, in create_model
    model.load_weights(model_chkpt)
  File "/Users/user/.venv/lib/python3.11/site-packages/tensorflow/python/keras/engine/training.py", line 2340, in load_weights
    with h5py.File(filepath, 'r') as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.venv/lib/python3.11/site-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.venv/lib/python3.11/site-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

I have double-checked that the file being loaded is correct and have re-downloaded the file multiple times from here to make sure it downloaded correctly. Each time I get the same error, which, after searching online, suggests that the file is corrupted. Is there any way someone can verify whether this is the case and provide an uncorrupted file?
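For reference, "file signature not found" means h5py did not see the HDF5 magic bytes at the start of the file. Here is a quick way to inspect them yourself (a minimal sketch; the path matches my local layout):

# HDF5 files start with the signature b"\x89HDF\r\n\x1a\n"; zip archives start with b"PK".
with open("model_checkpoint/citation_part_only.keras", "rb") as f:
    print(f.read(8))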

Thanks in advance

jpbarrett13 commented 4 months ago

Hi David, what OS are you using, and which version of tensorflow do you have installed? I believe I have seen this error before, but that was because I had an earlier version of tensorflow installed and the file format was not compatible with other versions of tensorflow. I can look into this further to confirm once you answer these questions.
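If it helps, running something like this and pasting the output would tell me everything I need (a quick sketch):

import sys

import h5py
import tensorflow as tf

# Report the interpreter and package versions relevant to the loading path.
print(sys.version)
print("tensorflow:", tf.__version__)
print("h5py:", h5py.__version__)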

DavidGeorge528 commented 4 months ago

Hi Justin, I've tried it locally on macOS and remotely on AWS on an Ubuntu 20.04 server. On both I've tried the latest tensorflow as well as 2.13, as per https://github.com/ourresearch/openalex-topic-classification/blob/main/v1/requirements.txt. Neither works for me. Thanks in advance for looking into it.

DavidGeorge528 commented 4 months ago

The error seems to come from h5py, which is what TF uses to load the file. I have h5py 3.11.0 installed, which is what tf 2.13 pulled in as a dependency.

jpbarrett13 commented 4 months ago

I have uploaded a new keras file to the Zenodo link: https://zenodo.org/records/11221637

Please try that, and if it doesn't work, let me know. I do not think it is going to solve the issue, but I want to try the easiest thing first before I dig into it further.

DavidGeorge528 commented 4 months ago

Hi, thanks for the quick reply. Unfortunately that didn't work either. Perhaps it's the h5py version?
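To check whether h5py itself can open the file, independent of TF, I can also try it directly (a minimal sketch; the path matches my layout):

import h5py

# This raises the same OSError if h5py can't parse the file as HDF5.
with h5py.File("model_checkpoint/citation_part_only.keras", "r") as f:
    print(list(f.keys()))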

jpbarrett13 commented 4 months ago

That is potentially an issue; I have h5py==3.10.0 listed in the requirements file. I can look into this soon and get back to you with what I find.

DavidGeorge528 commented 4 months ago

I've tried it with h5py==3.10.0 and I'm still getting the same error, unfortunately.

jpbarrett13 commented 4 months ago

One other quick question before I look into it: are you trying to set up your own Docker container, or are you just taking the predict.py code and trying to run that on its own?

DavidGeorge528 commented 4 months ago

Initially I was trying to run the code as-is in Docker. But when I hit some of the issues, I simplified things down to just the create_model function and tried running it locally on my Mac (and even in a simpler Docker container that just loads the model and nothing else).

jpbarrett13 commented 4 months ago

So unfortunately I am unable to reproduce your error. I reduced the code down to the minimum needed to load the model, and I used the exact file I uploaded to Zenodo:

import os
import pickle
import tensorflow as tf

prefix = './model_artifacts/'
model_path = os.path.join(prefix, 'model')

#### Load the needed files
with open(os.path.join(model_path, "target_vocab.pkl"), "rb") as f:
    target_vocab = pickle.load(f)

print("Loaded target vocab")

with open(os.path.join(model_path, "citation_feature_vocab.pkl"), "rb") as f:
    citation_feature_vocab = pickle.load(f)

print("Loaded citation features vocab.")

#### Load the model

def create_model(num_classes, emb_table_size, model_chkpt, topk=5):
    """Create the full model.

    Inputs:
    num_classes: number of classes
    emb_table_size: size of embedding table
    model_chkpt: path to model checkpoint
    topk: number of predictions to return

    Output:
    model: full model
    """

    # Inputs
    citation_0 = tf.keras.layers.Input((16,), dtype=tf.int64, name='citation_0')
    citation_1 = tf.keras.layers.Input((128,), dtype=tf.int64, name='citation_1')
    journal = tf.keras.layers.Input((384,), dtype=tf.float32, name='journal_emb')
    language_model_output = tf.keras.layers.Input((512, 768,), dtype=tf.float32, name='lang_model_output')

    # Create a multi-class classification model using functional API
    pooled_language_model_output = tf.keras.layers.GlobalAveragePooling1D()(language_model_output)
    citation_emb_layer = tf.keras.layers.Embedding(input_dim=emb_table_size, output_dim=256, mask_zero=True, 
                                                   trainable=True, name='citation_emb_layer')

    citation_0_emb = citation_emb_layer(citation_0)
    citation_1_emb = citation_emb_layer(citation_1)

    pooled_citation_0 = tf.keras.layers.GlobalAveragePooling1D()(citation_0_emb)
    pooled_citation_1 = tf.keras.layers.GlobalAveragePooling1D()(citation_1_emb)

    concat_data = tf.keras.layers.Concatenate(name='concat_data', axis=-1)([pooled_language_model_output, pooled_citation_0, 
                                                                            pooled_citation_1, journal])

    # Dense layer 1
    dense_output = tf.keras.layers.Dense(2048, activation='relu', kernel_regularizer='L2', name="dense_1")(concat_data)
    dense_output = tf.keras.layers.Dropout(0.20, name="dropout_1")(dense_output)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_1")(dense_output)

    # Dense layer 2
    dense_output = tf.keras.layers.Dense(1024, activation='relu', kernel_regularizer='L2', name="dense_2")(dense_output)
    dense_output = tf.keras.layers.Dropout(0.20, name="dropout_2")(dense_output)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_2")(dense_output)

    # Dense layer 3
    dense_output_l3 = tf.keras.layers.Dense(512, activation='relu', kernel_regularizer='L2', name="dense_3")(dense_output)
    dense_output = tf.keras.layers.Dropout(0.20, name="dropout_3")(dense_output_l3)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_3")(dense_output)

    output_layer = tf.keras.layers.Dense(num_classes, activation='sigmoid', name='output_layer')(dense_output)
    topk_outputs = tf.math.top_k(output_layer, k=topk)

    model = tf.keras.Model(inputs=[citation_0, citation_1, journal, language_model_output], 
                           outputs=topk_outputs)

    model.load_weights(model_chkpt)
    model.trainable = False

    return model

pred_model = create_model(len(target_vocab), 
                          len(citation_feature_vocab)+2,
                          os.path.join(model_path, "model_checkpoint/citation_part_only.keras"), topk=3)

pred_model.summary()

With the above code and the file I uploaded to Zenodo, the model loaded successfully. So I am assuming that narrows it down to a package. The code above was run on an AWS EC2 instance in a conda env (Python 3.10).

DavidGeorge528 commented 4 months ago

Hi, thanks for the minimal example. After running it in a fresh EC2 instance it worked fine, and I then tried it locally on my Mac, where it also worked. So I compared the differences and found that the tweaks I'd made to the imports (to follow coding standards) actually had implementation impacts. Below is my modified version of your code, where I refactored tf.keras.layers to just layers using from tensorflow.python.keras import layers, which typically wouldn't have any implementation impact, but in this case it breaks the code.

from pathlib import Path

import tensorflow as tf
from tensorflow.python import keras
from tensorflow.python.keras import layers

def create_model(num_classes: int, emb_table_size: int, model_chkpt: Path, topk: int = 5) -> keras.Model:
    """
    Function to create full model.

    Input:
    num_classes: number of classes
    emb_table_size: size of embedding table
    model_chkpt: path to model checkpoint
    topk: number of predictions to return

    Output:
    model: full model
    """
    # Inputs
    citation_0 = layers.Input((16,), dtype=tf.int64, name="citation_0")
    citation_1 = layers.Input((128,), dtype=tf.int64, name="citation_1")
    journal = layers.Input((384,), dtype=tf.float32, name="journal_emb")
    language_model_output = layers.Input((512, 768), dtype=tf.float32, name="lang_model_output")

    # Create a multi-class classification model using functional API
    pooled_language_model_output = layers.GlobalAveragePooling1D()(language_model_output)
    citation_emb_layer = layers.Embedding(input_dim=emb_table_size, output_dim=256, mask_zero=True, trainable=True, name="citation_emb_layer")

    citation_0_emb = citation_emb_layer(citation_0)
    citation_1_emb = citation_emb_layer(citation_1)

    pooled_citation_0 = layers.GlobalAveragePooling1D()(citation_0_emb)
    pooled_citation_1 = layers.GlobalAveragePooling1D()(citation_1_emb)

    concat_data = layers.Concatenate(name="concat_data", axis=-1)([pooled_language_model_output, pooled_citation_0, pooled_citation_1, journal])

    # Dense layer 1
    dense_output = layers.Dense(2048, activation="relu", kernel_regularizer="L2", name="dense_1")(concat_data)
    dense_output = layers.Dropout(0.20, name="dropout_1")(dense_output)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_1")(dense_output)

    # Dense layer 2
    dense_output = layers.Dense(1024, activation="relu", kernel_regularizer="L2", name="dense_2")(dense_output)
    dense_output = layers.Dropout(0.20, name="dropout_2")(dense_output)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_2")(dense_output)

    # Dense layer 3
    dense_output_l3 = layers.Dense(512, activation="relu", kernel_regularizer="L2", name="dense_3")(dense_output)
    dense_output = layers.Dropout(0.20, name="dropout_3")(dense_output_l3)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_3")(dense_output)

    output_layer = layers.Dense(num_classes, activation="sigmoid", name="output_layer")(dense_output)
    topk_outputs = tf.math.top_k(output_layer, k=topk)

    model = keras.Model(inputs=[citation_0, citation_1, journal, language_model_output], outputs=topk_outputs)

    model.load_weights(model_chkpt.as_posix())
    model.trainable = False

    return model

if __name__ == "__main__":
    model = create_model(4521, 6008, Path("oa_artifacts") / "model_checkpoint" / "citation_part_only.keras")
    print(model.summary())

This results in the above error, OSError: Unable to open file (file signature not found). I wonder if you can reproduce the same error using my code?

If I edit the code back to using tf.keras.layers like below, the error goes away.

from pathlib import Path

import tensorflow as tf

def create_model(num_classes: int, emb_table_size: int, model_chkpt: Path, topk: int = 5) -> tf.keras.Model:
    """
    Function to create full model.

    Input:
    num_classes: number of classes
    emb_table_size: size of embedding table
    model_chkpt: path to model checkpoint
    topk: number of predictions to return

    Output:
    model: full model
    """
    # Inputs
    citation_0 = tf.keras.layers.Input((16,), dtype=tf.int64, name="citation_0")
    citation_1 = tf.keras.layers.Input((128,), dtype=tf.int64, name="citation_1")
    journal = tf.keras.layers.Input((384,), dtype=tf.float32, name="journal_emb")
    language_model_output = tf.keras.layers.Input((512, 768), dtype=tf.float32, name="lang_model_output")

    # Create a multi-class classification model using functional API
    pooled_language_model_output = tf.keras.layers.GlobalAveragePooling1D()(language_model_output)
    citation_emb_layer = tf.keras.layers.Embedding(
        input_dim=emb_table_size, output_dim=256, mask_zero=True, trainable=True, name="citation_emb_layer"
    )

    citation_0_emb = citation_emb_layer(citation_0)
    citation_1_emb = citation_emb_layer(citation_1)

    pooled_citation_0 = tf.keras.layers.GlobalAveragePooling1D()(citation_0_emb)
    pooled_citation_1 = tf.keras.layers.GlobalAveragePooling1D()(citation_1_emb)

    concat_data = tf.keras.layers.Concatenate(name="concat_data", axis=-1)(
        [pooled_language_model_output, pooled_citation_0, pooled_citation_1, journal]
    )

    # Dense layer 1
    dense_output = tf.keras.layers.Dense(2048, activation="relu", kernel_regularizer="L2", name="dense_1")(concat_data)
    dense_output = tf.keras.layers.Dropout(0.20, name="dropout_1")(dense_output)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_1")(dense_output)

    # Dense layer 2
    dense_output = tf.keras.layers.Dense(1024, activation="relu", kernel_regularizer="L2", name="dense_2")(dense_output)
    dense_output = tf.keras.layers.Dropout(0.20, name="dropout_2")(dense_output)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_2")(dense_output)

    # Dense layer 3
    dense_output_l3 = tf.keras.layers.Dense(512, activation="relu", kernel_regularizer="L2", name="dense_3")(dense_output)
    dense_output = tf.keras.layers.Dropout(0.20, name="dropout_3")(dense_output_l3)
    dense_output = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layer_norm_3")(dense_output)

    output_layer = tf.keras.layers.Dense(num_classes, activation="sigmoid", name="output_layer")(dense_output)
    topk_outputs = tf.math.top_k(output_layer, k=topk)

    model = tf.keras.Model(inputs=[citation_0, citation_1, journal, language_model_output], outputs=topk_outputs)

    model.load_weights(model_chkpt.as_posix())
    model.trainable = False

    return model

if __name__ == "__main__":
    model = create_model(4521, 6008, Path("oa_artifacts") / "model_checkpoint" / "citation_part_only.keras")
    print(model.summary())

Strange behaviour. But at least it's solved.

DavidGeorge528 commented 4 months ago

Ok, so after a bit more digging, changing

from tensorflow.python import keras
from tensorflow.python.keras import layers

to

import keras
from keras import layers

makes the error go away in my code example above. I'm not sure why, but it fixes the issue while letting me follow the import guidelines. Thanks for your time, Justin, and apologies for the rabbit hole we had to go down.
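For anyone who finds this later: the two import paths really do resolve to different classes, which I assume is why load_weights behaves differently between them (a quick check; results are from my setup with TF 2.13):

import keras  # the standalone Keras package
import tensorflow as tf
from tensorflow.python import keras as legacy_keras  # TF's internal legacy copy

# On my setup, tf.keras re-exports the standalone keras package,
# while tensorflow.python.keras is an older internal fork.
print(tf.keras.Model is keras.Model)         # True here
print(tf.keras.Model is legacy_keras.Model)  # False here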

jpbarrett13 commented 4 months ago

So I think you actually want to do this:

from tensorflow import keras
from tensorflow.keras import layers

Not sure where you got the "python", but this should work. I don't think you should be importing directly from keras; everything should be imported from tensorflow.

DavidGeorge528 commented 4 months ago

Hi, the reason I used python in the import is that VSCode's Pylance linter complains if I don't add it; see the screenshot below.

[Screenshot: Pylance warning on the tensorflow.keras import]

But it's fine; importing directly from keras fixes the issue too.