umyelab / LabGym

Quantify user-defined behaviors.
GNU General Public License v3.0

issue with training: untrained model has zero metrics #81

Closed: delaroob closed this issue 9 months ago

delaroob commented 9 months ago

Hello everyone,

I am trying to train a behaviour classifier to recognize a specific behavior (named "levego" in the project) without a Detector. However, something does not seem to work during training, and I can't figure out what the problem is. I followed the user guide (or at least I think I did :D), and I documented every step below. I am working on Windows 10 and created a venv named labgym, where I installed all the packages (see the list below).

I successfully labeled some frames (around 100), and judging by the output, the data preparation also went well. During training, however, even at the very first epoch, the ETA, accuracy, and loss metrics all show as 0. As a result, training triggers early stopping after the 5th epoch.

I would appreciate any help (and thank you so much for the code!). Please let me know if any additional information is needed to resolve this issue.

Package list in venv named labgym:

absl-py                      2.0.0
astunparse                   1.6.3
cachetools                   5.3.2
certifi                      2023.7.22
charset-normalizer           3.3.2
contourpy                    1.2.0
cycler                       0.12.1
et-xmlfile                   1.1.0
filelock                     3.13.1
flatbuffers                  23.5.26
fonttools                    4.44.0
fsspec                       2023.10.0
gast                         0.5.4
google-auth                  2.23.4
google-auth-oauthlib         1.1.0
google-pasta                 0.2.0
grpcio                       1.59.2
h5py                         3.10.0
idna                         3.4
imageio                      2.32.0
Jinja2                       3.1.2
joblib                       1.3.2
keras                        2.15.0
kiwisolver                   1.4.5
LabGym                       2.2.2
lazy_loader                  0.3
libclang                     16.0.6
Markdown                     3.5.1
MarkupSafe                   2.1.3
matplotlib                   3.8.1
ml-dtypes                    0.2.0
mpmath                       1.3.0
networkx                     3.2.1
numpy                        1.26.2
oauthlib                     3.2.2
opencv-contrib-python        4.8.1.78
opencv-python                4.8.1.78
openpyxl                     3.1.2
opt-einsum                   3.3.0
packaging                    23.2
pandas                       2.1.3
pathlib                      1.0.1
patsy                        0.5.3
Pillow                       10.0.1
pip                          23.3.1
protobuf                     4.23.4
pyasn1                       0.5.0
pyasn1-modules               0.3.0
pyparsing                    3.1.1
python-dateutil              2.8.2
pytz                         2023.3.post1
requests                     2.31.0
requests-oauthlib            1.3.1
rsa                          4.9
scikit-image                 0.22.0
scikit-learn                 1.3.2
scikit-posthocs              0.8.0
scipy                        1.11.3
seaborn                      0.13.0
setuptools                   58.1.0
six                          1.16.0
statsmodels                  0.14.0
sympy                        1.12
tensorboard                  2.15.1
tensorboard-data-server      0.7.2
tensorflow                   2.15.0
tensorflow-estimator         2.15.0
tensorflow-intel             2.15.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor                    2.3.0
threadpoolctl                3.2.0
tifffile                     2023.9.26
torch                        2.1.0
torchvision                  0.16.0
typing_extensions            4.8.0
tzdata                       2023.3
urllib3                      2.1.0
Werkzeug                     3.0.1
wheel                        0.41.3
wrapt                        1.14.1
wxPython                     4.2.1
XlsxWriter                   3.1.9

File structure:

labgym_makrop_1
│   male4_cropped_6-8_83-85_121-123_188-189_234-237_processed.avi
│   testing_reports.xlsx
│
├───examples
│   └───0
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_100_len15_std0.avi
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_100_len15_std0.jpg
    (… etc …)
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_99_len15_std0.avi
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_99_len15_std0.jpg
│
├───male4_cropped_6-8_83-85_121-123_188-189_234-237_processed
│   │   background.jpg
│   │   background_high.jpg
│   │   background_low.jpg
│   │
│   └───levego
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_182_len15_std0.avi
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_182_len15_std0.jpg
    (… etc …)
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_94_len15_std0.avi
│           male4_cropped_6-8_83-85_121-123_188-189_234-237_processed_0_94_len15_std0.jpg
│
└───prepared
        0_levego.avi
        0_levego.jpg
    (… etc …)
        100_levego.avi
        100_levego.jpg       
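Given the prepared folder layout above, one quick check before training is to count the prepared examples per behavior category. The helper below is a minimal sketch that assumes the "<index>_<category>.avi" naming visible in the listing (e.g. 0_levego.avi); that convention is inferred from the file tree, and the function names count_categories and count_categories_in_folder are hypothetical, not part of LabGym.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_categories(filenames: Iterable[str], suffix: str = ".avi") -> Counter:
    """Count prepared examples per behavior category.

    Assumes the '<index>_<category><suffix>' naming seen in the 'prepared'
    folder (e.g. '0_levego.avi'); this is inferred from the file listing,
    not taken from LabGym's documentation.
    """
    counts = Counter()
    for name in filenames:
        path = Path(name)
        if path.suffix.lower() != suffix:
            continue  # skip the .jpg pattern images and anything else
        # Split only on the first underscore so category names that
        # themselves contain underscores stay intact.
        index, sep, category = path.stem.partition("_")
        if sep and index.isdigit():
            counts[category] += 1
    return counts


def count_categories_in_folder(folder: str, suffix: str = ".avi") -> Counter:
    """Apply count_categories to every file directly inside `folder`."""
    names = (p.name for p in Path(folder).iterdir() if p.is_file())
    return count_categories(names, suffix=suffix)


# Example with file names like those in the listing (only one category).
counts = count_categories(["0_levego.avi", "0_levego.jpg", "100_levego.avi"])
print(counts)  # Counter({'levego': 2})
if len(counts) < 2:
    print("Only one behavior category found; add at least one more, e.g. 'background'.")
```

Pointed at the prepared folder above, this would report only "levego", because every prepared file in the listing belongs to that single category.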

Workflow with terminal outputs:

Initializing

C:\mentes>cd labgym
C:\mentes\labgym>Scripts\activate
(labgym) C:\mentes\labgym>python
Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from LabGym import gui
2023-12-01 20:26:34.430940: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING:tensorflow:From C:\mentes\labgym\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
You need to install Detectron2 to use the Detector module in LabGym:
https://detectron2.readthedocs.io/en/latest/tutorials/install.html
You need to install Detectron2 to use the Detector module in LabGym:
https://detectron2.readthedocs.io/en/latest/tutorials/install.html
>>> gui.gui()
The user interface initialized!

Preparing data and labeling

Included body parts and background too, selected STD 0, and started generating behaviour examples.

Did not resize the frames

...training

The output here looked quite suspicious (ETA 0 s, loss 0, accuracy 0, and stopping after just 5 epochs…)

[Screenshot: training output]

But I wanted to try the model anyway, since at this point I had no idea what else to do. Right now I am not even sure where to look.

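The symptom reported above, a loss stuck at 0 from the first epoch and training halting after the 5th epoch, is consistent with patience-based early stopping on a loss that never improves. The sketch below is a plain-Python approximation of the rule in Keras' EarlyStopping callback when monitoring a loss in "min" mode. The thread does not say which early-stopping settings LabGym uses, so patience=4 is a hypothetical value chosen only because it reproduces the 5-epoch stop, and stopping_epoch is an illustrative name.

```python
import math
from typing import Sequence


def stopping_epoch(val_losses: Sequence[float], patience: int, min_delta: float = 0.0) -> int:
    """Return the 1-based epoch at which patience-based early stopping halts.

    An epoch counts as an improvement only if the loss drops below the best
    value so far by more than `min_delta`; training stops once `patience`
    consecutive epochs pass without improvement. If that never happens,
    all epochs run and the last epoch number is returned.
    """
    best = math.inf
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # new best: reset the patience counter
            wait = 0
        else:
            wait += 1  # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)


# A loss that is already 0 at epoch 1 can never improve afterwards,
# so training stops as soon as the patience budget runs out.
print(stopping_epoch([0.0] * 20, patience=4))  # 5
```

With a steadily decreasing loss the same function returns the full epoch count, so an early stop right after the patience window, combined with a loss of exactly 0, points at the training data rather than at the optimizer.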

yujiahu415 commented 9 months ago

Hi,

You need to have at least 2 categories of behaviors, for example, 'levego' and 'background', to let the Categorizer learn something. Let me know if this solves your issue. Thanks!
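To see why a single category leaves the Categorizer nothing to learn, here is a short NumPy sketch. It assumes a softmax output trained with categorical cross-entropy, a standard classifier setup; the thread does not state exactly which loss LabGym uses, and the helper names are illustrative.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy_and_grad(logits: np.ndarray, target: int):
    """Categorical cross-entropy for one sample and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = np.log(1.0 / probs[target])  # -log(p), written as log(1/p)
    one_hot = np.zeros_like(probs)
    one_hot[target] = 1.0
    grad = probs - one_hot  # standard softmax + cross-entropy gradient
    return float(loss), grad


# One category ('levego' only): softmax of a single logit is always 1.0,
# so the loss and its gradient are both exactly zero for any weights.
for z in (-3.0, 0.0, 5.0):
    loss, grad = cross_entropy_and_grad(np.array([z]), target=0)
    print(z, loss, grad)

# Two categories ('levego' vs 'background'): loss and gradient are nonzero.
loss, grad = cross_entropy_and_grad(np.array([0.2, -0.1]), target=0)
print(loss, grad)
```

With zero gradient the weights never change, so adding a second category such as "background" is what gives training a signal to learn from.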

delaroob commented 9 months ago

Hello,

Thank you for the quick reply; I'm sure it will work. It has been augmenting training examples for 4 hours now, but I guess that is normal since I'm working on a CPU. Is there any way to move the training part to Colab or something similar?

yujiahu415 commented 9 months ago

Hi,

The step of augmenting training examples uses only the CPU and RAM (memory). It may be taking a long time because memory is running low. Since you seem to be using Windows, you could increase virtual memory (https://www.windowscentral.com/how-change-virtual-memory-size-windows-10) by placing it on a drive other than C:\ (such as D:) that has plenty of free space (>100 GB). This gives you more virtual RAM and should help increase the speed.
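Before moving virtual memory to another drive, it can help to confirm the drive actually has the suggested >100 GB free. This is a minimal standard-library sketch; has_enough_free_space is a hypothetical helper, 1 GB is taken as 10**9 bytes, and it only checks free space. The paging file itself still has to be configured in Windows' virtual memory settings, as described in the linked guide.

```python
import shutil


def has_enough_free_space(path: str, min_gb: float = 100.0) -> bool:
    """Return True if the drive holding `path` has at least `min_gb` GB free (1 GB = 10**9 bytes)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_gb * 10**9


# Checks the drive of the current directory; on Windows you could pass e.g. "D:\\" instead.
print(has_enough_free_space("."))
```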

The next step is training, which can use a GPU for acceleration. If you are working on a CPU and are not satisfied with the speed, you could further reduce the input shape from the current 32 to 16. The behavior classification task seems easy in your case (you only have two behavior categories), so 16 might already give you good accuracy.
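A rough sense of what reducing the input shape from 32 to 16 saves: halving the side length quarters the number of pixels per frame. The sketch below assumes each example is a float32 array of shape (length, side, side, 3) and that "len15" in the example file names is the clip length in frames. LabGym's actual internal representation may differ, and input_size_bytes is a hypothetical helper.

```python
def input_size_bytes(length: int, side: int, channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes for one example stored as `length` frames of side x side pixels.

    Assumes a float32 (4-byte) array of shape (length, side, side, channels).
    """
    return length * side * side * channels * bytes_per_value


full = input_size_bytes(length=15, side=32)  # input shape 32
half = input_size_bytes(length=15, side=16)  # input shape 16
print(full, half, full / half)  # 184320 46080 4.0
```

Under these assumptions both the memory per example and the work done by each convolution layer scale roughly with the pixel count, so going from 32 to 16 cuts them by about 4x.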

The GUI does not work on Colab. To run LabGym on Colab, you need to import and run the LabGym functions one by one. We don't currently have documentation on scripting LabGym functions, but we will have it ready in the near future.

yujiahu415 commented 9 months ago

By the way, I saw that you chose 'Categorizer with both Animation Analyzer and Pattern Recognizer' as the Categorizer type. You could instead try 'Categorizer with only Pattern Recognizer' when specifying the type of the Categorizer. This will significantly increase the processing speed at the augmentation, training, and analysis steps.

delaroob commented 9 months ago

Okay, so I tried mounting more virtual memory on a D: drive with about 200 GB of free storage, but it says that "the computer is fast enough and it is unlikely to provide additional benefit" (also, it's a portable hard drive, so I'm not even sure it would work), so I guess the 9 GB of RAM will have to do the heavy lifting :).

Anyway, I changed the input shape to 16 and chose the Categorizer with only Pattern Recognizer as you suggested, and it has already finished augmenting, so I hope the rest will also go well. Thank you so much for your help!