Issue running GenderTraining program on Ubuntu

MisterMcDuck commented 2 years ago

Hello,

I attempted to follow the instructions provided at https://github.com/takuya-takeuchi/DlibDotNet/wiki/Tutorial-for-Linux and https://github.com/takuya-takeuchi/FaceRecognitionDotNet/tree/master/tools/GenderTraining

to train a gender model as specified. I compiled everything with CUDA support, and I can confirm that CUDA works, as I've previously trained dlib networks on this machine.

I always specified 64/desktop cuda 112 when building the libraries. However, when I try to run the training program, I receive this error:

ubuntu@ip-172-30-0-90:~/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/bin/x64/Release/netcoreapp2.0$ ls
DlibDotNet.dll  DlibDotNet.xml            GenderTraining.dll  GenderTraining.runtimeconfig.dev.json  libDlibDotNetNativeDnn.so                   libDlibDotNetNativeDnnGenderClassification.so
DlibDotNet.pdb  GenderTraining.deps.json  GenderTraining.pdb  GenderTraining.runtimeconfig.json      libDlibDotNetNativeDnnAgeClassification.so

ubuntu@ip-172-30-0-90:~/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/bin/x64/Release/netcoreapp2.0$ dotnet GenderTraining.dll train -d=/home/ubuntu/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/UTKDataset -b=400 -e=600 -v=20
            Dataset: /home/ubuntu/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/UTKDataset
              Epoch: 600
      Learning Rate: 0.001
  Min Learning Rate: 1E-05
     Min Batch Size: 400
Validation Interval: 20

Start load train images
Load train images: 7824
Start load test images
Load test images: 1954

**************************** FATAL ERROR DETECTED ****************************

Error detected at line 202.
Error detected in file /opt/data/FaceRecognitionDotNet/src/DlibDotNet/src/dlib/dlib/../dlib/dnn/trainer.h.
Error detected in function void dlib::dnn_trainer<net_type, solver_type>::train_one_step(const std::vector<typename net_type::input_type>&, const std::vector<typename net_type::training_label_type>&) [with net_type = dlib::add_loss_layer<dlib::loss_multiclass_log_, dlib::add_layer<dlib::fc_<2ul, (dlib::fc_bias_mode)0u>, dlib::add_layer<dlib::dropout_, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::fc_<512ul, (dlib::fc_bias_mode)0u>, dlib::add_layer<dlib::dropout_, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::fc_<512ul, (dlib::fc_bias_mode)0u>, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::con_<384l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::bn_<(dlib::layer_mode)0u>, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::con_<256l, 5l, 5l, 1, 1, 2, 2>, dlib::add_layer<dlib::bn_<(dlib::layer_mode)0u>, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::con_<96l, 7l, 7l, 4, 4>, dlib::input_rgb_image_sized<227ul>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void> >; solver_type = dlib::sgd; typename net_type::input_type = dlib::matrix<dlib::rgb_pixel>; typename net_type::training_label_type = long unsigned int].

Failing expression was data.size() == labels.size().

******************************************************************************

Aborted (core dumped)

I'm not sure how the two std::vectors could differ in size. If you think it would help, I could try this on Windows, as this is just an Amazon EC2 instance.

Thanks for any advice you can give!

takuya-takeuchi commented 2 years ago

@MisterMcDuck This issue may be related to https://github.com/takuya-takeuchi/DlibDotNet/issues/272. So I think we have to modify the DlibDotNet code.

MisterMcDuck commented 2 years ago

@takuya-takeuchi

Thanks for the link. I can confirm that with a simple modification as done in the linked PR, it's now training.

The modification I made, just for testing:

---- src/GenderClassification/dlib/dnn/loss/multiclass_log/gender/Gender.h ----
index a9daf2d..e779c16 100644
@@ -9,8 +9,8 @@
 #include "defines.h"
 #include "DlibDotNet.Native.Dnn/dlib/dnn/loss/multiclass_log/template.h"

-typedef unsigned long gender_out_type;
-typedef unsigned long gender_train_label_type;
+typedef uint32_t gender_out_type;
+typedef uint32_t gender_train_label_type;

 MAKE_LOSSMULTICLASSLOG_FUNC(gender_train_type,  matrix_element_type::RgbPixel, dlib::rgb_pixel, matrix_element_type::UInt32, gender_train_label_type, 100)

and

---- src/DlibDotNet.Native.Dnn/dlib/dnn/loss/multiclass_log/LossMulticlassLogBase.h ----
index 38d30f3..6f73d7d 100644
@@ -8,8 +8,8 @@

 #include "../LossBase.h"

-typedef unsigned long loss_multiclass_log_out_type;
-typedef unsigned long loss_multiclass_log_train_label_type;
+typedef uint32_t loss_multiclass_log_out_type;
+typedef uint32_t loss_multiclass_log_train_label_type;

 using namespace dlib;
 using namespace std;

and the result:

dotnet GenderTraining.dll train -d /media/chris/DATA/Datasets/UTKDataset/output
            Dataset: /media/chris/DATA/Datasets/UTKDataset/output
              Epoch: 300
      Learning Rate: 0.001
  Min Learning Rate: 1E-05
     Min Batch Size: 256
Validation Interval: 30

Start load train images
Load train images: 7824
Start load test images
Load test images: 1954
step#: 0     learning rate: 0.001  average loss: 0            steps without apparent progress: 0
step#: 5     learning rate: 0.001  average loss: 0.769476     steps without apparent progress: 0
step#: 9     learning rate: 0.001  average loss: 0.769381     steps without apparent progress: 0
step#: 14    learning rate: 0.001  average loss: 0.725918     steps without apparent progress: 7

If I get some time I'll try to put together a PR, but it would need to cover all the cases rather than just this one.

takuya-takeuchi commented 2 years ago

I think we should use uint64_t, because dlib's unsigned long is 64 bits wide (the same size as uint64_t) when it is built on Linux. Otherwise, using uint32_t forces an explicit type conversion.

But you can continue training with your code. This is not a serious problem; at worst it should only produce a compiler warning. Thanks :)

MisterMcDuck commented 2 years ago

I did see issues when keeping UInt32, e.g. half-sized arrays during the validation phase, but worked around them. Out of curiosity I implemented UInt64 support for loss multiclass log, but I think the changes would be breaking for the library, which I expect you would want to avoid. I separated the commits into basic support in std::vector and the breaking changes in loss multiclass log. If you're interested, you can see them here:

https://github.com/takuya-takeuchi/DlibDotNet/compare/master...MisterMcDuck:DlibDotNet:feature/UInt64

I guess overrides could be used, but I don't know how much interest there is in these changes.

takuya-takeuchi commented 2 years ago

@MisterMcDuck Thanks for your contribution, and sorry for the late reply.

I created a new PR from your branch: https://github.com/takuya-takeuchi/DlibDotNet/pull/281

Your change looks good to me :) TBH, I don't mind breaking changes.

I'll try to build and test it on Windows, Linux and macOS.

Thanks a lot.

takuya-takeuchi commented 2 years ago

This should be resolved in 1.3.0.7.

takuya-takeuchi / FaceRecognitionDotNet

Issue running GenderTraining program on Ubuntu #205