rajveerb / lotus

Lotus: Characterization of Machine Learning Preprocessing Pipelines via Framework and Hardware Profiling

C++ implementation for image classification pipeline #3

Open rajveerb opened 1 year ago

rajveerb commented 1 year ago

The goal is to convert the existing implementation in this file to C++.

With the C++ implementation, a finer-granularity analysis can be done using profiling/instrumentation tools such as Intel VTune. It will also let us experiment with memory management policies with finer control than the Python implementation allows. Finally, the C++ dataloader will let us leverage more CPU cores than before.
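
To make the dataloader point concrete, here is a minimal, hypothetical sketch (not code from this repo) of a libtorch dataset driven by a multi-worker `torch::data` loader; the dataset contents, worker count, and batch size are placeholder values:

```cpp
#include <torch/torch.h>
#include <iostream>

// Toy dataset of random "images" and labels, just to exercise the loader.
struct RandomImageDataset : torch::data::datasets::Dataset<RandomImageDataset> {
  explicit RandomImageDataset(size_t size) : size_(size) {}

  torch::data::Example<> get(size_t /*index*/) override {
    // Each example: a 3x224x224 float tensor plus a random class label.
    return {torch::rand({3, 224, 224}), torch::randint(0, 1000, {1}, torch::kLong)};
  }

  torch::optional<size_t> size() const override { return size_; }

  size_t size_;
};

int main() {
  auto dataset = RandomImageDataset(1024).map(torch::data::transforms::Stack<>());

  // workers(8) is the knob that lets the C++ loader spread preprocessing across
  // CPU cores; 8 workers and batch size 256 are placeholders, not project settings.
  auto loader = torch::data::make_data_loader(
      std::move(dataset),
      torch::data::DataLoaderOptions().batch_size(256).workers(8));

  for (auto& batch : *loader) {
    std::cout << batch.data.sizes() << "\n";  // expect [256, 3, 224, 224]
    break;                                    // one batch is enough for the sketch
  }
}
```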

The-Death-Reaper commented 1 year ago

Completed Tasks

All the completed tasks have been recorded in code/cpp_test under the dev_mayur branch. I will raise a PR once I clean up the folder and confirm it can be reproduced on the lab systems.

Pending Tasks:

Look at how to extend the Dataset class for custom datasets and dataloaders, and add functions for the necessary preprocessing (a sketch of one possible approach follows at the end of this comment):

1. Crop
2. Rotate
3. Convert to tensor
4. Normalize

Convert the Python file to train the model entirely in C++.
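
For reference, a minimal sketch of one possible way to extend the Dataset class with OpenCV-based preprocessing. The class name, path/label plumbing, and the fixed crop/rotation parameters are illustrative assumptions, not the pipeline's actual code; the Normalize values mirror the ImageNet constants used later in this thread:

```cpp
#include <torch/torch.h>
#include <opencv2/opencv.hpp>
#include <string>
#include <utility>
#include <vector>

// Hypothetical custom dataset: loads an image, applies crop/rotate, converts it to a
// CHW float tensor, and returns it with its label.
class ImageFolderDataset : public torch::data::datasets::Dataset<ImageFolderDataset> {
 public:
  ImageFolderDataset(std::vector<std::string> paths, std::vector<int64_t> labels)
      : paths_(std::move(paths)), labels_(std::move(labels)) {}

  torch::data::Example<> get(size_t index) override {
    cv::Mat img = cv::imread(paths_[index], cv::IMREAD_COLOR);  // note: BGR order

    // 1. Crop: central 224x224 region (assumes the image is at least that large).
    cv::Rect roi((img.cols - 224) / 2, (img.rows - 224) / 2, 224, 224);
    cv::Mat cropped = img(roi).clone();

    // 2. Rotate: fixed 90-degree rotation as a stand-in for the real transform.
    cv::Mat rotated;
    cv::rotate(cropped, rotated, cv::ROTATE_90_CLOCKWISE);

    // 3. Convert to tensor: HWC uint8 -> CHW float in [0, 1]. A cv::cvtColor to RGB
    //    would be needed to match torchvision's channel order exactly.
    rotated.convertTo(rotated, CV_32FC3, 1.0 / 255.0);
    torch::Tensor data =
        torch::from_blob(rotated.data, {rotated.rows, rotated.cols, 3}, torch::kFloat)
            .permute({2, 0, 1})
            .clone();  // clone so the tensor owns its memory after rotated goes away

    return {data, torch::tensor(labels_[index], torch::kLong)};
  }

  torch::optional<size_t> size() const override { return paths_.size(); }

 private:
  std::vector<std::string> paths_;
  std::vector<int64_t> labels_;
};

// 4. Normalize is chained as a transform (ImageNet mean/std), then Stack collates batches.
inline auto make_dataset(std::vector<std::string> paths, std::vector<int64_t> labels) {
  return ImageFolderDataset(std::move(paths), std::move(labels))
      .map(torch::data::transforms::Normalize<>({0.485, 0.456, 0.406},
                                                {0.229, 0.224, 0.225}))
      .map(torch::data::transforms::Stack<>());
}
```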

The-Death-Reaper commented 1 year ago

Update:

Completed Tasks:

Added functions in C++ for the preprocessing steps listed above (crop, rotate, convert to tensor, normalize).

Checked and confirmed the correctness of these transforms (files added under correctness_checker).

We can now experiment with parameters if needed, with some degree of confidence that the results will match in both languages.

Got a better understanding of how to code an ML model in C++ and, to some extent, of the features available in libtorch.

Pending Tasks:

Train the model entirely in C++ (required for profiling; this will also let us double-check correctness and perhaps comment on the accuracy of models across libraries and languages).
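
Purely as an illustration of the kind of comparison the correctness_checker mentioned above performs (a hedged sketch, not the actual contents of that folder): dump the Python pipeline's output as raw float32 bytes and compare the C++ output against it within a tolerance.

```cpp
#include <torch/torch.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads a raw float32 dump (e.g. written in Python with
// `tensor.numpy().tofile(path)`) and views it with the expected shape.
torch::Tensor load_reference(const std::string& path, torch::IntArrayRef shape) {
  std::ifstream in(path, std::ios::binary);
  std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
  return torch::from_blob(bytes.data(), shape, torch::kFloat).clone();
}

// Compare the C++-preprocessed tensor against the Python reference within a tolerance.
bool matches_reference(const torch::Tensor& cpp_out, const torch::Tensor& py_ref) {
  return torch::allclose(cpp_out, py_ref, /*rtol=*/1e-4, /*atol=*/1e-5);
}
```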

rajveerb commented 1 year ago

I tried to use the code pushed in dev_mayur_kepler but I am facing issues with building the code.

Can you add Makefiles or improve the README to describe the required setup?

I hit linker issues with dcgan and a missing OpenCV library for example-app.

The ability to replicate the same experiments is important for benchmarks; if I cannot replicate them on the same machine, that is a problem.

Note: our machine has CUDA 11.8 and 12.0 installed. I used the libtorch build specific to CUDA 11.8, and CUDA 11.8 itself, when building.

Below are the error logs faced while building the code:

For example-app:

environment variables for building:

PATH contains /usr/local/cuda-11.8/bin

LD_LIBRARY_PATH contains /usr/local/cuda-11.8/lib64

$ cmake -DCMAKE_PREFIX_PATH=/home/rbachkaniwala3/work/ml-pipeline-benchmark/code/cpp_test/libtorch/ ..
...
By not providing "FindOpenCV.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "OpenCV", but
  CMake did not find one.

  Could not find a package configuration file provided by "OpenCV" with any
  of the following names:

    OpenCVConfig.cmake
    opencv-config.cmake

  Add the installation prefix of "OpenCV" to CMAKE_PREFIX_PATH or set
  "OpenCV_DIR" to a directory containing one of the above files.  If "OpenCV"
  provides a separate development package or SDK, be sure it has been
  installed.

For dcgan:


$ cmake -DCMAKE_PREFIX_PATH=/home/rbachkaniwala3/work/ml-pipeline-benchmark/code/cpp_test/libtorch/ ..
...
CMake Warning at CMakeLists.txt:18 (add_executable):
  Cannot generate a safe runtime search path for target dcgan because files
  in some directories may conflict with libraries in implicit directories:

    runtime library [libnvToolsExt.so.1] in /usr/lib/x86_64-linux-gnu may be hidden by files in:
      /usr/local/cuda-11.8/lib64

  Some of these libraries may not be found correctly.

$ make
/usr/bin/ld: /lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/dcgan.dir/build.make:100: dcgan] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/dcgan.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
The-Death-Reaper commented 1 year ago

Agreed. I was writing a script to set everything up but haven't completed and pushed it yet.

With respect to example-app, the issue is with setting up OpenCV for C++. I haven't added that to the README; I will update it.

With respect to dcgan, I will check the CMake file and make it more robust. The error could be due to shared-library version conflicts.

I will take these up as ad-hoc tasks along with the other items.

The-Death-Reaper commented 1 year ago

Update:

Documentation

https://pytorch.org/tutorials/advanced/cpp_export.html

There exist two ways of converting a PyTorch model to Torch Script. The first is known as tracing, a mechanism in which the structure of the model is captured by evaluating it once using example inputs, and recording the flow of those inputs through the model. This is suitable for models that make limited use of control flow.

The second approach is to add explicit annotations to your model that inform the Torch Script compiler that it may directly parse and compile your model code, subject to the constraints imposed by the Torch Script language.
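
For the C++ side of that tutorial, the workflow is roughly: trace and save the model in Python (`torch.jit.trace(...).save(...)`), then load and run it with `torch::jit::load` in C++. A minimal sketch, with a placeholder file name and input shape:

```cpp
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
  // "traced_model.pt" is a placeholder for a module exported from Python via
  //   torch.jit.trace(model, example_input).save("traced_model.pt")
  torch::jit::script::Module module = torch::jit::load("traced_model.pt");
  module.eval();

  // Forward pass on a dummy ImageNet-sized input.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::rand({1, 3, 224, 224}));
  torch::Tensor output = module.forward(inputs).toTensor();

  std::cout << output.sizes() << "\n";
  return 0;
}
```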

The-Death-Reaper commented 1 year ago

@rajveerb @kexinrong

I read about TorchScript tracing and annotations in more detail. Since we are using a standard model and are not adding any new logic through control statements or loops in the model layers, tracing is the recommended option for using TorchScript (per the documentation and general community opinion). Tracing fails only when there are control statements or decision loops inside the model layers.

I was thinking of using tracing to transfer the model to C++ and train/test it there, and I will get started on that (whether we use annotations or tracing, the next steps remain the same anyway). What do you recommend?

rajveerb commented 1 year ago

@The-Death-Reaper

I read about torchscript in the link that you sent before.

Go ahead with torchscript's tracing option.

The-Death-Reaper commented 1 year ago

Update - Oct 4th, 2023

The-Death-Reaper commented 1 year ago

Update - Oct 11th, 2023

rajveerb commented 1 year ago

@The-Death-Reaper

Can you cd into the imagenet dir?

The-Death-Reaper commented 1 year ago

@The-Death-Reaper

Can you cd into the imagenet dir?

No

rajveerb commented 1 year ago

@The-Death-Reaper

Try again

The-Death-Reaper commented 1 year ago

@rajveerb

Working now, ty

The-Death-Reaper commented 1 year ago

Update - Oct 18th, 2023

The-Death-Reaper commented 1 year ago

Tasks - Oct 25th, 2023

Update - Oct 30th, 2023 - @rajveerb

Notes

rajveerb commented 11 months ago

I am trying to run the C++ experiment on CloudLab. Where are the instructions to set up the environment? Does your code work with CUDA 11.8? Where are the instructions/commands to run it?

Can you push the answers to the above questions to dev_mayur_kepler soon?

The-Death-Reaper commented 11 months ago

I've pushed the required info to the branch. I'll continue cleaning the directory and adding comments over the next few days. Do keep me updated on any issues you face so I can take those up as well :).

rajveerb commented 11 months ago

How does normalization happen? Is it applied to each image when the other transforms run, or only when an entire batch is ready?

I am asking because of the lines of code below in this file:

auto train_set = CustomDataset(data.first)
    .map(torch::data::transforms::Normalize<>({0.485, 0.456, 0.406}, {0.229, 0.224, 0.225}))
    .map(torch::data::transforms::Stack<>());
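
For context, a hedged reading of this chain, assuming the standard libtorch `torch::data` semantics (not verified against this repo's build):

```cpp
// - CustomDataset::get(i) produces one Example<> (a single image tensor plus target).
// - .map(Normalize<>) wraps the dataset with a per-tensor transform, so normalization
//   is applied to each example as it is fetched, before any batching happens.
// - .map(Stack<>) is a collation: it runs once a batch of already-normalized examples
//   has been gathered and stacks them into a single [N, C, H, W] tensor.
auto train_set = CustomDataset(data.first)
    .map(torch::data::transforms::Normalize<>({0.485, 0.456, 0.406}, {0.229, 0.224, 0.225}))
    .map(torch::data::transforms::Stack<>());
```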
rajveerb commented 11 months ago

@The-Death-Reaper

Can you also add support in the C++ code for passing the batch size, the GPUs to be used, the number of dataloader workers, and other configurable arguments, similar to the Python code?

Currently, I have to recompile each time I change one of these.

The-Death-Reaper commented 11 months ago

Normalization

I'm not sure; I will look into this. But if you know how it happens in the Python implementation, it is safe to assume the same happens in the C++ version, since the backend remains the same.

CustomDataset inherits from Dataset, which I assume the Python datasets map to as well. Nonetheless, I'll update this thread with a confirmation.

Modifications

Sure, I can do that. If you can point me toward a script/file that shows how it should finally look, I can mimic the interface.

rajveerb commented 11 months ago

You can look at the arguments in this file.

Some args are not applicable, for instance the PyTorch profiler args, which you can ignore.

The-Death-Reaper commented 11 months ago

@rajveerb I've pushed a basic modification to read configurations from a config file. Don't forget to copy the config file into the build folder (or modify it there). Let me know if this is sufficient.

The exact Python interface is not mimicked, since the C++ version would need a fixed ordering of positional arguments, but if that is preferred over a config file I can make that change quickly too. Also, do check whether more parameters are needed.
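
For readers of this thread, a hypothetical sketch of what a minimal key=value config reader might look like (the actual code is in the branch; the keys `batch_size` and `num_workers` here are placeholders):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// Hypothetical key=value config reader: lines like "batch_size=256" or "num_workers=8"
// become map entries; lines starting with '#' are treated as comments.
std::unordered_map<std::string, std::string> read_config(const std::string& path) {
  std::unordered_map<std::string, std::string> cfg;
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') continue;
    auto pos = line.find('=');
    if (pos == std::string::npos) continue;
    cfg[line.substr(0, pos)] = line.substr(pos + 1);
  }
  return cfg;
}

// Usage sketch:
//   auto cfg = read_config("train_config.txt");
//   int batch_size = std::stoi(cfg.at("batch_size"));
//   int num_workers = std::stoi(cfg.at("num_workers"));
```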

rajveerb commented 10 months ago

@The-Death-Reaper

I looked at the code; I haven't tested it yet.

This uses only a single GPU, correct, per our discussion about Data Parallel in the meetings?

Is there a reason for num_worker to be commented out?

Also, can you please clean up the scripts for setting up the environment?