mlpack / mlpack

mlpack: a fast, header-only C++ machine learning library
https://www.mlpack.org/

Trouble replicating a PyTorch tutorial in mlpack #2048

Closed chase-metzger closed 4 years ago

chase-metzger commented 5 years ago

Hi, I just started using mlpack about two weeks ago. For the most part, everything has been going smoothly. I learned most ML concepts using either TensorFlow or PyTorch and their tensor APIs before finding mlpack.

I previously had used this article: https://towardsdatascience.com/writing-like-shakespeare-with-machine-learning-in-pytorch-d77f851d910c

My network/model is constructed as such:

FFN<NegativeLogLikelihood<cube, cube>, RandomInitialization> net;
net.Add<LSTM<>>(numLetters, hiddenSize, 2);
net.Add<Dropout<>>(0.1);
net.Add<Linear<>>(hiddenSize, numLetters);

The main part I'm having trouble with, it seems, is reimplementing the tensor operations to properly pass the inputs to the net.Train function. The operations all run and print the expected results, but the network's Train function throws a logic_error caused by an out-of-bounds access. I assume I'm not encoding the input correctly.

An example of how I'm encoding the input (what I thought was the equivalent in mlpack) is below:

const auto makeInput = [numLetters](const char *line) -> cube {
    const auto strLen = strlen(line);
    cube result(strLen, 1, numLetters);
    result = result.zeros();
    // One-hot encode each character of the line.
    for(int i = 0; i < strLen; ++i)
    {
        const auto letter = line[i];
        result.at(i, 0, static_cast<uword>(letter)) = 1.0;
    }
    return result;
};

Any help with getting me unstuck from this would be greatly appreciated.

Thanks

rcurtin commented 5 years ago

Hi @chase-metzger, sorry for the slow response. I think you are right that it is a data encoding issue. The RNN::Train() function takes an arma::cube where

- each slice should correspond to a time step,
- each column should correspond to a data point, and
- each row should correspond to a dimension.

So, e.g., predictors(i, j, k) is the i'th dimension of the j'th data point at time slice k.

(http://mlpack.org/doc/mlpack-3.2.1/doxygen/classmlpack_1_1ann_1_1RNN.html#a443a41ed01d03b2962d175cfab448c2c)
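
To make that layout concrete, here is a minimal sketch (the sizes are made up purely for illustration):

// A hypothetical dataset: 3 dimensions, 2 data points, 4 time steps.
arma::cube predictors(3, 2, 4, arma::fill::zeros);
// Set the 0th dimension of the 1st data point at time step 2:
predictors(0, 1, 2) = 1.0;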

It looks to me like your code has the sequence length as the number of rows and the input size as the number of slices. So I think that if you adapt your example so that the cube shape is the following:

cube result(numLetters, 1, strLen);

then the code should work. :+1:

P.S. you could do cube result(numLetters, 1, strLen, fill::zeros) to avoid the zeros() call. :)
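
Putting both suggestions together, an adapted version of your lambda might look like this (just a sketch, untested):

const auto makeInput = [numLetters](const char *line) -> cube {
    const auto strLen = strlen(line);
    // rows: one-hot letter dimensions; cols: one data point; slices: time steps
    cube result(numLetters, 1, strLen, fill::zeros);
    for(size_t i = 0; i < strLen; ++i)
        result.at(static_cast<uword>(line[i]), 0, i) = 1.0;
    return result;
};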

chase-metzger commented 5 years ago

@rcurtin Hi thanks for the response! (and sorry for my late one)

Unfortunately, changing all the encoding to that shape did not work. I'm still getting an out-of-bounds error from operator().

I ran a debugger, and before it crashes, the network is in the process of forwarding to NegativeLogLikelihood, which is the output layer. To be a little more specific, the crash occurs on line 39 of negative_log_likelihood_impl.hpp, which looks like this (it really crashes in Armadillo's Mat_meat.hpp, but I thought this info was relevant):

output -= input(currentTarget, i);

What I'm finding odd is that i in the above line is 0 in the debugger. So it's crashing on the first iteration?

rcurtin commented 5 years ago

Ah, thanks for the details. I think that the encoding you had before was incorrect, so maybe there were two problems. :) Anyway, thanks for working with a debugger a little bit to figure out what the issue was.

I think that this is an issue of labels. What are you passing to Train() as your responses? For NegativeLogLikelihood, this should be the desired response at each time step, from 1 to the number of classes (not 0 to the number of classes minus 1). I think in your case the number of classes is numLetters? I'm just guessing about that though.

Can you show the code you are using to make the responses?

chase-metzger commented 5 years ago

This is how numLetters is defined:

const int numLetters = 256; // the number of characters in the extended ASCII table

...and this is the current implementation of the responses function. You are correct about numLetters: I'm basically trying to get the network to output the next letter in a sequence from the learned model. I was following that tutorial, and it seemed like the number of classes should be the number of letters it can possibly output?

const auto makeTarget = [numLetters] (const char *line) -> cube {
    const auto strLen = strlen(line);
    cube result(1, strLen, 1, fill::zeros);
    for(int i = 0; i < strLen; ++i)
    {
        const auto letter = line[i];
        // The plus one is a test: ASCII values start at 0, so I wanted to
        // see if starting the class labels at 1 instead would work.
        const auto letterIndex = static_cast<double>(letter + 1);
        result.at(0, i, 0) = letterIndex;
    }
    return result;
};

How exactly are the responses supposed to be shaped?

rcurtin commented 5 years ago

For the NegativeLogLikelihood response, the responses should have basically the same shape as the predictors---so, since you are predicting individual letters, it should have a shape like this: cube responses(1, 1, strLen) (assuming there is only one data point). So, then, responses(0, 0, i) should be the integer-valued response (from 1 to numLetters) for the i'th element in the string. (Essentially, we are not one-hot encoding the response there, just passing in the single integer value that represents the result's class.)
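
As a sketch, assuming a hypothetical helper classOfElement(i) that returns the 1-indexed class of the i'th element, that would be:

cube responses(1, 1, strLen, fill::zeros);
for(size_t i = 0; i < strLen; ++i)
    responses(0, 0, i) = classOfElement(i); // an integer in [1, numLetters]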

I think that should work... give it a try and let me know what happens. :)

chase-metzger commented 5 years ago

Ok, so I've changed the shape of the responses and the encoding as described.

It's giving me error: matrix multiplication: incompatible matrix dimensions: 128x256 and 512x32. From the debugger, I believe the 256 is the number of classes, 128 is the hiddenSize, and I think the 32 is the batch size of the LSTM layer.

rcurtin commented 5 years ago

What happens if you remove the Dropout layer? I looked through the LSTM code and based on what you wrote in the initial post and what I see in there, I think the output size from that layer should be 256x32. So if that is actually 512x32, I'm not totally sure where that issue is coming from.

Anyway, I'm happy to try and debug it but I want to make sure I'm using the same code as you... if you can provide the whole code that you're using so I can compile and step through it and see what's wrong, I'm happy to. :+1:

chase-metzger commented 5 years ago

@rcurtin Thanks. Unfortunately, I'm in California during the power shutoffs and fires, so I won't be able to upload a repo until possibly Friday.

I did try removing the Dropout layer before losing power; I think it gave a different result, closer to what you described. Also, the 512 was my mistake: I had tried changing one variable from 256 to 512 before posting, just to see what would happen.

I'll get a file uploaded soon. One slight issue is that I'm using Qt for some of the string/file parsing: I load my training data from a CSV file and then convert everything to plain C++ strings. But I can easily remove that and pass in test data.

chase-metzger commented 5 years ago

Here's a gist that has all the main implementation, minus the file loading.

Currently, it gives error: matrix multiplication: incompatible matrix dimensions: 128x256 and 1x1. It seems really odd to me that a 1x1 gets passed in; I assume that comes from the target I constructed.

https://gist.github.com/chase-metzger/cd3b9cc41796c42dfecc015e8732d0da

--Thanks for all the help

rcurtin commented 5 years ago

Hey @chase-metzger,

I took a dive into the code and saw some things that could be the issue. I did manage to make it work. I'll go through each bit:

  1. An LSTM with an FFN doesn't make too much sense, so I'd suggest switching back to the RNN.

  2. An IdentityLayer<> is required as the first layer of an RNN if an LSTM is used for the first layer. This is so that backpropagation through time works correctly.

  3. The code that you had would create a training set with just the letter T, not the sample input THIS IS THE INPUT. I rewrote it below. There may be a few other changes too:

  std::vector<std::string> trainingData;
  trainingData.push_back(std::string("THIS IS THE INPUT"));

  const auto makeInput = [](const char *line) -> MatType {
    const auto strLen = strlen(line);
    // rows: number of dimensions
    // cols: number of sequences/points
    // slices: number of steps in sequences
    MatType result(numLetters, 1, strLen, fill::zeros);
    for(int i = 0; i < strLen; ++i)
    {
      const auto letter = line[i]; 
      result.at(static_cast<uword>(letter), 0, i) = 1.0;
    }
    return result;
  };

  const auto makeTarget = [] (const char *line) -> MatType {
    const auto strLen = strlen(line);
    // responses for NegativeLogLikelihood should be
    // non-one-hot-encoded class IDs (from 1 to num_classes)
    cube result(1, 1, strLen, fill::zeros);
    // the response is the *next* letter in the sequence
    for(int i = 0; i < strLen - 1; ++i)
    {
      const auto letter = line[i + 1];
      result.at(0, 0, i) = static_cast<uword>(letter) + 1.0;
    }
    // there is no "next letter" for the final character, so we
    // arbitrarily set its response to class 1
    result.at(0, 0, strLen - 1) = 1.0;
    return result;
  };

  std::vector<cube> inputs(trainingData.size());
  std::vector<cube> targets(trainingData.size());
  for(int i = 0; i < trainingData.size(); ++i)
  {
    inputs[i] = makeInput(trainingData[i].c_str());
    targets[i] = makeTarget(trainingData[i].c_str());
  }
  4. The Linear<> layer should have hiddenSize as its input size, since hiddenSize is the output size of the previous LSTM<> layer.

  5. I'd suggest setting rho to maxLineLength, as in RNN<> rnn(maxLineLength); and rnn.Add<LSTM<>>(numLetters, hiddenSize, maxLineLength); (but make sure maxLineLength is the maximum line length... so in this case, 17).

  6. The default optimizer configuration will be pretty slow for this example. You might try a simpler setup, kind of like this:

ens::SGD<> sgd(0.01, 1, 100 /* only 100 maximum iterations, just to see it work */);
rnn.Train(inputs[0], targets[0], sgd);
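
(For reference, those three arguments to ens::SGD<> are the step size, the batch size, and the maximum number of iterations.)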

So, overall, this is the code I have working. I also removed a little bit of code that became unnecessary.

#include <mlpack/methods/ann/rnn.hpp>
#include <mlpack/methods/ann/layer/lstm.hpp>
#include <mlpack/methods/ann/layer/dropout.hpp>

#include <cstring>
#include <string>
#include <vector>

int main(int argc, char *argv[])
{
  using namespace mlpack;
  using namespace mlpack::ann;
  using namespace arma;

  // We have to make sure backpropagation through time doesn't take more
  // time steps than we have.
  const int maxLineLength = 17;
  const int hiddenSize = 128;
  const int numLetters = 256;

  using MatType = cube;
  std::vector<std::string> trainingData;
  trainingData.push_back(std::string("THIS IS THE INPUT"));

  // The network. (The original code used an FFN, but an RNN is the right
  // choice for sequence prediction.)
  RNN<> rnn(maxLineLength);
  rnn.Add<IdentityLayer<>>(); // required before an initial LSTM so BPTT works
  rnn.Add<LSTM<>>(numLetters, hiddenSize, maxLineLength);
  rnn.Add<Dropout<>>(0.1);
  rnn.Add<Linear<>>(hiddenSize, numLetters);

  const auto makeInput = [](const char *line) -> MatType {
    const auto strLen = strlen(line);
    // rows: number of dimensions
    // cols: number of sequences/points
    // slices: number of steps in sequences
    MatType result(numLetters, 1, strLen, fill::zeros);
    for(int i = 0; i < strLen; ++i)
    {
      const auto letter = line[i];
      result.at(static_cast<uword>(letter), 0, i) = 1.0;
    }
    return result;
  };

  const auto makeTarget = [] (const char *line) -> MatType {
    const auto strLen = strlen(line);
    // responses for NegativeLogLikelihood should be
    // non-one-hot-encoded class IDs (from 1 to num_classes)
    cube result(1, 1, strLen, fill::zeros);
    // the response is the *next* letter in the sequence
    for(int i = 0; i < strLen - 1; ++i)
    {
      const auto letter = line[i + 1];
      result.at(0, 0, i) = static_cast<uword>(letter) + 1.0;
    }
    // there is no "next letter" for the final character, so we
    // arbitrarily set its response to class 1
    result.at(0, 0, strLen - 1) = 1.0;
    return result;
  };

  std::vector<cube> inputs(trainingData.size());
  std::vector<cube> targets(trainingData.size());
  for(int i = 0; i < trainingData.size(); ++i)
  {
    inputs[i] = makeInput(trainingData[i].c_str());
    targets[i] = makeTarget(trainingData[i].c_str());
  }

  ens::SGD<> sgd(0.01, 1, 100 /* only 100 maximum iterations, just to see it work */);
  rnn.Train(inputs[0], targets[0], sgd);
  return 0;
}
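
(As an aside, once training finishes, a quick sketch of getting outputs back out with RNN::Predict(); the output cube should have one row per class, with the same columns and time slices as the input:)

cube predictions;
rnn.Predict(inputs[0], predictions);
// predictions(c, 0, t) is the network's output for class c at time step t.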

There are a few things I noticed here that are a bit problematic; I can see that this was confusing, and I think we should work to improve that. So I will open some issues and link them to this one, and we can see how things go from there. :)

rcurtin commented 5 years ago

I opened #2070 and #2071; #1267 and the related PR #1366 are also relevant here---I think support like that would have helped the network be the right size.

chase-metzger commented 5 years ago

Wow, you have gone above and beyond to help me (relative to many other open-source projects).

Maybe I can make a few suggestions for the docs in the future (one for now).

Everything you describe makes sense. I was so lost in the matrix ops at the beginning, compared to numpy or pytorch, that I stopped paying attention to how the tutorial encoded the data. I completely forgot that the target is the next character in the sequence, not the one currently being predicted.

I'm gonna start reading the code more closely. I'll start where you suggested in the issues and work out from there.

Thanks so much. I can't wait to actually build what I set out to make: a GUI for looking at parts of the network (mostly tables and lists for displaying the matrices).

Thanks again

rcurtin commented 5 years ago

Happy to help!

You're right that the documentation could be improved; I opened #2080 as an attempt, and it should hopefully help. Feel free to comment on that if you have any suggestions.

It's definitely true that the matrix operations are a little different with Armadillo and C++ than they would be in other packages. The syntax is meant to be like MATLAB's, but that's a bit different from what it feels like in Python. :)
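
For instance, a couple of basic Armadillo idioms next to their numpy equivalents:

arma::mat A(3, 3, arma::fill::randu);
arma::mat B = A * A;  // matrix multiplication (numpy's A @ A)
arma::mat C = A % A;  // element-wise multiplication (numpy's A * A)
A.col(0).zeros();     // zero out a column, like A[:, 0] = 0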

Excited to hear about how the GUI goes; if you have any more questions, we're happy to try to answer them as best we can.

mlpack-bot[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! :+1: