migueldeicaza / TensorFlowSharp

TensorFlow API for .NET languages
MIT License

Working XOR-Example - Batch Feeding does not work properly #249

Open Nurator opened 6 years ago

Nurator commented 6 years ago

Hi!

Over the last few days I struggled to implement a working XOR tutorial example, as described in https://aimatters.wordpress.com/2016/01/16/solving-xor-with-a-neural-network-in-tensorflow/

Fortunately, I got it to work properly, so if anyone is interested, here is the code (feel free to use it as an example file in the repository):

// Define the input and output of the XOR example, see the Python example at
// https://aimatters.wordpress.com/2016/01/16/solving-xor-with-a-neural-network-in-tensorflow/
double[,] xData = new double[,]
{
    {0, 0},
    {0, 1},
    {1, 0},
    {1, 1}
};
double[] yData = new double[]
{
    0,
    1,
    1,
    0
};

// Create a new TFSession to do anything 
using (var session = new TFSession())
{
    // Initialize a graph to build the neural network structure
    var graph = session.Graph;

    // Define the size of the input and output
    var x = graph.VariableV2(new TFShape(1, 2), TFDataType.Double);
    var y = graph.VariableV2(new TFShape(1, 1), TFDataType.Double);

    // Define the unknown weights Theta and the biases of both layers
    var Theta1 = graph.VariableV2(new TFShape(2, 2), TFDataType.Double);
    var Theta2 = graph.VariableV2(new TFShape(2, 1), TFDataType.Double);
    var Bias1 = graph.VariableV2(new TFShape(1, 2), TFDataType.Double);
    var Bias2 = graph.VariableV2(new TFShape(1, 1), TFDataType.Double);

    // Define the actual computation of the output prediction
    var A2 = graph.Sigmoid(graph.Add(graph.MatMul(x, Theta1), Bias1));
    var Prediction = graph.Sigmoid(graph.Add(graph.MatMul(A2, Theta2), Bias2));

    // Define initialization of weights to random values and biases to 0
    var initTheta1 = graph.Assign(Theta1, graph.RandomNormal(new TFShape(2, 2)));
    var initTheta2 = graph.Assign(Theta2, graph.RandomNormal(new TFShape(2, 1)));
    var initBias1 = graph.Assign(Bias1, graph.Const(new double[,] { { 0, 0 } }));
    var initBias2 = graph.Assign(Bias2, graph.Const(new double[,] { { 0 } }));

    // Define the cost function you want to minimize (cross entropy, MSE, etc.)
    var firstcost = graph.Mul(y, graph.Log(Prediction));
    var secondcost = graph.Mul(graph.Sub(graph.OnesLike(y), y), graph.Log(graph.Sub(graph.OnesLike(y), Prediction)));
    var cost = graph.ReduceMean(graph.Neg(graph.Add(firstcost, secondcost)));
    //var cost = graph.ReduceMean(graph.SquaredDifference(y, Prediction));
    //var cost = graph.ReduceMean(graph.Abs(graph.Sub(y, Prediction)));

    // Define the learning rate 
    var learning_rate = graph.Const(0.01);

    // Define computation of the gradients of the cost function with respect to all learnable values in the network
    var grad = graph.AddGradients(new TFOutput[] { cost }, new TFOutput[] { Theta1, Theta2, Bias1, Bias2 });

    // Optimization works by applying gradient descent to all learnable values
    // Make sure that the order matches with the AddGradients function!
    var optimize = new[]
    {                   
        graph.ApplyGradientDescent(Theta1, learning_rate, grad[0]).Operation,
        graph.ApplyGradientDescent(Theta2, learning_rate, grad[1]).Operation,
        graph.ApplyGradientDescent(Bias1, learning_rate, grad[2]).Operation,
        graph.ApplyGradientDescent(Bias2, learning_rate, grad[3]).Operation,
    };

    // After defining the graph, we actually initialize the values 
    session.GetRunner().AddTarget(initTheta1.Operation, initTheta2.Operation, initBias1.Operation, initBias2.Operation).Run();

    // Run for enough epochs to get a good performance
    for (var i = 0; i < 100000; i++)
    {
        for (var j = 0; j < 4; j++)
        {
            // Get each row of xData in Tensor form
            var xDataFeed = new TFTensor(GetRowFrom2DArray(xData, j));

            // Add the input and output data to the network one by one. Call Run() to call the optimize
            // function and thus the gradient descent once
            session.GetRunner()
            .AddInput(x, xDataFeed)
            .AddInput(y, yData[j])
            .AddTarget(optimize).Run();
        }

        if (i % 10000 == 0)
        {
            // Every 10000 epochs, display the current prediction for all x values
            // The Fetch command gets the current value of prediction and stores it in result[0]
            var result = session.GetRunner()
            .AddInput(x, xData)
            .AddInput(y, yData)
            .Fetch(Prediction).Run();
            double[,] PredictArray = (double[,])result[0].GetValue();

            Console.WriteLine("Prediction after {0} iterations:", i);
            for (int j = 0; j < 4; j++)
            {
                // Display ground truth and prediction
                Console.WriteLine("Expected: {0} Prediction: {1}", yData[j], PredictArray[j, 0]);
            }
        }
    }
}
static double[,] GetRowFrom2DArray(double[,] sliceArray, int rowindex)
{
    // Helper function to get the data slice out of xData
    double[,] returnArray = new double[1, sliceArray.GetLength(1)];
    for (int i = 0; i < sliceArray.GetLength(1); i++)
    {
        returnArray[0, i] = sliceArray[rowindex, i];
    }
    return returnArray;
}

However, I struggled for a long time with batch feeding the input. Why is it not possible to just use .AddInput(x, xData).AddInput(y, yData) for training? If I don't feed the rows one by one, the XOR output becomes 0.5 for each input, so the network learns nothing. Also, ReduceSum is 16 times higher than ReduceMean, suggesting that maybe all entries of the 2D xData get used one by one rather than row-wise. Any idea why this is not working? Or is this intended?

YanYas commented 6 years ago

There are so few examples available of how to use AddGradients and apply it to optimizers. Thanks very much for sharing!

migueldeicaza commented 6 years ago

I do not quite understand the question, as I do not know where you are getting stuck.

For your first question, I would need to know what it is that you tried, what did not work, and how it failed.

When you talk about "feeding them one by one", I do not know what it is that you are feeding. I have to guess.

I do not know what "reduceSum is 16 times higher than reduceMean" means; you are asking me to go and debug a problem without giving me enough information.

And I could use a full example, not just a snippet.

Nurator commented 6 years ago

OK, sorry, I will try to make my question clearer. Here is the complete working code:

using System;
using TensorFlow;

namespace XOR
{
    class Program
    {
        static void Main(string[] args)
        {
            XOR();
        }

        static void XOR() {

            // Define the input and output of the XOR example, see the Python example at
            // https://aimatters.wordpress.com/2016/01/16/solving-xor-with-a-neural-network-in-tensorflow/
            double[,] xData = new double[,]
            {
                {0, 0},
                {0, 1},
                {1, 0},
                {1, 1}
            };
            double[] yData = new double[]
            {
                0,
                1,
                1,
                0
            };

            // Create a new TFSession to do anything 
            using (var session = new TFSession())
            {
                // Initialize a graph to build the neural network structure
                var graph = session.Graph;

                // Define the size of the input and output
                var x = graph.VariableV2(new TFShape(1, 2), TFDataType.Double);
                var y = graph.VariableV2(new TFShape(1, 1), TFDataType.Double);

                // Define the unknown weights Theta and the biases of both layers
                var Theta1 = graph.VariableV2(new TFShape(2, 2), TFDataType.Double);
                var Theta2 = graph.VariableV2(new TFShape(2, 1), TFDataType.Double);
                var Bias1 = graph.VariableV2(new TFShape(1, 2), TFDataType.Double);
                var Bias2 = graph.VariableV2(new TFShape(1, 1), TFDataType.Double);

                // Define the actual computation of the output prediction
                var A2 = graph.Sigmoid(graph.Add(graph.MatMul(x, Theta1), Bias1));
                var Prediction = graph.Sigmoid(graph.Add(graph.MatMul(A2, Theta2), Bias2));

                // Define initialization of weights to random values and biases to 0
                var initTheta1 = graph.Assign(Theta1, graph.RandomNormal(new TFShape(2, 2)));
                var initTheta2 = graph.Assign(Theta2, graph.RandomNormal(new TFShape(2, 1)));
                var initBias1 = graph.Assign(Bias1, graph.Const(new double[,] { { 0, 0 } }));
                var initBias2 = graph.Assign(Bias2, graph.Const(new double[,] { { 0 } }));

                // Define the cost function you want to minimize (cross entropy, MSE, etc.)
                var firstcost = graph.Mul(y, graph.Log(Prediction));
                var secondcost = graph.Mul(graph.Sub(graph.OnesLike(y), y), graph.Log(graph.Sub(graph.OnesLike(y), Prediction)));
                var cost = graph.ReduceMean(graph.Neg(graph.Add(firstcost, secondcost)));
                //var cost = graph.ReduceMean(graph.SquaredDifference(y, Prediction));
                //var cost = graph.ReduceMean(graph.Abs(graph.Sub(y, Prediction)));

                // Define the learning rate 
                var learning_rate = graph.Const(0.01);

                // Define computation of the gradients of the cost function with respect to all learnable values in the network
                var grad = graph.AddGradients(new TFOutput[] { cost }, new TFOutput[] { Theta1, Theta2, Bias1, Bias2 });

                // Optimization works by applying gradient descent to all learnable values
                // Make sure that the order matches with the AddGradients function!
                var optimize = new[]
                {                   
                    graph.ApplyGradientDescent(Theta1, learning_rate, grad[0]).Operation,
                    graph.ApplyGradientDescent(Theta2, learning_rate, grad[1]).Operation,
                    graph.ApplyGradientDescent(Bias1, learning_rate, grad[2]).Operation,
                    graph.ApplyGradientDescent(Bias2, learning_rate, grad[3]).Operation,
                };

                // After defining the graph, we actually initialize the values 
                session.GetRunner().AddTarget(initTheta1.Operation, initTheta2.Operation, initBias1.Operation, initBias2.Operation).Run();

                // Run for enough epochs to get a good performance
                for (var i = 0; i < 100000; i++)
                {
                    for (var j = 0; j < 4; j++)
                    {
                        // Get each row of xData in Tensor form
                        var xDataFeed = new TFTensor(GetRowFrom2DArray(xData, j));

                        // Add the input and output data to the network one by one. Call Run() to call the optimize
                        // function and thus the gradient descent once
                        session.GetRunner()
                        .AddInput(x, xDataFeed)
                        .AddInput(y, yData[j])
                        .AddTarget(optimize).Run();
                    }

                    if (i % 10000 == 0)
                    {
                        // Every 10000 epochs, display the current prediction for all x values
                        // The Fetch command gets the current value of prediction and stores it in result[0]
                        var result = session.GetRunner()
                        .AddInput(x, xData)
                        .AddInput(y, yData)
                        .Fetch(Prediction).Run();
                        double[,] PredictArray = (double[,])result[0].GetValue();

                        Console.WriteLine("Prediction after {0} iterations:", i);
                        for (int j = 0; j < 4; j++)
                        {
                            // Display ground truth and prediction
                            Console.WriteLine("Expected: {0} Prediction: {1}", yData[j], PredictArray[j, 0]);
                        }
                    }
                }
            }
        }

        static double[,] GetRowFrom2DArray(double[,] sliceArray, int rowindex)
        {
            // Helper function to get the data slice out of xData
            double[,] returnArray = new double[1, sliceArray.GetLength(1)];
            for (int i = 0; i < sliceArray.GetLength(1); i++)
            {
                returnArray[0, i] = sliceArray[rowindex, i];   
            }
            return returnArray;
        }
    }
}

As you can see, in the "for i < 100000" loop I feed each row of xData and yData as input one after another, with the help of the helper function "GetRowFrom2DArray". However, in the "i % 10000 == 0" case I pass the whole xData and yData directly as input and get the right cost back. Doing it as described works as intended. BUT, if I instead use

session.GetRunner()
    .AddInput(x, xData)
    .AddInput(y, yData)
    .AddTarget(optimize).Run();

directly, the network does not learn anything. In fact, the cost still decreases, but the prediction becomes 0.5 for every input. Thus, it seems like AddInput is treating the complete xData in the wrong way, perhaps using it element-wise rather than row-wise. Also, the cost from var cost = graph.ReduceSum(graph.Neg(graph.Add(firstcost, secondcost))); is 16 times as high as the one from var cost = graph.ReduceMean(graph.Neg(graph.Add(firstcost, secondcost)));, although there are only 4 elements to learn from.

I hope this makes my problem clearer. Thank you for your awesome work!

th1761 commented 6 years ago

I noticed that if I change your definition of yData from double[] yData = new double[] { 0, 1, 1, 0 }; to double[,] yData = new double[,] { { 0 }, { 1 }, { 1 }, { 0 } };, it seems to work as expected, without having to feed the optimizer a row at a time.
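
Here is roughly what the batch-feeding training loop would look like with that change. This is only a sketch: it reuses the session, graph, x, y, optimize, and Prediction definitions from the program above, and I have not re-tuned the learning rate or iteration count for batch updates.

// Batch-feeding variant: feed all four rows at once instead of one row per Run() call.
double[,] xData = new double[,] { { 0, 0 }, { 0, 1 }, { 1, 0 }, { 1, 1 } };
double[,] yData = new double[,] { { 0 }, { 1 }, { 1 }, { 0 } }; // redeclared as 2-D, shape (4, 1) instead of (4)

for (var i = 0; i < 100000; i++)
{
    // One gradient-descent step on the whole batch. With the 1-D yData the feed has
    // shape (4), which can broadcast against the (4, 1) Prediction into a (4, 4) cost
    // matrix -- that would also explain why ReduceSum came out 16 times higher than
    // ReduceMean. The (4, 1) yData keeps the cost at the intended 4 elements.
    session.GetRunner()
        .AddInput(x, xData)
        .AddInput(y, yData)
        .AddTarget(optimize)
        .Run();

    if (i % 10000 == 0)
    {
        // Fetch the current predictions for all four inputs
        var result = session.GetRunner()
            .AddInput(x, xData)
            .AddInput(y, yData)
            .Fetch(Prediction)
            .Run();
        double[,] predictions = (double[,])result[0].GetValue();

        Console.WriteLine("Prediction after {0} iterations:", i);
        for (int j = 0; j < 4; j++)
            Console.WriteLine("Expected: {0} Prediction: {1}", yData[j, 0], predictions[j, 0]);
    }
}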

I noticed it runs a lot faster, since you execute "optimize" once per iteration rather than 4 times, but it also converges more slowly. Comparing "cost" vs. seconds spent optimizing for row-at-a-time and batch feeding, it doesn't look like extracting rows is costing you much (this is running on CPU only). Each sample point is 10,000 iterations.

[chart: cost vs. seconds spent optimizing, row-at-a-time vs. batch feeding]

Thanks for posting such a great example - I learned a lot from going through this! I didn't know the gradient methods were even exposed in the C interface.