mthorrell / gbnet

Gradient Boosting Modules for pytorch
MIT License

Bug in generating higher-dimensional output #34

Closed cswiercz closed 5 days ago

cswiercz commented 6 days ago

Please delete #33, if possible

Summary

Using an output dimension greater than one when initializing an XGBModule results in a dimension casting error when calling XGBModule.forward().

Example Code

Modifying the "Basic training of a GBM" example in the README, I create an XGBModule with a multi-dimensional output:

In [76]: np.random.seed(100)
    ...: n = 1000
    ...: input_dim = 20

In [77]: output_dim = 4

In [78]: X = np.random.random([n, input_dim])
    ...: B = np.random.random([input_dim, output_dim])
    ...: Y = X.dot(B) + np.random.random([n, output_dim])

In [79]: Y
Out[79]: 
array([[4.78748462, 4.88223932, 4.61753548, 4.95553127],
       [5.4778647 , 5.35047653, 4.7938925 , 5.50056676],
       [4.69915079, 5.47207687, 5.45985558, 4.75764288],
       ...,
       [3.67035842, 3.26544663, 4.31689218, 2.96028194],
       [6.1167191 , 5.88111875, 7.00932119, 5.82753769],
       [7.5207671 , 5.30229013, 6.0731196 , 4.3468991 ]])

In [80]: xnet = xgbmodule.XGBModule(n, input_dim, output_dim, params={})
    ...: 

In [81]: xnet.forward(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[81], line 1
----> 1 xnet.forward(X)

File ~/anaconda3/envs/analytics/lib/python3.11/site-packages/gbnet/xgbmodule.py:56, in XGBModule.forward(self, input_array, return_tensor)
     52 if self.training:
     53     FX_detach = self.FX.detach()
     54     FX_detach.copy_(
     55         torch.tensor(
---> 56             preds.reshape([self.batch_size, self.output_dim]), dtype=torch.float
     57         )
     58     )
     60 if return_tensor:
     61     if self.training:

ValueError: cannot reshape array of size 1000 into shape (1000,4)
> /Users/chrisswierczewski/anaconda3/envs/analytics/lib/python3.11/site-packages/gbnet/xgbmodule.py(56)forward()
     54             FX_detach.copy_(
     55                 torch.tensor(
---> 56                     preds.reshape([self.batch_size, self.output_dim]), dtype=torch.float
     57                 )
     58             )

Debugging

Looks like the underlying bst object isn't outputting the correct number of dimensions. The above run has the following set within self.forward():

ipdb> preds.shape
(1000,)
ipdb> self.dtrain
ipdb> self.training
True
ipdb> input_array.shape
(1000, 20)

This configuration puts us in the branch where

preds = self.bst.predict(xgb.DMatrix(input_array))

I haven't looked into it yet, but it seems the label set is batch_size * output_dim-dimensional:

    def __init__(...):
        # ...
        self.bst = xgb.train(
            self.params,
            xgb.DMatrix(init_matrix, label=np.zeros(batch_size * output_dim)),
            num_boost_round=0,
        )
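
For reference, here is the failing reshape in isolation (a minimal sketch using the shapes from the traceback above):

import numpy as np

# forward() expects batch_size * output_dim predictions that can be viewed
# as (batch_size, output_dim).
preds_new = np.zeros(1000 * 4)    # length returned when multi-output works
preds_new.reshape([1000, 4])      # fine

preds_old = np.zeros(1000)        # length observed above in self.forward()
# preds_old.reshape([1000, 4])    # ValueError: cannot reshape array of size 1000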

Use Case

I'm trying to do something similar to the XGBModule + LGBModule example where I use an XGBModule as the initial layer and pass an intermediate-dimensional result to a basic MLP.
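
For concreteness, a rough sketch of what I'm after (the GBMWithMLPHead class, hidden_dim, and the MLP head here are my own illustrative choices, not gbnet API):

import torch
from gbnet import xgbmodule

class GBMWithMLPHead(torch.nn.Module):
    def __init__(self, batch_size, input_dim, hidden_dim, output_dim):
        super().__init__()
        # Boosted trees produce an intermediate-dimensional representation...
        self.gbm = xgbmodule.XGBModule(batch_size, input_dim, hidden_dim, params={})
        # ...which a basic MLP maps to the final output.
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, X):
        return self.mlp(self.gbm(X))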

mthorrell commented 6 days ago

@cswiercz, awesome! I'm glad you're giving this a try. I think this is an XGBoost versioning issue. Older versions have slightly different interfaces (for example, the np.zeros(batch_size * output_dim) bit basically tricks XGBoost into producing multi-dimensional outputs). Can you post your version? If it supports multi-output, I/we can try to fix what's broken there.
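
A quick way to check (just the package's version string, nothing gbnet-specific):

import xgboost as xgb
print(xgb.__version__)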

Your code runs fine for me on main and xgboost version 2.1.1.

from gbnet import xgbmodule
import numpy as np
np.random.seed(100)
n = 1000
input_dim = 20
output_dim = 4
X = np.random.random([n, input_dim])
B = np.random.random([input_dim, output_dim])
Y = X.dot(B) + np.random.random([n, output_dim])

xnet = xgbmodule.XGBModule(n, input_dim, output_dim, params={})

xnet.forward(X)

# OUTPUT
# Parameter containing:
# tensor([[0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         ...,
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.]], requires_grad=True)
mthorrell commented 5 days ago

I was able to re-create the error on XGB 1.4.1 (v1.6.1 seems to work fine... I'm not sure about 1.5.1). And I was able to fix it with enough poking at the internal functions of xgb 1.4.1 (the ones that say "This function should not be called directly by users."). I won't be able to get a PR in tonight, but sometime tomorrow for sure.

I might also recommend updating xgboost :). Failing that, you could also just use lightgbm.
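
For example, something like this should pull in a working release (1.6.1 and 2.1.1 both behaved for me; pinning >=1.6 is my guess at the cutoff):

pip install --upgrade "xgboost>=1.6"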

cswiercz commented 5 days ago

Updating from XGBoost 1.5.1 to 1.6.1 resolved the issue. Thanks! Happy to review any PRs.

Now I have a separate issue where the test dataset "batch size" output doesn't match the "batch size" of the input. I noticed in the examples that only a single dataset was used to train and test. However, I'll at least cut a separate ticket and/or PR for this.

those that say "This function should not be called directly by users"

😂

Loving the idea behind this work, by the way. It makes total sense, though it's an interesting problem: how to handle the standard data batch processing of a typical NN model in XGB.

mthorrell commented 5 days ago

Now I have a separate issue where the test dataset "batch size" output doesn't match the "batch size" of the input. I noticed in the examples that only a single dataset was used to train and test. However, I'll at least cut a separate ticket and/or PR for this.

Put the torch module into eval mode (this also, for example, turns off dropout when using torch in the usual way). I'll add that explicitly to the examples, since it does need to be there.

from gbnet import xgbmodule
import numpy as np
np.random.seed(100)
n = 4
input_dim = 2
output_dim = 3
X = np.random.random([n, input_dim])
B = np.random.random([input_dim, output_dim])
Y = X.dot(B) + np.random.random([n, output_dim])

xnet = xgbmodule.XGBModule(n, input_dim, output_dim, params={})

xnet(X)
# Parameter containing:
# tensor([[0., 0., 0.],
#         [0., 0., 0.],
#         [0., 0., 0.],
#         [0., 0., 0.]], requires_grad=True)

xnet.eval()
xnet(np.random.random([2, input_dim]))
# tensor([[0., 0., 0.],
#         [0., 0., 0.]])
mthorrell commented 5 days ago

PR for eval steps added to the readme examples is here: https://github.com/mthorrell/gbnet/pull/35

And, closing out the original conversation: I won't make gbnet backward compatible with XGBoost 1.5.1 or below. It can be done, by the way, but if there's no need for it, I won't add it.