pyg-team / pytorch-frame

Tabular Deep Learning Library for PyTorch
https://pytorch-frame.readthedocs.io
MIT License

EmbeddingBag.cu:143: EmbeddingBag_updateOutputKernel_sum_mean: block: [61,0,0], thread: [0,2,0] Assertion `input[emb] < numRows` failed. #439

Closed davidfstein closed 1 month ago

davidfstein commented 1 month ago

I'm trying to run the ExampleTransformer from the documentation on a custom dataset. Training proceeds fine, but the forward pass fails every time at one particular batch. Does anyone know what would cause this?

from typing import Any, Dict, List

import torch
import torch.nn.functional as F
from torch.nn import Linear, Module, ModuleList

import torch_frame
from torch_frame import stype
from torch_frame.data.stats import StatType
from torch_frame.nn import (
    EmbeddingEncoder,
    LinearEncoder,
    MultiCategoricalEmbeddingEncoder,
    StypeWiseFeatureEncoder,
    TabTransformerConv,
)


class ExampleTransformer(Module):
    def __init__(
        self,
        channels: int,
        out_channels: int,
        num_layers: int,
        num_heads: int,
        col_stats: Dict[str, Dict[StatType, Any]],
        col_names_dict: Dict[torch_frame.stype, List[str]],
    ):
        super().__init__()
        # Encode each supported stype into a shared `channels`-dim space.
        self.encoder = StypeWiseFeatureEncoder(
            out_channels=channels,
            col_stats=col_stats,
            col_names_dict=col_names_dict,
            stype_encoder_dict={
                stype.categorical: EmbeddingEncoder(),
                stype.numerical: LinearEncoder(),
                stype.multicategorical: MultiCategoricalEmbeddingEncoder(),
            },
        )
        # Stack of TabTransformer convolutions over the column dimension.
        self.tab_transformer_convs = ModuleList([
            TabTransformerConv(
                channels=channels,
                num_heads=num_heads,
            ) for _ in range(num_layers)
        ])
        self.decoder = Linear(channels, out_channels)

    def forward(self, tf: torch_frame.TensorFrame) -> torch.Tensor:
        # x: [batch_size, num_cols, channels]
        x, _ = self.encoder(tf)
        for tab_transformer_conv in self.tab_transformer_convs:
            x = tab_transformer_conv(x)
        # Mean-pool over the column dimension, then project to logits.
        out = self.decoder(x.mean(dim=1))
        return out

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ExampleTransformer(
    channels=16,
    out_channels=train.num_classes,
    num_layers=2,
    num_heads=4,
    col_stats=train.col_stats,
    col_names_dict=train.tensor_frame.col_names_dict,
).to(device)

optimizer = torch.optim.Adam(model.parameters())

for epoch in range(5):
    for tf in train_data_loader:
        tf = tf.to(device)
        pred = model(tf)
        loss = F.cross_entropy(pred, tf.y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(epoch)
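
A device-side assert like the one in the title usually hides the Python-level cause. Running the failing batch on CPU (or setting CUDA_LAUNCH_BLOCKING=1 before the first CUDA call) turns it into a readable IndexError. A minimal debugging sketch, using the model and loader above:

import os

# Must be set before CUDA is initialized to get a synchronous traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, run on CPU: an out-of-range embedding lookup then raises
# a plain IndexError that points at the offending layer and batch.
model_cpu = model.cpu()
for tf in train_data_loader:
    _ = model_cpu(tf)
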
akihironitta commented 1 month ago

Can you check the model definition (i.e., the size of the embedding layers) against the indices in the batch that is fed into the embedding layer? My guess is that there's a new category at val/test time that isn't included in col_stats, but I can't be 100% sure until I reproduce it on my side. If you can provide your code and the full error message so we can reproduce it, one of us will take a look :)
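
A quick way to test that guess is to compare, per categorical column, the number of categories recorded in col_stats against the indices that actually appear in each batch. A rough sketch, assuming the `train` dataset from the snippet above, an assumed `val_data_loader`, and that categorical stats store their (categories, counts) pair under StatType.COUNT:

from torch_frame import stype
from torch_frame.data.stats import StatType

cat_cols = train.tensor_frame.col_names_dict[stype.categorical]
# StatType.COUNT holds (categories, counts) for each categorical column.
num_categories = {
    col: len(train.col_stats[col][StatType.COUNT][0]) for col in cat_cols
}
for tf in val_data_loader:  # `val_data_loader` is assumed here
    feat = tf.feat_dict[stype.categorical]  # [num_rows, num_cols] of indices
    for i, col in enumerate(cat_cols):
        bad = feat[:, i] >= num_categories[col]
        if bad.any():
            print(f"{col}: found index {int(feat[:, i].max())}, "
                  f"but only {num_categories[col]} known categories")

The multicategorical columns (the EmbeddingBag in the assertion comes from MultiCategoricalEmbeddingEncoder) can be checked the same way against their own stats.
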

davidfstein commented 1 month ago

It seems it was actually an unrelated error, but shouldn't the library be able to handle new categories at val/test time? I worked around this by using the "split_col" of the dataset, but what if I want my train, val, and test data to be separate datasets?
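
For reference, the split_col workaround keeps a single Dataset whose col_stats are computed over the full frame, so every split sees the same category vocabulary. A minimal sketch, assuming a DataFrame `df` with a "split" column and a `col_to_stype` mapping (the column and target names here are placeholders):

from torch_frame.data import Dataset

# "split" is assumed to hold 0/1/2 for train/val/test rows.
dataset = Dataset(df, col_to_stype=col_to_stype, target_col="target",
                  split_col="split")
dataset.materialize()  # col_stats computed once over the whole frame
train, val, test = dataset.split()  # all three share the same col_stats
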

akihironitta commented 1 month ago

PyTorch Frame could add some default handling, e.g., treating new categories as N/A and mapping N/A to the most frequent category. I'm closing this issue since your original issue has already been resolved, but feel free to open a new issue to discuss this further.
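
Until such default handling exists, one manual option when train/val/test are separate frames is to map unseen categories to NaN before building the val/test datasets, roughly the behavior described above. An illustrative pandas sketch (`train_df`, `test_df`, and `categorical_cols` are assumed):

import numpy as np

for col in categorical_cols:
    known = set(train_df[col].dropna().unique())
    # Replace categories unseen at train time with NaN.
    test_df[col] = test_df[col].where(test_df[col].isin(known), np.nan)
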