microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python] `Dataset` cannot be constructed from sparse `Sequence` #5207

Open tony-theorem opened 2 years ago

tony-theorem commented 2 years ago

Description

A Dataset cannot be constructed from a Sequence that returns sparse data. This is unexpected, since Dataset generally supports sparse data.

Reproducible example

from __future__ import annotations
import numbers
from typing import Iterable

import lightgbm as lgbm
import numpy as np
from scipy import sparse

class SparseSequence(lgbm.Sequence):
    def __init__(self, sparse_data: sparse.csr_array) -> None:
        assert sparse_data.ndim == 2
        self.sparse_data = sparse_data

    def __len__(self) -> int:
        return self.sparse_data.shape[0]

    def __getitem__(
        self,
        idx: numbers.Integral | slice | Iterable[int],
    ) -> sparse.csr_array:
        if isinstance(idx, numbers.Integral):
            return self._get_row(int(idx))
        elif isinstance(idx, (slice, Iterable)):
            iter_idx = range(len(self))[idx] if isinstance(idx, slice) else idx
            rows = [self._get_row(i) for i in iter_idx]
            return (
                sparse.csr_array(sparse.vstack(rows, format="csr"))
                if len(rows) != 0 else
                sparse.csr_array((0, self.sparse_data.shape[1]))
            )
        else:
            raise TypeError(
                f"Sequence index must be integer, slice or iterable, got {type(idx).__name__}"
            )

    def _get_row(self, idx: int) -> sparse.csr_array:
        return self.sparse_data[[idx], :]

sparse_array = sparse.csr_array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
labels = [0, 1, 1, 0]

# Succeeds
lgbm.Dataset(data=sparse_array, label=labels).construct()
# Fails
lgbm.Dataset(data=SparseSequence(sparse_array), label=labels).construct()

Environment info

Python version 3.9.12

lightgbm == 3.3.2
numpy == 1.22.3
scipy == 1.8.0

Additional Comments

The immediate cause of the failure is the check at https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1543. Sparse arrays and matrices do not have a flags attribute, which leads to the AttributeError. Even if this check is removed, a TypeError is then raised in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1545-L1571 when numpy ufuncs are applied to the sparse data.
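To illustrate the missing attribute, and one possible stopgap, here is a minimal sketch that uses only numpy and scipy. The helper name `densify_rows` is hypothetical and not part of LightGBM. It assumes that converting each batch to a dense array is acceptable for the data size, which gives up the memory benefit of sparse storage. `csr_matrix` is used here instead of `csr_array` only so that the snippet also runs on older scipy versions.

```python
import numpy as np
from scipy import sparse


def densify_rows(batch):
    """Return `batch` as a dense, C-contiguous float64 ndarray.

    Hypothetical helper (not a LightGBM API): a custom Sequence could call it
    at the end of `__getitem__` so that only dense numpy arrays are returned.
    """
    if sparse.issparse(batch):
        batch = batch.toarray()
    return np.ascontiguousarray(batch, dtype=np.float64)


sparse_rows = sparse.csr_matrix(
    [[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64
)

# scipy sparse containers do not expose the numpy `flags` attribute,
# so any code that reads `.flags` on them raises AttributeError.
has_flags = hasattr(sparse_rows, "flags")

# After densifying a slice of rows, `flags` is available as on any ndarray.
dense_rows = densify_rows(sparse_rows[1:3])
is_contiguous = dense_rows.flags["C_CONTIGUOUS"]
```

This only sidesteps the problem on the caller's side. Proper sparse support in `Sequence` would still need changes inside LightGBM itself.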

jameslamb commented 2 years ago

Thanks very much for the thorough report, @tony-theorem !!! We really appreciate the time you put into creating a reproducible example.

Are you interested in working on a contribution to add this support? We'd be happy to help you with that if you are interested. Otherwise, someone else here will try to pick up the work of adding tests and a fix.


In the future, when opening issues here, please do the following:

  1. Use commit-anchored links
  2. Include the literal text of error messages
    • this allows search engines to find your bug report when others encounter the same issue and search for the error message
    • AttributeError: flags not found