[python] `Dataset` cannot be constructed from sparse `Sequence`

Description

A Dataset cannot be constructed from a Sequence that returns sparse data. This behavior is unexpected, as Dataset generally supports sparse data.

Reproducible example

from __future__ import annotations
import numbers
from typing import Iterable

import lightgbm as lgbm
import numpy as np
from scipy import sparse

class SparseSequence(lgbm.Sequence):
    def __init__(self, sparse_data: sparse.csr_array) -> None:
        assert sparse_data.ndim == 2
        self.sparse_data = sparse_data

    def __len__(self) -> int:
        return self.sparse_data.shape[0]

    def __getitem__(
        self,
        idx: numbers.Integral | slice | Iterable[int],
    ) -> sparse.csr_array:
        if isinstance(idx, numbers.Integral):
            return self._get_row(int(idx))
        elif isinstance(idx, (slice, Iterable)):
            iter_idx = range(len(self))[idx] if isinstance(idx, slice) else idx
            rows = [self._get_row(i) for i in iter_idx]
            return (
                sparse.csr_array(sparse.vstack(rows, format="csr"))
                if len(rows) != 0 else
                sparse.csr_array((0, self.sparse_data.shape[1]))
            )
        else:
            raise TypeError(
                f"Sequence index must be integer, slice or iterable, got {type(idx).__name__}"
            )

    def _get_row(self, idx: int) -> sparse.csr_array:
        return self.sparse_data[[idx], :]

sparse_array = sparse.csr_array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
labels = [0, 1, 1, 0]

# Succeeds
lgbm.Dataset(data=sparse_array, label=labels).construct()
# Fails
lgbm.Dataset(data=SparseSequence(sparse_array), label=labels).construct()

Environment info

Python version 3.9.12

lightgbm == 3.3.2
numpy == 1.22.3
scipy == 1.8.0

Additional Comments

The immediate cause of the failure is https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1543. Sparse arrays and matrices do not have a flags attribute, which leads to the AttributeError. Even if this check is removed, a TypeError is then raised in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1545-L1571 when numpy ufuncs are applied to the sparse data.
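
To illustrate the first point in isolation, here is a minimal sketch that needs only numpy and scipy (assuming scipy >= 1.8 for csr_array). The variable names are made up for this example and nothing here touches LightGBM itself. It only shows that a dense array has a flags attribute while a sparse array does not:

```python
import numpy as np
from scipy import sparse

# Illustrative data only; the names below are not taken from LightGBM.
dense = np.array([[0.0, 1.0], [1.0, 0.0]])
sparse_rows = sparse.csr_array(dense)

# NumPy arrays expose memory-layout information through `.flags`.
dense_has_flags = hasattr(dense, "flags")

# SciPy sparse containers do not define `.flags`, so any code path that
# reads it (for example to check contiguity) raises AttributeError.
sparse_has_flags = hasattr(sparse_rows, "flags")

print(dense_has_flags, sparse_has_flags)
```

Any code that reads the attribute directly on a sparse batch, without a hasattr guard, will therefore raise before the data is ever pushed into the Dataset.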

Thanks very much for the thorough report, @tony-theorem !!! We really appreciate the time you put into creating a reproducible example.

Are you interested in working on a contribution to add this support? We'd be happy to help you with that if you are interested. Otherwise, someone else here will try to pick up the work of adding tests and a fix.

In the future, when opening issues here, please do the following:

Use commit-anchored links
- see https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files
- "line 1545 of file basic.py on master" may point to something different a year from now than it did at the time you wrote this
- here are such links for what you pointed to:
  - https://github.com/microsoft/LightGBM/blob/c000b8cc689a9b3ee8bf4f294749d2f848f7176f/python-package/lightgbm/basic.py#L1543
  - https://github.com/microsoft/LightGBM/blob/c000b8cc689a9b3ee8bf4f294749d2f848f7176f/python-package/lightgbm/basic.py#L1545-L1571
Include the literal text of error messages
- this allows search engines to find your bug report when others encounter the same issue and search an error message
- AttributeError: flags not found
