Scipy sparse matrices not handled correctly by TPOT and autosklearn

openml / automlbenchmark

OpenML AutoML Benchmarking Framework

https://openml.github.io/automlbenchmark

Scipy sparse matrices not handled correctly by TPOT and autosklearn #370

Open sebhrusen opened 3 years ago

sebhrusen commented 3 years ago

Failing datasets: https://openml.org/t/360932

Serialization of sparse matrices was not applied correctly.
Once that is fixed, the frameworks still fail with the following errors:

# TPOT
  File "/Users/seb/repos/ml/automlbenchmark/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 1359, in _check_dataset
    self.config_dict
ValueError: Not all operators in None supports sparse matrix. Please use "TPOT sparse" for sparse matrix.

# autosklearn
  File "/Users/seb/repos/ml/automlbenchmark/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/utils/multiclass.py", line 288, in type_of_target
    if y.ndim > 2 or (y.dtype == object and len(y) and
TypeError: len() of unsized object

We'll improve support for sparse data in a future version: for now, we can simply deserialize the sparse matrices as dense matrices for the frameworks that don't use pandas.
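
A minimal sketch of that workaround, assuming the deserialized data arrives as SciPy sparse matrices. `to_dense` is a hypothetical helper name used for illustration, not a function from automlbenchmark, and the toy data is made up:

```python
import numpy as np
import scipy.sparse as sp


def to_dense(data):
    """Return a dense ndarray: densify sparse input, pass anything else through np.asarray."""
    if sp.issparse(data):
        return data.toarray()
    return np.asarray(data)


# Toy data standing in for a deserialized dataset (illustrative only).
X = sp.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]))
y = sp.csr_matrix(np.array([[1], [0], [1]]))

X_dense = to_dense(X)
y_dense = to_dense(y).ravel()  # flatten the (n, 1) column into a 1-D target
```

Densifying everything is the simplest option but gives up the memory savings of the sparse format, so it is only reasonable for datasets that fit in memory once dense.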

mfeurer commented 3 years ago

Just checking - are these sparse target matrices y? We might indeed not have tests for that.

CC @eddiebergman

sebhrusen commented 3 years ago

@mfeurer in this case both X and y are indeed sparse, though I'm not sure that makes sense for y. For now I fixed this by converting both to dense arrays, as I thought the problem was X, but it's very possible that some frameworks only need this for y.

mfeurer commented 3 years ago

Thanks for the clarification. Auto-sklearn should support sparse X, but we'll check, and will also check what the behavior for sparse y values is.

sebhrusen commented 3 years ago

@mfeurer for autosklearn, sparse X with dense y seems to work fine (and faster), meaning that in your case, sparse y was the issue. Thanks for noticing this: ideally we'd like to have frameworks using sparse data whenever possible, so I'll probably just make y dense by default and decide for each framework individually whether to keep X sparse. cc: @PGijsbers
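
One plausible reading of the `len() of unsized object` error above, offered as an assumption rather than a confirmed diagnosis: calling `np.asarray` on a SciPy sparse matrix does not densify it but wraps it in a 0-dimensional object array, which has no length. The sketch below shows that behaviour together with the "sparse X, dense y" approach. `densify_target` is a hypothetical helper name and the data is made up:

```python
import numpy as np
import scipy.sparse as sp

# Toy data (illustrative only): sparse features and a sparse (n, 1) target.
X = sp.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))
y_sparse = sp.csr_matrix(np.array([[1], [0], [1]]))

# np.asarray wraps the sparse matrix in a 0-d object array instead of densifying it.
wrapped = np.asarray(y_sparse)

try:
    len(wrapped)
    len_failed = False
except TypeError:
    # A 0-d array is "unsized", so len() raises TypeError.
    len_failed = True


def densify_target(y):
    """Hypothetical helper: return y as a dense 1-D array, leaving X untouched."""
    if sp.issparse(y):
        y = y.toarray()
    return np.asarray(y).ravel()


y = densify_target(y_sparse)  # X stays sparse, y becomes a dense 1-D array
```

Keeping X sparse preserves the speed and memory benefits mentioned above, while the target is small enough that densifying it costs little.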

eddiebergman commented 3 years ago

@sebhrusen It's probably in the interest of autosklearn to handle sparse y correctly in this case. I'll have a look into it.

sebhrusen commented 3 years ago

@eddiebergman Sure, just mentioning that we have a workaround on our side for now that also seems to work for other frameworks. Thanks for fixing it on your side too.

eddiebergman commented 3 years ago

Hi @sebhrusen,

Just letting you know the fix should be in the next release. I also tracked down the problem a little more and wrote a brief synopsis, in case it helps identify the problem for other libraries.