modAL-python / modAL

A modular active learning framework for Python
https://modAL-python.github.io/
MIT License

Input data with different lengths / filled with NAs #58

Open nightscape opened 5 years ago

nightscape commented 5 years ago

I'm trying to use modAL in combination with tslearn to classify time series of different lengths. tslearn supports variable-length time series by padding the shorter ones with NaNs, but modAL calls

check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True)

without setting force_all_finite = 'allow-nan'. Is there a reason for not allowing NAs, or did this use case just not come up before?
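For illustration, here is a minimal sketch of the difference, using toy NaN-padded arrays standing in for the tslearn output and assuming a scikit-learn version whose check_X_y accepts force_all_finite:

import numpy as np
from sklearn.utils import check_X_y

# Two time series padded to equal length with NaNs, standing in for tslearn output.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, np.nan]])
y = np.array([0, 1])

# The call modAL currently makes rejects the padded data:
#   check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True)
#   -> ValueError: Input contains NaN, ...

# The same call with 'allow-nan' lets the padded data through:
check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True,
          multi_output=True, force_all_finite='allow-nan')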

Thanks a lot!

zaksamalik commented 5 years ago

A relatively straightforward update might be to add force_all_finite: bool or str = True as a parameter to BaseLearner's init and then pass self.force_all_finite to check_X_y in each function where it is called (_add_training_data, _fit_on_new and fit).
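A minimal sketch of what that could look like (the names follow the comment above; the real BaseLearner takes more arguments, so this is illustrative only):

from typing import Union

from sklearn.utils import check_X_y


class BaseLearner:
    def __init__(self, estimator, force_all_finite: Union[bool, str] = True):
        # 'allow-nan' lets check_X_y pass NaN-padded data through to the estimator.
        self.estimator = estimator
        self.force_all_finite = force_all_finite

    def _add_training_data(self, X, y):
        # Reuse the stored flag in every validation call.
        check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True,
                  multi_output=True, force_all_finite=self.force_all_finite)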

cosmic-cortex commented 5 years ago

Thanks for the observation and sorry for the late answer! I have just reached this task in my backlog :)

This case hasn't come up before. I don't see any reason not to allow NaNs, so we can just set force_all_finite = 'allow-nan' in every call of check_X_y. I like @zaksamalik's solution, so I'll add it sometime this week.

nightscape commented 5 years ago

Cool, thanks a lot!

cosmic-cortex commented 5 years ago

I have fixed the problem and released a new version that includes the fix. Let me know if you run into any issues!

ciberger commented 4 years ago

Hi, it seems the issue is still present in Ranked batch-mode sampling.

Reprex (mostly from the Ranked batch-mode sampling documentation)

import numpy as np
import xgboost as xgb
from functools import partial
from sklearn.datasets import load_iris
from modAL.batch import uncertainty_batch_sampling
from modAL.models import ActiveLearner

iris = load_iris()
X_raw = iris['data']
y_raw = iris['target']

# Isolate our examples for our labeled dataset.
n_labeled_examples = X_raw.shape[0]
training_indices = np.random.randint(low=0, high=n_labeled_examples, size=3)

X_train = X_raw[training_indices]
y_train = y_raw[training_indices]

# Isolate the non-training examples we'll be querying.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)

# Set one entry in the pool to np.nan.
X_pool[0][0] = np.nan

# Pre-set our batch sampling to retrieve 3 samples at a time.
BATCH_SIZE = 3
preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)

# Specify our active learning model.
learner = ActiveLearner(
    estimator=xgb.XGBClassifier(),
    X_training=X_train,
    y_training=y_train,
    query_strategy=preset_batch,
    force_all_finite=False
)

query_index, query_instance = learner.query(X_pool)

Error message

```python
ValueError                                Traceback (most recent call last)
 in
     40 )
     41
---> 42 query_index, query_instance = learner.query(X_pool)

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/models/base.py in query(self, *query_args, **query_kwargs)
    201             labelled upon query synthesis.
--> 203         query_result = self.query_strategy(self, *query_args, **query_kwargs)
    204         return query_result
    205

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in uncertainty_batch_sampling(classifier, X, n_instances, metric, n_jobs, **uncertainty_measure_kwargs)
    208     uncertainty = classifier_uncertainty(classifier, X, **uncertainty_measure_kwargs)
    209     query_indices = ranked_batch(classifier, unlabeled=X, uncertainty_scores=uncertainty,
--> 210                                  n_instances=n_instances, metric=metric, n_jobs=n_jobs)
    211     return query_indices, X[query_indices]

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in ranked_batch(classifier, unlabeled, uncertainty_scores, n_instances, metric, n_jobs)
    161         instance_index, instance, mask = select_instance(X_training=labeled, X_pool=unlabeled,
    162                                                          X_uncertainty=uncertainty_scores, mask=mask,
--> 163                                                          metric=metric, n_jobs=n_jobs)
    164
    165         # Add our instance we've considered for labeling to our labeled set. Although we don't

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in select_instance(X_training, X_pool, X_uncertainty, mask, metric, n_jobs)
     97         _, distance_scores = pairwise_distances_argmin_min(X_pool_masked.reshape(n_unlabeled, -1),
     98                                                            X_training.reshape(n_labeled_records, -1),
---> 99                                                            metric=metric)
    100     else:
    101         distance_scores = pairwise_distances(X_pool_masked.reshape(n_unlabeled, -1),

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances_argmin_min(X, Y, axis, metric, metric_kwargs)
    573     sklearn.metrics.pairwise_distances_argmin
--> 575     X, Y = check_pairwise_arrays(X, Y)
    576
    577     if metric_kwargs is None:

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    139         X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
    140                         copy=copy, force_all_finite=force_all_finite,
--> 141                         estimator=estimator)
    142         Y = check_array(Y, accept_sparse=accept_sparse, dtype=dtype,
    143                         copy=copy, force_all_finite=force_all_finite,

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    576         if force_all_finite:
    577             _assert_all_finite(array,
--> 578                                allow_nan=force_all_finite == 'allow-nan')
    579
    580     if ensure_min_samples > 0:

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     58             msg_err.format
     59             (type_err,
---> 60              msg_dtype if msg_dtype is not None else X.dtype)
     61             )
     62     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
```
cosmic-cortex commented 4 years ago

Hi!

This seems like a scikit-learn issue :( The function pairwise_distances_argmin_min is called, and it throws the error upon encountering the NaN. Unfortunately, pairwise_distances_argmin_min does not expose a way to allow NaNs, so there is no way to control this from modAL.

Similarly, if you set force_all_finite=False but use an estimator which doesn't support missing values (like the ones in scikit-learn), it won't work, even though modAL itself allows you to use data with NaNs.

Do you have any suggestions on how to solve this? At the moment I don't see a proper solution, but that doesn't mean there isn't one. (I don't want to silently remove NaNs internally before passing the data to the external functions, because this would remain hidden from the user and could cause unintended consequences.)

ciberger commented 4 years ago

Hi! As you correctly mentioned, this will only work for models that can handle missing values, such as newer boosting libraries (e.g. xgboost).

Alternatively, the nan_euclidean_distances function could solve the issue, at the expense of limiting the distance metric to Euclidean. Thoughts?
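For reference, a quick sketch of nan_euclidean_distances (available in scikit-learn >= 0.22) on made-up toy data with a missing value; it skips NaN coordinates and rescales the remaining ones, so rows containing NaNs still get finite distances:

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X_pool = np.array([[np.nan, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
X_train = np.array([[1.0, 2.0, 3.0]])

# Distances from each pool row to each training row, NaNs ignored.
print(nan_euclidean_distances(X_pool, X_train))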

cosmic-cortex commented 4 years ago

That is a good idea! I am going to take a shot at this. I can't promise to do it ASAP since I am extremely busy with other work, but I'll try to do it this month.