nightscape opened this issue 5 years ago
Seems like a relatively straightforward update might be to add force_all_finite: bool or str = True as a parameter to BaseLearner's __init__ and then pass self.force_all_finite to check_X_y in each function where it is called (_add_training_data, _fit_on_new and fit).
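Roughly, the change could look something like this (just a simplified sketch, not the actual modAL source; I'm assuming check_X_y comes from sklearn.utils and leaving out the other constructor arguments):
from sklearn.utils import check_X_y

class BaseLearner:
    def __init__(self, estimator, force_all_finite=True):
        self.estimator = estimator
        # store the flag so every validation call can reuse it
        self.force_all_finite = force_all_finite

    def _add_training_data(self, X, y):
        # use the stored flag instead of check_X_y's hard-coded default
        X, y = check_X_y(X, y, force_all_finite=self.force_all_finite)
        ...

    def fit(self, X, y):
        X, y = check_X_y(X, y, force_all_finite=self.force_all_finite)
        self.estimator.fit(X, y)
        return self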
Thanks for the observation and sorry for the late answer! I have just reached this task in my backlog :)
This case hasn't come up before. I don't see any reason to not allow NaNs, so we can just set force_all_finite = 'allow-nan' in every call of check_X_y. I like the solution of @zaksamalik, so I'll add this sometime during this week.
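For reference, 'allow-nan' only lifts the NaN check and infinities are still rejected; a quick illustration with plain scikit-learn (assuming a version where check_X_y accepts this option):
import numpy as np
from sklearn.utils import check_X_y

X = np.array([[1.0, np.nan], [3.0, 4.0]])
y = np.array([0, 1])

# with the default force_all_finite=True this raises "Input contains NaN" ...
# check_X_y(X, y)
# ... while 'allow-nan' accepts the NaN (np.inf would still be rejected)
X_checked, y_checked = check_X_y(X, y, force_all_finite='allow-nan')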
Cool, thanks a lot!
I have fixed the problem and released a new version that includes this fix. Let me know if you run into any issues!
Hi, it seems the issue is still present in Ranked batch-mode sampling:
import numpy as np
import xgboost as xgb
from functools import partial
from sklearn.datasets import load_iris
from modAL.batch import uncertainty_batch_sampling
from modAL.models import ActiveLearner
iris = load_iris()
X_raw = iris['data']
y_raw = iris['target']
# Isolate our examples for our labeled dataset.
n_labeled_examples = X_raw.shape[0]
training_indices = np.random.randint(low=0, high=n_labeled_examples, size=3)
X_train = X_raw[training_indices]
y_train = y_raw[training_indices]
# Isolate the non-training examples we'll be querying.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)
# Set one entry of the pool to np.nan
X_pool[0][0] = np.nan
# Pre-set our batch sampling to retrieve 3 samples at a time.
BATCH_SIZE = 3
preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)
# Specify our active learning model.
learner = ActiveLearner(
    estimator=xgb.XGBClassifier(),
    X_training=X_train,
    y_training=y_train,
    query_strategy=preset_batch,
    force_all_finite=False
)
query_index, query_instance = learner.query(X_pool)
Hi!
This seems like a scikit-learn issue :( The function pairwise_distances_argmin_min is called, which throws the error upon encountering the NaN. Unfortunately, there is no way to control this in the function. Similarly, if you set force_all_finite=False but use an estimator which doesn't support this (like the ones in scikit-learn), it won't work, even though modAL allows you to use data with NaNs.
Do you have any suggestions how to solve this? At the moment, I don't see a proper solution, but this doesn't mean that there isn't one. (I don't want to internally remove NaNs and pass them to the external functions, because this would remain hidden from the user, possibly causing unintended consequences.)
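To illustrate, here is the sklearn call in isolation (not modAL code, just what ranked batch-mode sampling relies on under the hood):
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

X = np.array([[0.0, np.nan]])
Y = np.array([[0.0, 0.0], [1.0, 1.0]])

try:
    pairwise_distances_argmin_min(X, Y)
except ValueError as e:
    # the input validation inside the function rejects the NaN
    print(e)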
Hi! As you correctly mentioned, this should only work for models that can handle missing values, such as novel boosting methods (e.g. xgboost).
Alternatively, the nan_euclidean_distances function could serve to solve the issue at the expense of limiting the distance metric to euclidean. Thoughts?
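Something along these lines, as a rough sketch (assuming sklearn >= 0.22, where nan_euclidean_distances is available):
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[0.0, np.nan], [1.0, 1.0]])
Y = np.array([[0.0, 0.0], [5.0, 5.0]])

# NaN coordinates are skipped (and the result rescaled) instead of raising an error
D = nan_euclidean_distances(X, Y)
argmin = D.argmin(axis=1)                 # closest row of Y for each row of X
dmin = D[np.arange(len(X)), argmin]       # the corresponding minimal distances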
That is a good idea! I am going to take a shot at this. I don't promise to do this ASAP since I am extremely busy with other work, but I'll try to do it this month.
I'm trying to use modAL in combination with tslearn to classify time series of different lengths. tslearn supports variable-length time series by filling the shorter time series up with NaNs, but modAL calls check_X_y without setting force_all_finite = 'allow-nan'. Is there a reason for not allowing NaNs, or did this use case just not come up before? Thanks a lot!
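For context, this is roughly what the data looks like (a minimal sketch; I'm assuming check_X_y from sklearn.utils is what modAL uses for validation):
import numpy as np
from tslearn.utils import to_time_series_dataset
from sklearn.utils import check_X_y

# tslearn pads the shorter series with NaN, giving shape (n_series, max_len, 1)
X_ts = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5]])
X = X_ts.reshape(len(X_ts), -1)        # flatten to 2D for a tabular estimator
y = np.array([0, 1, 1])

# check_X_y(X, y) fails with "Input contains NaN" under the defaults,
# while the 'allow-nan' option accepts the padded values:
X_checked, y_checked = check_X_y(X, y, force_all_finite='allow-nan')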