lmassaron / deep_learning_for_tabular_data

A presentation of core concepts and a data generator that makes it easier to use tabular data with TensorFlow and Keras

error "Passing list-likes to .loc or [] with any missing labels is no longer supported." #3

Closed · xuzhang5788 closed this issue 3 years ago

xuzhang5788 commented 3 years ago

I used my own data to run your code. My model is a regression. I followed your code and it works fine for CatBoost, but for the deep learning part I got the following error message:

```
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
     52                           shuffle=True)
     53
---> 54 history = model.fit(train_batch,
     55     # validation_data=(tb.transform(X.iloc[test_idx]), y[test_idx]),
     56     validation_data=test_batch,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1048         training_utils.RespectCompiledTrainableState(self):
   1049           # Creates a `tf.data.Dataset` and handles batch and epoch iteration.
-> 1050           data_handler = data_adapter.DataHandler(
   1051               x=x,
   1052               y=y,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
   1098
   1099         adapter_cls = select_data_adapter(x, y)
-> 1100         self._adapter = adapter_cls(
   1101             x,
   1102             y,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, shuffle, workers, use_multiprocessing, max_queue_size, model, **kwargs)
    900         self._keras_sequence = x
    901         self._enqueuer = None
--> 902         super(KerasSequenceAdapter, self).__init__(
    903             x,
    904             shuffle=False,  # Shuffle is handed in the _make_callable override.

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, workers, use_multiprocessing, max_queue_size, model, **kwargs)
    777         # Since we have to know the dtype of the python generator when we build the
    778         # dataset, we have to look at a batch to infer the structure.
--> 779         peek, x = self._peek_and_restore(x)
    780         peek = self._standardize_batch(peek)
    781         peek = _process_tensorlike(peek)

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in _peek_and_restore(x)
    911     @staticmethod
    912     def _peek_and_restore(x):
--> 913         return x[0], x
    914
    915     def _handle_multiprocessing(self, x, workers, use_multiprocessing,

~/projects/ifp85/tabular.py in __getitem__(self, index)
    348     def __getitem__(self, index):
    349         indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
--> 350         samples, labels = self.__data_generation(indexes)
    351         return samples, labels
    352

~/projects/ifp85/tabular.py in __data_generation(self, selection)
    342                 return dct, self.y[selection]
    343             else:
--> 344                 return self.tbt.transform(self.X.iloc[selection, :]), self.y[selection]
    345         else:
    346             return self.X.iloc[selection, :], self.y[selection]

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    904             return self._get_values(key)
    905
--> 906         return self._get_with(key)
    907
    908     def _get_with(self, key):

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/series.py in _get_with(self, key)
    939         # (i.e. self.iloc) or label-based (i.e. self.loc)
    940         if not self.index._should_fallback_to_positional():
--> 941             return self.loc[key]
    942         else:
    943             return self.iloc[key]

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    877
    878         maybe_callable = com.apply_if_callable(key, self.obj)
--> 879         return self._getitem_axis(maybe_callable, axis=axis)
    880
    881     def _is_scalar_access(self, key: Tuple):

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1097                 raise ValueError("Cannot index with multidimensional key")
   1098
-> 1099             return self._getitem_iterable(key, axis=axis)
   1100
   1101         # nested tuple slicing

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1035
   1036         # A collection of keys
-> 1037         keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1038         return self.obj._reindex_with_indexers(
   1039             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1252             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1253
-> 1254         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1255         return keyarr, indexer
   1256

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1313
   1314             with option_context("display.max_seq_items", 10, "display.width", 80):
-> 1315                 raise KeyError(
   1316                     "Passing list-likes to .loc or [] with any missing labels "
   1317                     "is no longer supported. "

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([  963, 26089, 37285, 32796, 21419,
            ...
             7514, 35430,  5619,  9022, 40319],
           dtype='int64', length=253). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
```

I couldn't figure out how to solve this. By the way, I don't fully understand the meaning of the variables `sizes` and `categorical_levels`:

```python
tb = TabularTransformer(numeric=numeric_variables, ordinal=[],
                        lowcat=[], highcat=categorical_variables)
tb.fit(X.iloc[train_idx])
sizes = tb.shape(X.iloc[train_idx])
categorical_levels = dict(zip(categorical_variables, sizes[1:]))

print(f"Input array sizes: {sizes}")
print(f"Categorical levels: {categorical_levels}\n")
```

Thank you very much!
lmassaron commented 3 years ago

The problem seems to be here: `self.X.iloc[selection, :], self.y[selection]`. Probably X or y are not as expected. Try passing y as a NumPy array. As for `sizes` and `categorical_levels`, you need them to build your DNN correctly: `sizes` provides the dimensionality of the numeric array produced by the TabularTransformer (you need it for the input layer for the numerical variables), and `categorical_levels` is necessary for correctly sizing the embedding layers of the DNN, since embeddings need that information or they won't work properly.
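A minimal sketch of both points, assuming a functional-API Keras model; the input names, layer sizes, and wiring here are illustrative, not the repository's exact code:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

y = np.asarray(y)  # pass the target as a NumPy array, not a list or Series

# sizes[0] is the width of the numeric array produced by the transformer
numeric_input = layers.Input(shape=(sizes[0],), name="numeric")

# one input + embedding per high-cardinality categorical variable
cat_inputs, embedded = [], []
for var, levels in categorical_levels.items():
    inp = layers.Input(shape=(1,), name=var)
    # input_dim must cover every level, including the reserved
    # placeholders for missing and unknown values
    emb = layers.Embedding(input_dim=levels,
                           output_dim=min(50, (levels + 1) // 2))(inp)
    cat_inputs.append(inp)
    embedded.append(layers.Flatten()(emb))

merged = layers.Concatenate()([numeric_input] + embedded)
hidden = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1)(hidden)  # single unit for regression

model = tf.keras.Model(inputs=[numeric_input] + cat_inputs, outputs=output)
model.compile(optimizer="adam", loss="mse")
```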

Just let me know if passing y as a NumPy array works for you. After confirmation, I will work out a more robust pipeline for it.

xuzhang5788 commented 3 years ago

Thank you. It works now. Could you please explain why the categorical_levels differ for each fold in your example but are the same in my dataset?
In addition, my categorical column1 has 19 unique values, so why is its categorical_levels 19 + 2 = 21? Thank you!

lmassaron commented 3 years ago

Perfect. I've already pushed some changes to the repository to handle the target variable when it is passed as a list or a pandas Series instead of a NumPy array.

As for categorical_levels, during cross-validation the encoding is fitted on the spot, on each training fold. Therefore, for sampling reasons, some classes may be missing from a given training fold, and you may get a different level count compared to other folds where those classes are present.

In real-world applications, cross-validation is usually meant for testing purposes. Therefore I do not encode on the full data available, but only on the data used for training. That better simulates the real-world testing the model will have to undergo later in production.
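For illustration, reusing the snippet from your comment above (so `TabularTransformer` and the variable names come from there), fitting the transformer inside each fold is exactly what makes the level counts fold-dependent:

```python
from sklearn.model_selection import KFold

for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    # fit on the training fold only: no information leaks from the test fold
    tb = TabularTransformer(numeric=numeric_variables, ordinal=[],
                            lowcat=[], highcat=categorical_variables)
    tb.fit(X.iloc[train_idx])
    sizes = tb.shape(X.iloc[train_idx])
    categorical_levels = dict(zip(categorical_variables, sizes[1:]))
    # if a rare class is absent from this training fold,
    # its level count drops relative to the other folds
    print(categorical_levels)
```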

For the same reason, you find two extra classes that are reserved for (1) missing data and (2) unknown data. It may well happen that you have no missing data in train, but missing values are present in the test data. Moreover, it frequently happens that you find new levels in test, so you also have to take that into account with a special encoding.

Basically, these are just placeholders: if you have no missing values in train and nothing unknown (which, by definition, you cannot have in train), those placeholders will just keep their random initializations and no meaningful weights will be learned for them during training. Yet their presence allows the model to keep working instead of breaking down on unexpected input.
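A toy sketch of that bookkeeping (illustrative only, not the actual `TabularTransformer` internals): two codes are reserved up front, so a column with 19 training levels yields an embedding input dimension of 19 + 2 = 21.

```python
class SafeLabelEncoder:
    """Maps levels to integer codes, reserving 0 for missing, 1 for unknown."""
    MISSING, UNKNOWN = 0, 1

    def fit(self, values):
        # codes 2..n+1 for the n distinct levels seen in training
        seen = sorted({v for v in values if v == v})  # v == v filters out NaN
        self.mapping = {v: i + 2 for i, v in enumerate(seen)}
        return self

    def transform(self, values):
        return [self.MISSING if v != v  # NaN is the only value != itself
                else self.mapping.get(v, self.UNKNOWN)
                for v in values]

enc = SafeLabelEncoder().fit(["a", "b", "c"])     # 3 levels in train
print(enc.transform(["b", "new", float("nan")]))  # -> [3, 1, 0]
# embedding input_dim = 3 levels + 2 placeholders = 5
```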