kathrinse / be_great

A novel approach for synthesizing tabular data using pretrained large language models
MIT License
276 stars · 46 forks

TypeError: '<' not supported between instances of 'list' and 'int' #12

Closed · seygodin closed this 1 year ago

seygodin commented 1 year ago

Hi, I tried to use your library for my research, but running the sample code you provided raises an error. Could you let me know what the matter is?

My Python version is 3.9.16.

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame

model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
model.fit(data)
synthetic_data = model.sample(n_samples=100)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 7
      4 data = fetch_california_housing(as_frame=True).frame
      6 model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
----> 7 model.fit(data)
      8 synthetic_data = model.sample(n_samples=100)

File /usr/local/lib/python3.9/dist-packages/be_great/great.py:114, in GReaT.fit(self, data, column_names, conditional_col, resume_from_checkpoint)
    112 # Start training
    113 logging.info("Start training...")
--> 114 great_trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    115 return great_trainer

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1538 self.model_wrapped = self.model
   1540 inner_training_loop = find_executable_batch_size(
   1541     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1542 )
-> 1543 return inner_training_loop(
   1544     args=args,
   1545     resume_from_checkpoint=resume_from_checkpoint,
   1546     trial=trial,
   1547     ignore_keys_for_eval=ignore_keys_for_eval,
   1548 )

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1765, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1762     self._load_rng_state(resume_from_checkpoint)
   1764 step = -1
-> 1765 for step, inputs in enumerate(epoch_iterator):
   1766
   1767     # Skip past any already trained steps if resuming training
   1768     if steps_trained_in_current_epoch > 0:
   1769         steps_trained_in_current_epoch -= 1

File /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:628, in _BaseDataLoaderIter.__next__(self)
    625 if self._sampler_iter is None:
    626     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    627     self._reset()  # type: ignore[call-arg]
--> 628 data = self._next_data()
    629 self._num_yielded += 1
    630 if self._dataset_kind == _DatasetKind.Iterable and \
    631         self._IterableDataset_len_called is not None and \
    632         self._num_yielded > self._IterableDataset_len_called:

File /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:671, in _SingleProcessDataLoaderIter._next_data(self)
    669 def _next_data(self):
    670     index = self._next_index()  # may raise StopIteration
--> 671     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    672     if self._pin_memory:
    673         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:56, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     54 if self.auto_collation:
     55     if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 56         data = self.dataset.__getitems__(possibly_batched_index)
     57     else:
     58         data = [self.dataset[idx] for idx in possibly_batched_index]

File /usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py:2662, in Dataset.__getitems__(self, keys)
   2660 def __getitems__(self, keys: List) -> List:
   2661     """Can be used to get a batch using a list of integers indices."""
-> 2662     batch = self.__getitem__(keys)
   2663     n_examples = len(batch[next(iter(batch))])
   2664     return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

File /usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py:2658, in Dataset.__getitem__(self, key)
   2656 def __getitem__(self, key):  # noqa: F811
   2657     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2658     return self._getitem(key)

File /usr/local/lib/python3.9/dist-packages/be_great/great_dataset.py:31, in GReaTDataset._getitem(self, key, decoded, **kwargs)
     26 """ Get Item from Tabular Data
     27
     28 Get one instance of the tabular data, permuted, converted to text and tokenized.
     29 """
     30 # If int, what else?
---> 31 row = self._data.fast_slice(key, 1)
     33 shuffle_idx = list(range(row.num_columns))
     34 random.shuffle(shuffle_idx)

File /usr/local/lib/python3.9/dist-packages/datasets/table.py:135, in IndexedTableMixin.fast_slice(self, offset, length)
    127 def fast_slice(self, offset=0, length=None) -> pa.Table:
    128     """
    129     Slice the Table using interpolation search.
    130     The behavior is the same as `pyarrow.Table.slice` but it's significantly faster.
   (...)
    133     The batches to keep are then concatenated to form the sliced Table.
    134     """
--> 135     if offset < 0:
    136         raise IndexError("Offset must be non-negative")
    137     elif offset >= self._offsets[-1] or (length is not None and length <= 0):

TypeError: '<' not supported between instances of 'list' and 'int'
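
For what it's worth, the last few frames point at the cause: torch's _MapDatasetFetcher now prefers Dataset.__getitems__, which hands the whole list of batch indices down through Dataset.__getitem__ into GReaTDataset._getitem, and fast_slice(key, 1) then compares that list against an int (offset < 0). Below is a minimal sketch of a local workaround if you want to stay on a newer datasets release; the subclass name is made up here, and this only illustrates the int-vs-list mismatch, it is not the library's official fix:

from typing import List

from be_great.great_dataset import GReaTDataset  # class shown in the traceback

class PatchedGReaTDataset(GReaTDataset):
    """Hypothetical subclass that unpacks batched indices one by one."""

    def __getitems__(self, keys: List[int]) -> List:
        # Newer `datasets` releases route the whole batch of indices through
        # __getitems__ (see _MapDatasetFetcher.fetch above); iterating here
        # guarantees that _getitem -> fast_slice() receives a single int offset.
        return [self._getitem(key) for key in keys]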
AlessioGFerraioli commented 1 year ago

Hi, I'm experiencing the exact same issue. I tried several versions of Python (3.9, 3.10, and 3.11), but I always get the same TypeError. Do you have any hints on what the problem might be?

unnir commented 1 year ago

Hi All,

This is caused by an update to the Hugging Face datasets package. To avoid the error, please pin it to an older release:

pip install datasets==2.5.2
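
If you run this pin inside a notebook, restart the kernel afterwards; a quick sanity check (plain datasets metadata, nothing be_great-specific) confirms which release is actually imported:

import datasets
print(datasets.__version__)  # expect 2.5.2 after the pin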
unnir commented 1 year ago

A new release of be_great solves the issue; please upgrade the package:

pip install -U be_great
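
For anyone landing here later: re-running the snippet from the top of this issue after upgrading is a reasonable smoke test (unchanged apart from the final print, which is just for inspection):

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame

model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
model.fit(data)  # previously raised the TypeError here
synthetic_data = model.sample(n_samples=100)
print(synthetic_data.head())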

seygodin commented 1 year ago

Good. Now it works.