Closed SOVIETIC-BOSS88 closed 1 year ago
Hi,
Thank you for reporting the issue! My apologies for the issues you encountered with our framework.
Token indices sequence length is longer than the specified maximum sequence length for this model (1614 > 1024). Running this sequence through the model will result in indexing errors
Yes, many of the popular Transformer-based models are trained with a maximum sequence length of 1024.
Is there a way to pass the seq = seq[:512] parameter?
Truncating the sequence this way would cause a different problem: GReaT would no longer be able to synthesize data for all features.
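Before picking a workaround, it may help to measure how long one encoded row actually is. A minimal sketch, assuming GReaT's textual "column is value" row encoding (the exact formatting may differ slightly):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# GReaT encodes each row roughly as "col1 is v1, col2 is v2, ...";
# with 107 features this can easily exceed GPT-2's 1024-position limit.
row = df.iloc[0]  # `df` is your training DataFrame (assumption)
row_text = ", ".join(f"{col} is {val}" for col, val in row.items())
print(len(tok(row_text)["input_ids"]))  # values above 1024 trigger the warning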
In my opinion, there are two possible ways to deal with this problem:
(1) You can change the n_positions hyperparameter (the maximum number of token positions), see https://huggingface.co/transformers/v2.10.0/model_doc/gpt2.html
However, in this case you will have a mismatch in some of the weights, which is okay because you still need to fine-tune the model anyway.
In order to change n_positions, you need to adjust the GReaT code. Please add the following hyperparameters to great.py (at line 63); a possible location of the file is:
/usr/local/lib/python3.9/dist-packages/be_great/great.py
self.model = AutoModelForCausalLM.from_pretrained(self.llm, n_positions=2048, ignore_mismatched_sizes=True)
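For intuition, the same effect can be reproduced outside GReaT. A minimal sketch (the shape check at the end is only a sanity check; attribute names follow the Hugging Face GPT-2 implementation):
from transformers import AutoModelForCausalLM

# Overriding n_positions re-initializes the position-embedding matrix (wpe)
# at the new size; this is the weight mismatch mentioned above, and it is
# why the model must be fine-tuned afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", n_positions=2048, ignore_mismatched_sizes=True
)
print(model.config.n_positions)            # 2048
print(model.transformer.wpe.weight.shape)  # torch.Size([2048, 768])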
After the fine-tuning step, you need to pass the max_length parameter to the sample function:
synthetic_data = model.sample(n_samples=100, max_length=1700)
Disclaimer: we haven't tested this, so we cannot guarantee that it will work.
(2) You can also adjust your dataset to make the input sequence shorter after the tokenization step, for example as sketched below.
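A minimal sketch of two such adjustments (the column name is hypothetical, and we haven't measured the exact savings for your data):
# 1. Abbreviate long column names: every name is spelled out in every
#    encoded row, so shorter names directly cut the token count.
df = df.rename(columns={"median_house_value": "mhv"})  # hypothetical name

# 2. Round floats: fewer digits means fewer tokens per value.
df = df.round(3)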
Please let me know if this helps! We will also adjust our framework for long sequences in the next release.
Thank you very much for the suggestions, and apologies for my late reply. Yesterday I tried both suggestions: I modified the great.py source file and also made the input sequences shorter. I additionally used df.astype('float16') and reduced the batch size to 4; otherwise I ran into CUDA out-of-memory errors like the following one:
OutOfMemoryError: CUDA out of memory. Tried to allocate 342.00 MiB (GPU 0; 14.76 GiB total capacity; 13.33 GiB already allocated; 113.75 MiB free; 13.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
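Roughly, the setup that let training start looks like this (a sketch; the llm and epochs values are illustrative, and max_split_size_mb follows the error message's suggestion with an arbitrary value of 128):
import os

# Must be set before CUDA memory is first allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from be_great import GReaT

df = df.astype("float16")   # lower precision also shortens the textual encoding
model = GReaT(llm="distilgpt2", batch_size=4, epochs=20)  # reduced batch size
model.fit(df)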
With these settings I was able to start training the model. I have not completed the training yet, but tomorrow I will be able to confirm 100%.
Today I forked the library in order to downgrade the packages and test whether it is possible to use it with Python 3.7. So far I have not been able to. Please ignore the erroneous pull request.
Cheers.
Great!
It's really interesting to hear about your results. I would appreciate an update!
Some updates regarding the model training.
1) I tried to produce samples using the saved trained model and the modified forked library. The model loaded correctly, but when I called the sample function it got stuck; after 19 minutes I stopped the execution. Here is how I called the method:
synthetic_data = model.sample(n_samples=20)
2) For this reason I started experimenting and trained 4 different models with the 2 versions of the library (in 2 different environments). I used the California housing dataset and a reduced version of the previous personal dataset (209 rows × 20 columns).
I instantiated the model the following way:
model = GReaT(llm='distilgpt2', batch_size=32, epochs=20)
2.1) The California housing dataset: it trained and produced samples without any issues, using both the original library and the modified one.
2.2) Reduced personal dataset.
2.2.a) Trained using the modified library: it does train with no errors, but as before gets stuck during inference.
2.2.b) Trained using the original library: it does train without errors, but at inference time I got the following error. I am puzzled, since the dataset I am using is only 209 rows × 20 columns. Here is the full trace:
IndexError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 synthetic_data = model.sample(n_samples=20)

File ~/anaconda3/envs/env/lib/python3.9/site-packages/be_great/great.py:162, in GReaT.sample(self, n_samples, start_col, start_col_dist, temperature, k, max_length, device)
    160 # Convert tokens back to tabular data
    161 text_data = _convert_tokens_to_text(tokens, self.tokenizer)
--> 162 df_gen = _convert_text_to_tabular_data(text_data, df_gen)
    164 # Remove rows with flawed numerical values
    165 for i_num_cols in self.num_cols:

File ~/anaconda3/envs/env/lib/python3.9/site-packages/be_great/great_utils.py:91, in _convert_text_to_tabular_data(text, df_gen)
     89 values = f.strip().split(" is ")
     90 if values[0] in columns and not td[values[0]]:
---> 91 td[values[0]] = [values[1]]
     93 df_gen = pd.concat([df_gen, pd.DataFrame(td)], ignore_index=True, axis=0)
     94 return df_gen

IndexError: list index out of range
Hey, both issues can happen if the model was not fine-tuned long enough.
I updated the code recently to handle the index error; you can pull the newest version to fix this.
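For reference, the guard is presumably something along these lines (a sketch, not the verbatim upstream fix):
# In _convert_text_to_tabular_data: skip malformed "<column> is <value>"
# fragments instead of indexing values[1] unconditionally.
values = f.strip().split(" is ")
if len(values) >= 2 and values[0] in columns and not td[values[0]]:
    td[values[0]] = [values[1]]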
But this does not fix the underlying issue: it seems to me that the model has not yet learned to generate all of the 20 columns correctly. In the sample method there are some sanity checks that remove rows with corrupted or missing values (this usually affects significantly less than 5% of the generated data). But if the model is not able to generate all columns correctly, it can result in an endless loop.
Maybe you can have a look at the textual output (text_data) to understand your problem further.
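One way to inspect it, assuming the model and tokenizer attributes used in great.py above (the prompt column "age" is a placeholder for one of your columns):
import torch

tokenizer, lm = model.tokenizer, model.model
inputs = tokenizer("age is", return_tensors="pt")
with torch.no_grad():
    out = lm.generate(**inputs, max_length=500, do_sample=True,
                      temperature=0.7, pad_token_id=tokenizer.eos_token_id)
# Check whether all expected "<column> is <value>" pairs appear in the text.
print(tokenizer.decode(out[0]))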
Hi, thank you for the update. I ran some experiments during these past weeks and wanted to be 100% sure before replying.
First, I started with the dataset composed of 20 columns and increased the number of training epochs from 50 up to 300, in 50-epoch steps. I am still getting the index error.
Second, I ran experiments where I kept the number of epochs constant but increased the number of features passed. I fine-tuned the model with 4 up to 9 features, using only 50 epochs, and the model produced samples that were plausible.
The problems started again when I increased the number to 10 features: I started with 50 epochs and went up to 300, in 50-epoch steps, but the index problem persisted. I tried the same experiments with another dataset of similar size, with no success either.
Will keep experimenting and will update you on my progress.
Hi, I am having the following problem when using the library on my dataset. The dataset has only 209 samples, but 107 features. The values in the set are floats and ints.
This is the call I am making:
model = GReaT(llm='gpt2', epochs=50, batch_size=32)
model.fit(df)
This is what I assume is the reason behind the error: Token indices sequence length is longer than the specified maximum sequence length for this model (1614 > 1024). Running this sequence through the model will result in indexing errors
From what I can gather it seems to be a Hugging Face issue. Is there a way to pass the seq = seq[:512] parameter?
Do you know a solution to this problem?
Any help would be much appreciated.
Here is the full trace: