Closed m6129 closed 5 months ago
The embeddings generated by small language models may not capture the semantic information effectively. Although it is possible to generate the corresponding .pt files with GPT-2, the quality of GPT-2 embeddings is relatively low and they may not help predictions much. If you're interested, you can give it a try. Otherwise, you have two options: either remove the --mix_embeds parameter from the script and strip the .pt-loading code from the dataloader, or run inference with LLaMA, using the preprocessed .pt files linked in the README.
Thanks. Do you perhaps have a ready-made archive with pre-generated embeddings for the ETTh1, ETTh2, ETTm1, ETTm2, weather, exchange_rate, and illness datasets?
Here are the .pt files for the relevant datasets, extracted with LLaMA-7B: [Google Drive] [Tsinghua Cloud]
Thanks. Too bad it only covers ETTh1 and weather.
You can easily obtain embeddings for the other datasets with preprocess.py; all you need is an RTX 3090.
I don't have an RTX 3090 or a comparable GPU, just Kaggle, and I assume other people will run into this problem too.
By the way, could you tell me what's wrong here?
Thanks
cuda:0
config.json: 100%|█████████████████████████████| 665/665 [00:00<00:00, 3.24MB/s]
model.safetensors: 100%|██████████████████████| 548M/548M [00:01<00:00, 293MB/s]
use linear as tokenizer and detokenizer
>>>>>>>start training : long_term_forecast_ETTh1_672_96_AutoTimes_Gpt2_ETTh1_sl672_ll576_tl96_lr0.0005_bt256_wd0_hd256_hl0_cosTrue_mixTrue_test_0>>>>>>>>>>>>>>>>>>>>>>>>>>
Traceback (most recent call last):
  File "/kaggle/working/AutoTimes/run.py", line 124, in <module>
    exp.train(setting)
  File "/kaggle/working/AutoTimes/exp/exp_long_term_forecasting.py", line 104, in train
    train_data, train_loader = self._get_data(flag='train')
  File "/kaggle/working/AutoTimes/exp/exp_long_term_forecasting.py", line 33, in _get_data
    data_set, data_loader = data_provider(self.args, flag)
  File "/kaggle/working/AutoTimes/data_provider/data_factory.py", line 32, in data_provider
    data_set = Data(
  File "/kaggle/working/AutoTimes/data_provider/data_loader.py", line 33, in __init__
    self.__read_data__()
  File "/kaggle/working/AutoTimes/data_provider/data_loader.py", line 39, in __read_data__
    df_raw = pd.read_csv(os.path.join(self.root_path,
  File "/opt/conda/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
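For reference, a UnicodeDecodeError on byte 0x80 from read_csv usually means the file at root_path/data_path is not a UTF-8 CSV at all but a binary file: 0x80 is the pickle protocol marker, and torch.save archives start with the zip magic bytes "PK". This can happen, for instance, if a .pt path is accidentally passed as the data file. A quick stdlib-only check, using a throwaway file in place of the real path:

```python
import os
import tempfile

def looks_like_utf8_csv(path, n=64):
    """Heuristic: True if the first n bytes of path decode as UTF-8 text."""
    with open(path, "rb") as f:
        head = f.read(n)
    # torch.save output is a zip archive (b'PK'); raw pickles start with b'\x80'.
    if head.startswith(b"PK") or head.startswith(b"\x80"):
        return False
    try:
        head.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Demo: a text header followed by binary bytes, mirroring the
# "invalid start byte" mid-file in the traceback above.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(b"date,HUFL,HULL,MUFL,MULL,LUFL,LULL,OT\n" + b"\x80\x04binary payload")
    fake = f.name

print(looks_like_utf8_csv(fake))  # → False: not a valid UTF-8 CSV
os.unlink(fake)
```

If the check returns False for your data file, re-download the CSV (or verify the --root_path/--data_path arguments point at the CSV, not at a .pt file).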
Hello, dear developers. Am I correct that the .pt files can only be obtained via the LLaMA model? Is it possible to get the gov datasets already converted to .pt format somewhere? Thanks
I tried to run it without LLaMA on Kaggle but failed miserably.