Closed m6129 closed 5 months ago
The embeddings generated by small language models may not capture the semantic information effectively. Although it is possible to generate the corresponding .pt files with GPT-2, the quality of GPT-2 embeddings is relatively low and they may not help predictions much. If you're interested, you can give it a try. Otherwise, you have two options: either remove the --mix_embeds parameter from the script and strip the .pt-loading code from the dataloader, or run inference with LLaMA, using the preprocessed .pt files linked in the README.
Thanks. Do you perhaps have a ready-made archive with pre-generated embeddings for the ETTh1, ETTh2, ETTm1, ETTm2, weather, exchange_rate, and illness datasets?
Here are the .pt files for the relevant datasets, extracted with LLaMA-7B: [Google Drive] [Tsinghua Cloud]
Thanks. Too bad it only covers ETTh1 and weather.
You can easily obtain embeddings for the other datasets with preprocess.py; all you need is an RTX 3090.
I don't have an RTX 3090 or a comparable GPU, just Kaggle, and I assume other people will run into this problem too.
By the way, could you tell me what's wrong here?
Thanks
cuda:0
config.json: 100%|█████████████████████████████| 665/665 [00:00<00:00, 3.24MB/s]
model.safetensors: 100%|██████████████████████| 548M/548M [00:01<00:00, 293MB/s]
use linear as tokenizer and detokenizer
>>>>>>>start training : long_term_forecast_ETTh1_672_96_AutoTimes_Gpt2_ETTh1_sl672_ll576_tl96_lr0.0005_bt256_wd0_hd256_hl0_cosTrue_mixTrue_test_0>>>>>>>>>>>>>>>>>>>>>>>>>>
Traceback (most recent call last):
  File "/kaggle/working/AutoTimes/run.py", line 124, in <module>
    exp.train(setting)
  File "/kaggle/working/AutoTimes/exp/exp_long_term_forecasting.py", line 104, in train
    train_data, train_loader = self._get_data(flag='train')
  File "/kaggle/working/AutoTimes/exp/exp_long_term_forecasting.py", line 33, in _get_data
    data_set, data_loader = data_provider(self.args, flag)
  File "/kaggle/working/AutoTimes/data_provider/data_factory.py", line 32, in data_provider
    data_set = Data(
  File "/kaggle/working/AutoTimes/data_provider/data_loader.py", line 33, in __init__
    self.__read_data__()
  File "/kaggle/working/AutoTimes/data_provider/data_loader.py", line 39, in __read_data__
    df_raw = pd.read_csv(os.path.join(self.root_path,
  File "/opt/conda/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
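For reference, a UnicodeDecodeError on byte 0x80 from read_csv usually means the file at root_path/data_path is not a UTF-8 CSV at all but a binary file: 0x80 is the pickle protocol marker, and torch.save archives start with the zip magic bytes "PK". This can happen, for instance, if a .pt path is accidentally passed as the data file. A quick stdlib-only check, using a throwaway file in place of the real path:

```python
import os
import tempfile

def looks_like_utf8_csv(path, n=64):
    """Heuristic: True if the first n bytes of path decode as UTF-8 text."""
    with open(path, "rb") as f:
        head = f.read(n)
    # torch.save output is a zip archive (b'PK'); raw pickles start with b'\x80'.
    if head.startswith(b"PK") or head.startswith(b"\x80"):
        return False
    try:
        head.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Demo: a text header followed by binary bytes, mirroring the
# "invalid start byte" mid-file in the traceback above.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(b"date,HUFL,HULL,MUFL,MULL,LUFL,LULL,OT\n" + b"\x80\x04binary payload")
    fake = f.name

print(looks_like_utf8_csv(fake))  # → False: not a valid UTF-8 CSV
os.unlink(fake)
```

If the check returns False for your data file, re-download the CSV (or verify the --root_path/--data_path arguments point at the CSV, not at a .pt file).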
Hello, dear developers. Am I correct that the .pt files can only be obtained via the LLaMA model? Is it possible to get the gov datasets already converted to .pt format somewhere? Thanks
I tried to run it without LLaMA on Kaggle but failed miserably.