ngruver / llmtime

https://arxiv.org/abs/2310.07820
MIT License

Fine-tuning with tabular data #2

Closed: sudokhan112 closed this issue 9 months ago

sudokhan112 commented 9 months ago

Could you publish code/instructions on how to fine-tune with personal data?

shikaiqiu commented 9 months ago

Hi there! We don't have any fine-tuning code to share since our method is zero-shot (it directly runs inference on pre-trained LLMs). To fine-tune on your own data, you can perform the usual fine-tuning with any LLM of your choice once you convert the time series into strings in our format (see https://github.com/ngruver/llmtime/blob/main/models/llmtime.py#L209).

sudokhan112 commented 9 months ago

I looked into the HF SFTTrainer and the lit-gpt library. Both appear to target instruction fine-tuning, where the dataset is in a "question: answer" format. Could you explain, or point me to detailed instructions on, how I can fine-tune a model like LLaMA on a dataset like "titanic"?

shikaiqiu commented 9 months ago

The process would be identical to how you fine-tune LLaMA with a language modeling objective on any text data, once you convert your time series into strings with our format. You can find how to do this conversion here (https://github.com/ngruver/llmtime/blob/main/models/llmtime.py#L202-L209). Each entry in your dataset would simply be a string representing the time series (rather than a question/answer format) and you would train the model to do next-token prediction with that string.

I can't provide more detailed instructions because our experiments don't involve fine-tuning. Therefore, you might need to try things to find out how to get the details right (e.g. hyperparameters for preprocessing, how much history to condition on, etc.).
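For readers following along, here is a minimal sketch of the conversion and windowing described above. It is not the repository's serializer (see the linked llmtime.py for that). The function names `serialize_series` and `make_training_strings`, the fixed-precision digit encoding, the separators, and the window/stride parameters are all assumptions made for illustration.

```python
def serialize_series(values, precision=2, digit_sep=" ", step_sep=", "):
    """Turn a list of numbers into one string (illustrative encoding only).

    Each value is rounded to `precision` decimal places, the decimal point
    is dropped, and the remaining digits are separated by `digit_sep`.
    Time steps are joined with `step_sep`. All defaults are assumptions.
    """
    encoded = []
    for v in values:
        scaled = round(v * 10**precision)      # e.g. 2.5 -> 250 at precision=2
        sign = "-" if scaled < 0 else ""
        digits = digit_sep.join(str(abs(scaled)))
        encoded.append(sign + digits)
    return step_sep.join(encoded)


def make_training_strings(values, window, stride=1, **kwargs):
    """Slice a long series into windows; each window becomes one training string.

    `window` controls how much history each example contains, which is one
    of the details the reply above says you would need to tune yourself.
    """
    last_start = max(len(values) - window, 0)
    return [
        serialize_series(values[start:start + window], **kwargs)
        for start in range(0, last_start + 1, stride)
    ]


if __name__ == "__main__":
    print(serialize_series([1.0, 2.5, 10.25]))  # 1 0 0, 2 5 0, 1 0 2 5
    for text in make_training_strings([0, 1, 2, 3, 4], window=3, precision=0):
        print(text)
```

Each resulting string would then be one plain-text entry in a causal language modeling dataset, with no question/answer structure.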

sudokhan112 commented 9 months ago

Is this only applicable to datasets with a time column and a single value column, like 'time'-'value'? What if the dataset has multiple columns for each time step? How would that affect dataset creation and the fine-tuning process?

shikaiqiu commented 9 months ago

I'd say how best to handle multivariate series in this framework is an open question. We mainly explored univariate time series (a single column) in the paper. For multivariate series, you could first try simply modeling each column independently, which is what we did for the Informer datasets, and it worked well enough. Alternatively, you can include all columns with a format like x1[0] & x2[0] & x3[0], x1[1] & x2[1] & x3[1], ... where & is some special separator token to delimit the columns. But we haven't explored this yet.
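The two options in the reply above can be sketched as follows. This is an untested idea in the thread, so treat the code as illustration only: the helper names `split_columns` and `serialize_multivariate`, the digit encoding, and the exact separator strings are assumptions, not part of llmtime.

```python
def split_columns(rows):
    """Option 1: turn rows of [x1, x2, ...] into one univariate series per column."""
    return [list(column) for column in zip(*rows)]


def serialize_multivariate(rows, precision=2, col_sep=" & ", step_sep=", "):
    """Option 2: put all columns of a time step together, delimited by `col_sep`.

    Produces a string shaped like "x1[0] & x2[0], x1[1] & x2[1], ...",
    with each value written as space-separated fixed-precision digits
    (an assumed encoding, chosen only for this sketch).
    """
    def encode(v):
        scaled = round(v * 10**precision)
        sign = "-" if scaled < 0 else ""
        return sign + " ".join(str(abs(scaled)))

    return step_sep.join(
        col_sep.join(encode(v) for v in row) for row in rows
    )


if __name__ == "__main__":
    rows = [[1.0, 2.0], [1.5, 2.5]]          # two time steps, two columns
    print(split_columns(rows))               # [[1.0, 1.5], [2.0, 2.5]]
    print(serialize_multivariate(rows, precision=1))  # 1 0 & 2 0, 1 5 & 2 5
```

With option 1, each column is serialized and modeled on its own; with option 2, the separator needs to be a string the tokenizer handles consistently, which is something to check for the model you choose.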