pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Pre training #1257

Closed: etemiz closed this issue 1 month ago

etemiz commented 2 months ago

Is there a way to do pre-training using txt files? I mostly see fine-tuning recipes. Thanks

SalmanMohammadi commented 2 months ago

Hey @etemiz! Thanks for checking out torchtune : ))

We support continued pre-training through our text_completion_dataset.

The example I linked actually shows you how to specify this dataset for your own txt file. You can then use this dataset configuration for any of our SFT recipes: LoRA/full-finetune and single-device/distributed.

Let me know if this makes sense.

etemiz commented 2 months ago

That worked well. But I am stuck again.

File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torchtune/datasets/_text_completion.py", line 55, in _prepare_sample
    prompt = sample[self._column]
KeyError: None

Do I have to convert the txt file to something JSON-like?

{ "text": "unstructured text goes here" }

RdoubleA commented 2 months ago

Hi @etemiz, assuming you just have a raw text file with no column names, can you try using the default value for column? From the docstring for text_completion_dataset:

For local datasets with a single column, use the default "text", which is what is assigned by Hugging Face datasets when loaded into memory. Default is "text".

etemiz commented 2 months ago

I didn't touch the default. My yaml:

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  data_files: /tmp/all.txt
  split: train

all.txt is a normal text file with long lines. Each line is a topic.

RdoubleA commented 2 months ago

Hmm I see. Do you mind sharing the torchtune version you are on as well?

etemiz commented 2 months ago

torchtune 0.2.1

SalmanMohammadi commented 2 months ago

Could you try changing your yaml to:

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  data_files: /tmp/all.txt
  split: train
  column: text

i.e. adding the column entry. @RdoubleA, if I'm reading correctly, the text completion builder defaults to column=None, but the class does prompt = sample[self._column], hence the error above. Should we default to column="text" in the builder?

Since datasets seems to default to the "text" column even when loading from a local file? Please correct me here, though.

from datasets import load_dataset
x = load_dataset("text",
    data_files="./my_data.txt",
    split="train"
    )
x[0]
>>> {'text': 'Build a configurable dataset from a freeform, unstructured text corpus similar'}
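
The mismatch described above can be reproduced in miniature. This is a sketch with illustrative names (`prepare_sample` here is not torchtune's actual implementation): the Hugging Face "text" loader keys each sample by "text", but indexing with a column left as None raises the KeyError seen earlier in the thread.

```python
# Illustrative reproduction of the bug (names are hypothetical, not
# torchtune's real code). The HF "text" loader yields samples keyed
# by "text", while the builder leaves the column as None.
sample = {"text": "unstructured text goes here"}

def prepare_sample(sample, column=None):
    # Mirrors `prompt = sample[self._column]` in _text_completion.py:
    # indexing the dict with None raises KeyError: None.
    return sample[column]

try:
    prepare_sample(sample)
except KeyError as err:
    print("KeyError:", err)  # KeyError: None

print(prepare_sample(sample, column="text"))  # unstructured text goes here
```

Defaulting the builder to column="text" would make the two code paths agree for single-column local text files.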

etemiz commented 2 months ago

That worked. One warning and another error appeared.

Warning:

torchao/dtypes/nf4tensor.py:870: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at ../aten/src/ATen/Context.cpp:288.)

Error:

  File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 190, in forward
    return F.embedding(
  File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

My command line:

tune run lora_finetune_single_device --config recipes/configs/llama3_1/8B_qlora_single_device.yaml

RdoubleA commented 2 months ago

Oh man, I didn't realize the builder doesn't use "text" as the default column, but the class itself does. Thanks for pointing this out @SalmanMohammadi / @etemiz. I'll put up a fix.

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Can you check to see if you have any empty lines in your text file? This looks similar to #1191
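
If empty lines turn out to be the culprit, a quick preprocessing pass can drop them before training. This is a workaround sketch; the file paths are the ones from this thread and purely illustrative.

```python
# Workaround sketch: strip empty or whitespace-only lines from the
# training file, since empty samples appear to trigger the embedding
# dtype error above. File paths are illustrative.
def strip_empty_lines(lines):
    """Keep only lines that contain non-whitespace content."""
    return [line for line in lines if line.strip()]

# Demo on an in-memory sample shaped like all.txt with stray blank lines:
sample = ["topic one\n", "\n", "topic two\n", "   \n", "topic three\n"]
print(strip_empty_lines(sample))  # ['topic one\n', 'topic two\n', 'topic three\n']

# For the real file, something along these lines:
# with open("/tmp/all.txt") as src:
#     cleaned = strip_empty_lines(src)
# with open("/tmp/all_clean.txt", "w") as dst:
#     dst.writelines(cleaned)
```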

SalmanMohammadi commented 2 months ago

Can you check to see if you have any empty lines in your text file? This looks similar to #1191

If this turns out to be the issue, happy to add this as a note in the tutorial?

etemiz commented 2 months ago

I erased the empty lines and it worked. Thanks

SalmanMohammadi commented 1 month ago

Let me know if we can help in any other way @etemiz!

Otherwise, are you happy to close this off?

RdoubleA commented 1 month ago

If this turns out to be the issue, happy to add this as a note in the tutorial?

This might be good, although after the fix the empty lines should no longer cause an error.