Is there a way to do pre-training using txt files? I am seeing mostly fine-tuning recipes. Thanks
Hey @etemiz! Thanks for checking out torchtune : ))
We support continued pre-training through our text_completion_dataset.
The example I linked actually shows you how to specify this dataset for your own txt file. You can then use this dataset configuration for any of our SFT recipes: LoRA/full-finetune and single-device/distributed.
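For reference, here is a minimal sketch of building the same dataset in Python. The tokenizer and file paths are placeholders (any torchtune tokenizer works), and I'm setting column explicitly to be safe:

from torchtune.datasets import text_completion_dataset
from torchtune.models.llama3 import llama3_tokenizer

# Placeholder paths: swap in your own tokenizer and corpus.
tokenizer = llama3_tokenizer("/path/to/tokenizer.model")
ds = text_completion_dataset(
    tokenizer,
    source="text",              # Hugging Face's plain-text loading script
    data_files="/path/to/corpus.txt",
    split="train",
    column="text",              # the column HF assigns to raw text files
)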
Let me know if this makes sense.
That worked well. But I am stuck again.
File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torchtune/datasets/_text_completion.py", line 55, in _prepare_sample prompt = sample[self._column] KeyError: None
Do I have to convert the txt file to something of a JSON nature?
{ "text": "unstructured text goes here" }
Hi @etemiz, assuming you just have a raw text file with no column names, can you try using the default value for column? From the docstring for text_completion_dataset:
For local datasets with a single column, use the default "text", which is what is assigned by Hugging Face datasets when loaded into memory. Default is "text".
I didn't touch the default. My yaml:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  data_files: /tmp/all.txt
  split: train
all.txt is a normal text file with long lines. Each line is a topic.
Hmm I see. Do you mind sharing the torchtune version you are on as well?
torchtune 0.2.1
Could you try changing your yaml to:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  data_files: /tmp/all.txt
  split: train
  column: text
i.e. adding the column entry.
@RdoubleA if I'm reading correctly, the text completion builder defaults column=None, but the class will do prompt = sample[self._column], hence the error above. Should we default to column="text" in the builder, since datasets seems to default to the text column, also when loading from local files? Please correct me here, though.
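To illustrate, a paraphrased sketch of the 0.2.1 behavior (not the exact torchtune source):

# Paraphrased sketch, not the actual torchtune class:
class TextCompletionDataset:
    def __init__(self, column=None):
        # The builder passes column=None by default.
        self._column = column

    def _prepare_sample(self, sample):
        # sample looks like {"text": "..."}, so indexing with None
        # raises KeyError: None.
        return sample[self._column]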
from datasets import load_dataset

x = load_dataset(
    "text",
    data_files="./my_data.txt",
    split="train",
)
x[0]
>>> {'text': 'Build a configurable dataset from a freeform, unstructured text corpus similar'}
That worked, but one warning and another error appeared. Warning:
torchao/dtypes/nf4tensor.py:870: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at ../aten/src/ATen/Context.cpp:288.)
Error:
File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 190, in forward
return F.embedding(
File "/home/dead/ml/torchtune/torchtune/v/lib/python3.10/site-packages/torch/nn/functional.py", line 2551, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
My command line:
tune run lora_finetune_single_device --config recipes/configs/llama3_1/8B_qlora_single_device.yaml
Oh man, I didn't realize the builder doesn't use "text" as the default column, but the class itself does. Thanks for pointing this out @SalmanMohammadi / @etemiz. I'll put up a fix.
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
Can you check to see if you have any empty lines in your text file? This looks similar to #1191
If this turns out to be the issue, happy to add this as a note in the tutorial?
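If it is, a minimal sketch for dropping blank lines from the corpus in place (using the /tmp/all.txt path from the config above):

# Drop blank lines from the corpus before training.
with open("/tmp/all.txt") as f:
    lines = [line for line in f if line.strip()]
with open("/tmp/all.txt", "w") as f:
    f.writelines(lines)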
I erased the empty lines and it worked. Thanks
Let me know if we can help in any other way @etemiz!
Otherwise, are you happy to close this off?