pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

text_completion_dataset removed? #1140

Open wiiiktor opened 2 months ago

wiiiktor commented 2 months ago

This line in my custom recipe does not work (the only line I have added): from torchtune.datasets import text_completion_dataset

When I run tune, the message is: ImportError: cannot import name 'text_completion_dataset' from 'torchtune.datasets' (/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchtune/datasets/__init__.py)

Also, I have noticed that a similar import line, from torchtune.datasets import ConcatDataset, is not present in the file I get after running: tune cp lora_finetune_distributed ./custom_lora_finetune_distributed.py

To me, it looks like the torchtune.datasets module has been removed or updated. How can I use the text_completion_dataset configuration now?

joecummings commented 2 months ago

Hi @wiiiktor - what version of the library are you using? In our latest stable release (v0.1.1), this dataset doesn't exist. However, if you upgrade to nightlies or install from source, you should see the appropriate API.
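
A quick way to check which build you have (a minimal sketch; it only uses the standard-library importlib.metadata, so nothing torchtune-specific is assumed):

import importlib.metadata

# Print the installed torchtune version.
print(importlib.metadata.version("torchtune"))

try:
    # Present in nightlies / source installs, absent from the v0.1.1 release.
    from torchtune.datasets import text_completion_dataset
    print("text_completion_dataset is available")
except ImportError:
    print("text_completion_dataset is not in this build -- upgrade to the nightly or install from source")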

wiiiktor commented 2 months ago

Thanks, it worked with the nightly build. You might want to add a note about this in the manual here: https://pytorch.org/torchtune/main/tutorials/datasets.html#custom-unstructured-text-corpus

artisanclouddev commented 1 month ago

Hi @joecummings, regarding the text_completion_dataset:

I installed torchtune with git clone and am using text_completion_dataset to train on my dataset.

My JSON format is:

{
"text": "xxx",
"text": "xxx",
"text": "xxx",
...
}
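
In other words, one {"text": ...} record per sample. A small sketch of writing and loading such a file with the Hugging Face datasets json loader (which is what source: json resolves to under the hood; the file name here is just an example):

import json
import os

from datasets import load_dataset

samples = [{"text": "xxx"}, {"text": "xxx"}, {"text": "xxx"}]

# Write one JSON object per line (JSON Lines); a top-level list of objects also works.
os.makedirs("./data/design", exist_ok=True)
with open("./data/design/my_file.json", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Roughly the call the dataset builder makes internally for source: json.
ds = load_dataset("json", data_files="./data/design/my_file.json", split="train")
print(ds[0]["text"])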

Here is the config file:

# Config for multi-device LoRA finetuning in lora_finetune_distributed.py
# using a Llama3 8B Instruct model
#
# This config assumes that you've run the following command before launching
# this run:
#   tune download meta-llama/Meta-Llama-3-8B-Instruct --output-dir /tmp/Meta-Llama-3-8B-Instruct --hf-token <HF_TOKEN>
#
# To launch on 2 devices, run the following command from root:
#   tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
#   tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device LoRA finetuning please use 8B_lora_single_device.yaml
# or 8B_qlora_single_device.yaml

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: ./checkpoints/Meta-Llama-3-8B-Instruct/tokenizer.model

# Model Arguments
model:
  _component_: torchtune.models.llama3.lora_llama3_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: ./checkpoints/Meta-Llama-3-8B-Instruct/
  checkpoint_files: [
    consolidated.00.pth
  ]
  recipe_checkpoint: null
  output_dir: ./tuned_checkpoints/design/
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  split: train
  column: text
  source: ./data/design/
  max_seq_len: 512
seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 5
max_steps_per_epoch: null
gradient_accumulation_steps: 32

# Logging
output_dir: ./tuned_checkpoints/design/
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: 1
log_peak_memory_stats: False

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: False

This is the log output for the 5 epochs:

Step 1 | loss:2.810748815536499 lr:2.9999999999999997e-06 tokens_per_second_per_gpu:215.64717233583934 
Step 2 | loss:2.782249689102173 lr:5.999999999999999e-06 tokens_per_second_per_gpu:204.9724443319711 
Step 3 | loss:2.835712432861328 lr:8.999999999999999e-06 tokens_per_second_per_gpu:202.16558347399325 
Step 4 | loss:2.7990615367889404 lr:1.1999999999999999e-05 tokens_per_second_per_gpu:201.7966954820391 
Step 5 | loss:2.9729669094085693 lr:1.4999999999999999e-05 tokens_per_second_per_gpu:209.84422263970436 
Step 6 | loss:2.8799641132354736 lr:1.7999999999999997e-05 tokens_per_second_per_gpu:215.3894470070926 

...

Step 339 | loss:2.080404043197632 lr:1.4307861455023218e-06 tokens_per_second_per_gpu:205.6468325490002 
Step 340 | loss:2.07681941986084 lr:1.1827948028283352e-06 tokens_per_second_per_gpu:201.3310088551471 
Step 341 | loss:2.222018003463745 lr:9.583034219987406e-07 tokens_per_second_per_gpu:202.5816292878222 
Step 342 | loss:2.0848352909088135 lr:7.573474528049739e-07 tokens_per_second_per_gpu:217.48834996019212 
Step 343 | loss:2.020090341567993 lr:5.799586285241242e-07 tokens_per_second_per_gpu:215.56537230457158 
Step 344 | loss:2.177356004714966 lr:4.2616496090790983e-07 tokens_per_second_per_gpu:218.51878881052903 
Step 345 | loss:2.1635489463806152 lr:2.959907357592661e-07 tokens_per_second_per_gpu:207.72459896123226 
Step 346 | loss:2.1038498878479004 lr:1.8945650909737985e-07 tokens_per_second_per_gpu:215.9351581299387 
Step 347 | loss:2.2861390113830566 lr:1.0657910391161928e-07 tokens_per_second_per_gpu:214.5035230548999 
Step 348 | loss:2.0461740493774414 lr:4.737160750500901e-08 tokens_per_second_per_gpu:204.32284864693142 
Step 349 | loss:2.1886043548583984 lr:1.1843369427583238e-08 tokens_per_second_per_gpu:207.14559235657984 
Step 350 | loss:2.154921054840088 lr:0.0 tokens_per_second_per_gpu:211.50423790988427 

The loss won't go any lower than this from beginning to end. Is this the expected loss for unsupervised training?
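
(For context on the numbers: the configured loss, torch.nn.CrossEntropyLoss, is the average negative log-likelihood per token, so exp(loss) gives perplexity; a quick sketch using values from the log above:)

import math

# Token-level cross-entropy -> perplexity, using the first and last loss values above.
start_loss, end_loss = 2.81, 2.15
print(f"perplexity at step 1:   {math.exp(start_loss):.1f}")  # roughly 16.6
print(f"perplexity at step 350: {math.exp(end_loss):.1f}")    # roughly 8.6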

joecummings commented 1 month ago

Hey @artisanclouddev - a couple of follow-up questions on this:

  1. Which model are you using? Looks like Llama3 8B Instruct, but I want to confirm. Anecdotally, I get better performance with base models rather than instruct-tuned models when doing continued pre-training work.
  2. How big is your dataset? Num samples and median length of samples.
  3. Have you tried with add_eos: False?

Matrix-X commented 1 month ago

Thanks for your reply, @joecummings.

Hey @artisanclouddev - a couple of follow-up questions on this:

  1. Which model are you using? Looks like Llama3 8B Instruct, but I want to confirm. Anecdotally, I get better performance with base models rather than instruct-tuned models when doing continued pre-training work.

Correct, Llama3 8B Instruct is what I used; let me try the base model.

  2. How big is your dataset? Num samples and median length of samples.

The sample file is a JSON file of around 16 MB. I am trying to use this data for LoRA. Is a dataset of this size better suited to LoRA training or full fine-tuning?

  3. Have you tried with add_eos: False?

I don't know where to add this option in the config YAML file. Is its default value True?

RdoubleA commented 2 weeks ago

Hi @artisanclouddev / @Matrix-X, sorry for the late response here. Are you still running into issues? Based on your config, it seems that the text_completion_dataset is configured incorrectly:

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  split: train
  column: text
  source: ./data/design/
  max_seq_len: 512

If you are using a local JSON file, you should specify source as json and pass the file path to data_files:

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  split: train
  column: text
  source: json
  data_files: ./data/design/my_file.json
  max_seq_len: 512
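
The keys under dataset: are passed through to the builder as keyword arguments, so this corresponds roughly to the following call (a sketch; the recipe normally builds the tokenizer and dataset for you, and add_eos, mentioned earlier, can likewise be set here or as add_eos: False under dataset: in the YAML):

from torchtune.datasets import text_completion_dataset
from torchtune.models.llama3 import llama3_tokenizer

# Sketch of what the recipe instantiates from the corrected dataset config above.
tokenizer = llama3_tokenizer(path="./checkpoints/Meta-Llama-3-8B-Instruct/tokenizer.model")

ds = text_completion_dataset(
    tokenizer=tokenizer,
    source="json",                            # resolves to the Hugging Face json loader
    data_files="./data/design/my_file.json",  # forwarded to datasets.load_dataset
    column="text",
    split="train",
    max_seq_len=512,
    add_eos=False,                            # optional; assumed to default to True
)
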
Matrix-X commented 1 week ago

Thanks for the reply, I'll try it later.