imoneoi / openchat

OpenChat: Advancing Open-source Language Models with Imperfect Data
https://openchat.team
Apache License 2.0

OchatDataset.estimate_num_batches returns 0 at the beginning of training, and training-stuck problem #212

Closed syboomsy closed 5 months ago

syboomsy commented 5 months ago

I just transformed a jsonl dataset into the given Conversation-Message data format, then used the recommended script to process it into parquet format. Although the processed dataset is recognized and loaded as an OchatDataset, it returns an empty batch count from estimate_num_batches, so the training process exits without a single forward pass.
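
For reference, a single line of my converted jsonl looks roughly like the sketch below. The field names (system, items, from, value, weight) are my reading of the Conversation-Message format and may not match your schema exactly:

```python
# Illustrative record only; the field names are my assumption of the
# Conversation-Message format, not copied from the repo.
import json

record = {
    "system": "",
    "items": [
        {"from": "user",      "value": "What is the capital of France?", "weight": 0.0},
        {"from": "assistant", "value": "Paris.",                         "weight": 1.0},
    ],
}

with open("my_dataset.train.jsonl", "a", encoding="utf-8") as f:  # placeholder path
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```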

I checked the OchatDataset.dataset object member. It is a dict whose keys and values look like:

- total_length: an array like [1, 2, 3, ...]
- num_seqs: an array like [3.4, 441, 2.9, ...]
- seqlens: an array of arrays, something like [array([977]), array([1829]), array([566]), ...]
- input_ids: an array of arrays, like [array of shape (977,), array of shape (1829,), ...]
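
This is roughly the inspection I did (the parquet path is a placeholder; the column names are just what I saw in the dict above):

```python
# Rough inspection script; the path is a placeholder for my processed dataset.
import pyarrow.parquet as pq

table = pq.read_table("my_dataset.train.parquet")
print(table.column_names)

# Print the first few entries of the columns mentioned above.
for name in ("total_length", "num_seqs", "seqlens"):
    if name in table.column_names:
        print(name, "->", table.column(name).to_pylist()[:3])
```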

I do not know whether this dataset format is correct for your code logic, so I am asking for your review here.

Besides, I also ran into a problem where DeepSpeed gets stuck while loading the checkpoint. It occurred both with single-node single-GPU training and with single-node multi-GPU training, and I have no idea what causes it.

syboomsy commented 5 months ago

Problem 3, file location: ochat.training_deepspeed.multipack_sampler::allocate. I am sure the crux of the exit-at-training-start problem is here: the ffd_with_result interface returns only one batch, n is the world size, and c is batch_max_length (meaning batch length per GPU, I assume). The returned batch count is 1, so when I use multiple devices (n >= 2) the allocate interface returns an empty batch to the upper-layer callers. I just want to know why. Is there any problem with my dataset format?
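
To illustrate what I think is happening, here is a minimal sketch. It is my own simplification, not the actual multipack_sampler code; ffd_simple, allocate_simple, and the exact grouping rule are my assumptions about first-fit-decreasing packing plus a rule that drops any incomplete group of n bins:

```python
# Minimal sketch of FFD packing and per-rank allocation; this is my own
# simplification for illustration, not the repo's allocate / ffd_with_result.

def ffd_simple(lengths, c):
    """Pack sequence lengths into bins of capacity c, first-fit-decreasing."""
    caps, bins = [], []                      # remaining capacity / indices per bin
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, cap in enumerate(caps):
            if lengths[idx] <= cap:
                caps[b] -= lengths[idx]
                bins[b].append(idx)
                break
        else:
            caps.append(c - lengths[idx])
            bins.append([idx])
    return bins

def allocate_simple(lengths, c, n, rank):
    """Keep only complete groups of n bins (one bin per rank); return this rank's bins."""
    bins = ffd_simple(lengths, c)
    usable = (len(bins) // n) * n            # trailing incomplete group is dropped
    return [bins[i + rank] for i in range(0, usable, n)]

# A tiny dataset that packs into a single bin: with n = 2 ranks every rank
# gets an empty list, which would make estimate_num_batches return 0.
print(allocate_simple([977, 1829, 566], c=8192, n=2, rank=0))   # []
print(allocate_simple([977, 1829, 566], c=8192, n=1, rank=0))   # [[1, 0, 2]]
```

If that reading is right, a dataset whose tokens fill fewer than world-size bins of batch_max_length would always yield zero batches on multi-GPU runs.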