imoneoi / openchat

OpenChat: Advancing Open-source Language Models with Imperfect Data
https://openchat.team
Apache License 2.0

OchatDataset.estimate_num_batches returns 0 at the beginning of training, and training-stuck problem #212

Closed syboomsy closed 5 months ago

syboomsy commented 5 months ago

I just transformed a jsonl dataset into the given Conversation-Message data format, then used the recommended script to process it into parquet format. Although the processed dataset is recognized and loaded as an OchatDataset, it returns an empty batch count from estimate_num_batches, so the training process exits without a single forward pass.
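
For reference, a single line of my converted jsonl looks roughly like the sketch below. The field names (system, items, from, value, weight) are my reading of the Conversation-Message format and may not match your schema exactly:

```python
# Illustrative record only; the field names are my assumption of the
# Conversation-Message format, not copied from the repo.
import json

record = {
    "system": "",
    "items": [
        {"from": "user",      "value": "What is the capital of France?", "weight": 0.0},
        {"from": "assistant", "value": "Paris.",                         "weight": 1.0},
    ],
}

with open("my_dataset.train.jsonl", "a", encoding="utf-8") as f:  # placeholder path
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```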

I checked the OchatDataset.dataset object member. It is a dict whose keys and values look like:

- total_length: an array like [1, 2, 3, ...]
- num_seqs: an array like [3.4, 441, 2.9, ...]
- seqlens: an array of arrays, something like [array([977]), array([1829]), array([566]), ...]
- input_ids: an array of arrays, like [array of shape (977,), array of shape (1829,), ...]
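
This is roughly the inspection I did (the parquet path is a placeholder; the column names are just what I saw in the dict above):

```python
# Rough inspection script; the path is a placeholder for my processed dataset.
import pyarrow.parquet as pq

table = pq.read_table("my_dataset.train.parquet")
print(table.column_names)

# Print the first few entries of the columns mentioned above.
for name in ("total_length", "num_seqs", "seqlens"):
    if name in table.column_names:
        print(name, "->", table.column(name).to_pylist()[:3])
```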

I do not know whether this dataset format is correct for your code logic, so I am asking for your review here.

Besides, I also ran into a problem where DeepSpeed gets stuck while loading the checkpoint. It occurred both with single-node single-GPU training and with single-node multi-GPU training, and I have no idea what causes it.

syboomsy commented 5 months ago

Problem 3, file location: ochat.training_deepspeed.multipack_sampler::allocate. I am sure the crux of the exit-at-training-start problem is here: the ffd_with_result interface returns only one batch, n is the world size, and c is batch_max_length (meaning batch length per GPU, I assume). The returned batch count is 1, so when I use multiple devices (n >= 2) the allocate interface returns an empty batch to the upper-layer callers. I just want to know why. Is there any problem with my dataset format?
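
To illustrate what I think is happening, here is a minimal sketch. It is my own simplification, not the actual multipack_sampler code; ffd_simple, allocate_simple, and the exact grouping rule are my assumptions about first-fit-decreasing packing plus a rule that drops any incomplete group of n bins:

```python
# Minimal sketch of FFD packing and per-rank allocation; this is my own
# simplification for illustration, not the repo's allocate / ffd_with_result.

def ffd_simple(lengths, c):
    """Pack sequence lengths into bins of capacity c, first-fit-decreasing."""
    caps, bins = [], []                      # remaining capacity / indices per bin
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, cap in enumerate(caps):
            if lengths[idx] <= cap:
                caps[b] -= lengths[idx]
                bins[b].append(idx)
                break
        else:
            caps.append(c - lengths[idx])
            bins.append([idx])
    return bins

def allocate_simple(lengths, c, n, rank):
    """Keep only complete groups of n bins (one bin per rank); return this rank's bins."""
    bins = ffd_simple(lengths, c)
    usable = (len(bins) // n) * n            # trailing incomplete group is dropped
    return [bins[i + rank] for i in range(0, usable, n)]

# A tiny dataset that packs into a single bin: with n = 2 ranks every rank
# gets an empty list, which would make estimate_num_batches return 0.
print(allocate_simple([977, 1829, 566], c=8192, n=2, rank=0))   # []
print(allocate_simple([977, 1829, 566], c=8192, n=1, rank=0))   # [[1, 0, 2]]
```

If that reading is right, a dataset whose tokens fill fewer than world-size bins of batch_max_length would always yield zero batches on multi-GPU runs.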