huggingface / datasets


Distributed Training Error on Customized Dataset #5722

Closed wlhgtc closed 1 year ago

wlhgtc commented 1 year ago

Hi guys, recently I tried to use datasets to train a dual encoder. I wrote my own dataset script following the nice tutorial. Here is my code:

import logging
import random
import string

import datasets
import jsonlines
from datasets import Sequence

# RetrivalConfig (a custom BuilderConfig) and md5_hash are defined elsewhere and not shown here
logger = logging.getLogger(__name__)


class RetrivalDataset(datasets.GeneratorBasedBuilder):
    """CrossEncoder dataset."""

    BUILDER_CONFIGS = [RetrivalConfig(name="DuReader")]

    # DEFAULT_CONFIG_NAME = "DuReader"

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "question": datasets.Value("string"),
                    "documents": Sequence(datasets.Value("string")),
                }
            ),
            supervised_keys=None,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""

        train_file = self.config.data_dir + self.config.train_file
        valid_file = self.config.data_dir + self.config.valid_file

        logger.info(f"Training  on {self.config.train_file}")
        logger.info(f"Evaluating on {self.config.valid_file}")

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"file_path": train_file}
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION, gen_kwargs={"file_path": valid_file}
            ),
        ]

    def _generate_examples(self, file_path):
        with jsonlines.open(file_path, "r") as f:
            for record in f:
                label = record["label"]
                question = record["question"]

                # dual encoder
                all_documents = record["all_documents"]
                positive_paragraph = all_documents.pop(label)
                all_documents = [positive_paragraph] + all_documents

                u_id = "{}_#_{}".format(
                    md5_hash(question + "".join(all_documents)),
                    "".join(random.sample(string.ascii_letters + string.digits, 7)),
                )
                item = {
                    "question": question,
                    "documents": all_documents,
                    "id": u_id,
                }
                yield u_id, item
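
For reference, a minimal sketch of how a script like this would typically be loaded (the script path retrival_dataset.py, the data directory, and the file names are placeholders; it assumes RetrivalConfig exposes data_dir, train_file and valid_file as config kwargs):

from datasets import load_dataset

# extra keyword arguments are forwarded to the builder config (RetrivalConfig here),
# so data_dir / train_file / valid_file end up on self.config inside the script
ds = load_dataset(
    "retrival_dataset.py",          # hypothetical path to the script above
    data_dir="/path/to/dureader/",  # trailing slash matters: the script concatenates the paths directly
    train_file="train.jsonl",
    valid_file="valid.jsonl",
)
print(ds["train"][0]["question"])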

It works well on a single GPU, but I got the following error when using DDP:

Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED)

Here is my training script on a machine with two A100s:

export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,ENV
nohup torchrun --nproc_per_node 2 train.py experiments/de-big.json >logs/de-big.log 2>&1&

I am not sure whether this error is related to my dataset code when using DDP. I also noticed PR #5369, but I don't know when and where I should use the split_dataset_by_node function.

@lhoestq, hope you can help me!

lhoestq commented 1 year ago

Hmm the error doesn't seem related to data loading.

Regarding split_dataset_by_node: it's generally used to split an iterable dataset (e.g. when streaming) in PyTorch DDP. It's not needed if you use a regular dataset, since the PyTorch DataLoader already assigns a subset of the dataset indices to each node.
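
For illustration, a minimal sketch of how split_dataset_by_node is typically used with a streaming IterableDataset launched via torchrun (the JSON-lines file name is a placeholder):

import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

# torchrun sets RANK and WORLD_SIZE for every process
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# a streaming dataset is an IterableDataset: there are no indices for a sampler
# to shard, so each rank explicitly keeps only its own subset of examples
ds = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

dataloader = DataLoader(ds, batch_size=8)  # each process now iterates over a disjoint shard

With a regular map-style dataset, the DistributedSampler set up for the DataLoader handles this split instead, so no extra call is needed.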