stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

DistilBERTurk training for question answering failed #25

Open ekandemir opened 3 years ago

ekandemir commented 3 years ago

Hey, I tried to train the DistilBERTurk model for question answering using the run_squad.py script. After training, I got the following error during the evaluation stage:

Traceback (most recent call last):
  File "run_squad.py", line 838, in <module>
    main()
  File "run_squad.py", line 827, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 344, in evaluate
    start_logits, end_logits = output
ValueError: too many values to unpack (expected 2)

When I tried to discard the last value with "start_logits, end_logits, = output", the error became:

Traceback (most recent call last):
  File "run_squad.py", line 839, in <module>
    main()
  File "run_squad.py", line 828, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 323, in evaluate
    output = [to_list(output[i]) for output in outputs.to_tuple()]
  File "run_squad.py", line 323, in <listcomp>
    output = [to_list(output[i]) for output in outputs.to_tuple()]
IndexError: tuple index out of range

I checked the model with samples from the dataset and the confidence levels were really low, mostly below 0.001. I assume the training didn't go right either.
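(A minimal sketch of the kind of check I mean; the checkpoint path and the texts are just placeholders:)

from transformers import pipeline

# Placeholder path to the fine-tuned checkpoint directory
qa = pipeline(
    "question-answering",
    model="./tmp/debug",
    tokenizer="./tmp/debug",
)

result = qa(
    question="Soru buraya",          # placeholder question
    context="Bağlam metni buraya.",  # placeholder context
)
print(result["score"])  # mostly below 0.001 with the DistilBERTurk checkpoint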

I tried to train the original DistilBERT with the same script and the same dataset; it trained without errors and the confidence levels were high. I compared the layers, but both models looked the same. I also tried loading the model as a QA model and saving it, but the error occurred again.

Thank you so much.

stefan-it commented 3 years ago

Hi @ekandemir ,

thanks for your interest and for using the distilled version :hugs:

Could you specify the exact Transformers version that you're using for fine-tuning? :thinking:

I'm currently using:

python run_qa.py \
  --model_name_or_path dbmdz/distilbert-base-turkish-cased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

with Transformers 4.3.0.dev0 (latest master).
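If you want to match that exact version, installing Transformers from source should do it:

pip install git+https://github.com/huggingface/transformers.git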

(Yes, SQuAD is not a Turkish QA dataset, but fine-tuning runs fine.)

Could you also paste the exact training command that you use?

stefan-it commented 3 years ago

Oh, I just saw that you're using the legacy script. Is there any chance that you could use the new run_qa.py script instead?

I would be very interested in the Turkish QA dataset that you're using. If it's not available in the awesome Hugging Face datasets library, then we maybe could integrate it :hugs:

ekandemir commented 3 years ago

Thanks for the quick answer. I've been trying to run the new script, but due to a Windows machine and network restrictions I couldn't get datasets running properly. It also didn't work with the Turkish SQuAD and my customized local dataset files. I installed Transformers 4.3.0.dev0 (latest master) and ran the command:

python run_squad.py \
  --model_type distilbert \
  --model_name_or_path ../distilbert-base-turkish-cased  \
  --do_train   \
  --do_eval  \
  --train_file tquad/train-v1.1.json  \
  --predict_file tquad/dev-v1.1.json  \
  --per_gpu_train_batch_size 8  \
  --learning_rate 3e-5  \
  --num_train_epochs 1.0   \
  --max_seq_length 384  \
  --doc_stride 128  \
  --output_dir "./tmp/debug"

But I get the same error. I should probably find a way to run the new script, but if you have any guess as to why the old one crashes, I would be thankful to hear it.

The Turkish QA dataset is available at TQuad, and there are already some BERT models on the Hugging Face hub fine-tuned on this dataset. PS: The dataset is not exactly in SQuAD format, so it needs a slight conversion first.
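(For illustration only, the kind of light conversion I mean; the exact differences are assumed here, e.g. integer ids and a missing top-level version field, and the file names are placeholders:)

import json

# Assumed file names; adjust to where the raw TQuAD files live.
with open("tquad/train-v0.1.json", encoding="utf-8") as f:
    raw = json.load(f)

for article in raw["data"]:
    article.setdefault("title", "")        # make sure every article has a title
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            qa["id"] = str(qa["id"])       # run_squad.py expects string ids

raw.setdefault("version", "1.1")           # SQuAD v1.1 files carry a version field

with open("tquad/train-v1.1.json", "w", encoding="utf-8") as f:
    json.dump(raw, f, ensure_ascii=False)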

Thanks again.

stefan-it commented 3 years ago

Hi @ekandemir ,

after some debugging I can confirm that there's something strange with the configuration of my DistilBERT model. The root cause is in the model configuration, output_hidden_states=True to be precise. This option is not set in the "official" distilbert-base-cased model, for example. With it, the model additionally outputs the hidden states, which is exactly what your error message "too many values to unpack (expected 2)" shows.
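To make this concrete, here is a minimal sketch (not the exact run_squad.py code) that reproduces the effect by passing the flag at call time, which is equivalent to having it set in the model config:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "dbmdz/distilbert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The QA head is freshly initialized here; we only care about the shape of the output.
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

inputs = tokenizer("Soru?", "Bağlam cümlesi.", return_tensors="pt")

with torch.no_grad():
    # Same effect as output_hidden_states=True in the model config:
    # the returned tuple gets an extra element holding the hidden states.
    outputs = model(**inputs, output_hidden_states=True, return_dict=False)

print(len(outputs))  # 3 instead of 2
# start_logits, end_logits = outputs  # -> ValueError: too many values to unpack (expected 2)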

I will remove this option from the config; evaluation should then be fine (I checked it locally), and you should be able to use the old QA script again.

I've also written a new Hugging Face datasets recipe for TSQuAD, which I will integrate into datasets library soon.

I will report back here once I've changed the model configuration, @ekandemir !

(Thanks also to @sgugger for providing more information about that issue :hugs: )

sgugger commented 3 years ago

On our side, we'll work on fixing the scripts so the error does not appear if the option output_hidden_states=True is set in the config.
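Roughly, the idea is to make the unpacking in the legacy evaluation loop tolerant of extra outputs, along these lines (a sketch, not the final patch):

# Only the first two entries are the start/end logits; anything after that
# (hidden states, attentions enabled via the model config) is ignored.
start_logits, end_logits = output[0], output[1]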

ekandemir commented 3 years ago

Changing the config file solved the problem when training from the main model, thanks. But adding an output_hidden_states = false line to the config didn't help when fine-tuning the QA model, so I added a model.config.output_hidden_states = False line to run_squad.py at line 747 as a temporary solution.
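(For reference, the workaround is just this one line, placed right after the model is loaded; the exact line number will differ between Transformers versions:)

# Temporary workaround in run_squad.py, right after the model is loaded
model.config.output_hidden_states = False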

Thanks for your help.

stefan-it commented 3 years ago

Hi @ekandemir , great to hear that it works with the old script now.

Here's a first draft of a Hugging Face Datasets recipe.

Just create a folder named squad_tr and put a file squad_tr.py in it with the following content:

from __future__ import absolute_import, division, print_function

import json

import datasets

# BibTeX citation
_CITATION = """
"""

_DESCRIPTION = """\
TSQuAD
"""

_URL = "https://raw.githubusercontent.com/TQuad/turkish-nlp-qa-dataset/master/"
_URLS = {
    "train": _URL + "train-v0.1.json",
    "dev": _URL + "dev-v0.1.json",
}

class SquadTrConfig(datasets.BuilderConfig):
    """BuilderConfig for TSQuAD."""

    def __init__(self, **kwargs):
        """BuilderConfig for TSQuAD.

        Args:
          **kwargs: keyword arguments forwarded to super.
        """
        super(SquadTrConfig, self).__init__(**kwargs)

class SquadTr(datasets.GeneratorBasedBuilder):
    """TSQuAD dataset."""

    VERSION = datasets.Version("0.1.0")

    BUILDER_CONFIGS = [
        SquadTrConfig(
            name="v1.1.0",
            version=datasets.Version("1.0.0", ""),
            description="Plain text Turkish squad version 1",
        ),
    ]

    def _info(self):
        # Specifies the datasets.DatasetInfo object
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            # datasets.features.FeatureConnectors
            features=datasets.Features(
                {
                    # These are the features of your dataset like images, labels ...
                    "id": datasets.Value("string"),
                    "title": datasets.Value("string"),
                    "context": datasets.Value("string"),
                    "question": datasets.Value("string"),
                    "answers": datasets.features.Sequence(
                        {
                            "text": datasets.Value("string"),
                            "answer_start": datasets.Value("int32"),
                        }
                    ),
                }
            ),
            # If there's a common (input, target) tuple from the features,
            # specify them here. They'll be used if as_supervised=True in
            # builder.as_dataset.
            supervised_keys=None,
            # Homepage of the dataset for documentation
            homepage="https://github.com/TQuad/turkish-nlp-qa-dataset",
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        # Downloads the data and defines the splits
        # dl_manager is a datasets.download.DownloadManager that can be used to
        # download and extract URLs
        dl_dir = dl_manager.download_and_extract(_URLS)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": dl_dir["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": dl_dir["dev"]},
            ),
        ]

    def _generate_examples(self, filepath):
        """Yields examples."""
        # Yields (key, example) tuples from the dataset
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for example in data["data"]:
                title = example.get("title", "").strip()
                for paragraph in example["paragraphs"]:
                    context = paragraph["context"].strip()
                    for qa in paragraph["qas"]:
                        question = qa["question"].strip()
                        id_ = str(qa["id"])

                        answer_starts = [answer["answer_start"] for answer in qa["answers"]]
                        answers = [answer["text"].strip() for answer in qa["answers"]]

                        yield id_, {
                            "title": title,
                            "context": context,
                            "question": question,
                            "id": id_,
                            "answers": {
                                "answer_start": answer_starts,
                                "text": answers,
                            },
                        }

Then you can use the shiny new run_qa.py script, like:

$ python3 run_qa.py \
  --model_name_or_path dbmdz/distilbert-base-turkish-cased \
  --dataset_name ./squad_tr \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./output-squad-tr
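If you want to sanity-check the recipe before training, you can also load it directly with the datasets library (quick sketch):

from datasets import load_dataset

# Path to the local folder that contains squad_tr.py
dataset = load_dataset("./squad_tr")

print(dataset)                          # DatasetDict with "train" and "validation" splits
print(dataset["train"][0]["question"])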

You may ask about good baseline comparisons. I recently found a great paper by @xplip and @JoPfeiff, "How Good is Your Tokenizer?", that also uses this QA dataset with the "normal" BERTurk model.

medical-projects commented 3 years ago

cheaters gonna cheat :D typical :D