stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

CHEATSHEET for ConvBERT? #27

Open finardi opened 3 years ago

finardi commented 3 years ago

Hi guys, impressive results with ConvBERT! Is there any cheatsheet for how to train it from scratch?

Your BERT and ELECTRA cheatsheets are very helpful.

stefan-it commented 3 years ago

Hi @finardi ,

sorry for the late reply!

It's definitely on my to-do list and should be added this week :hugs:

finardi commented 3 years ago

Thank you for the update Stefan :)

finardi commented 3 years ago

Hi Stefan, I'm trying to train ConvBERT from scratch. Did you use the Hugging Face script or the official implementation (TF 1.15)?

Thanks in advance!

stefan-it commented 3 years ago

Hi @finardi,

I used the official implementation from here. I trained the base model for 1M steps on a v3-32 TPU.

The configuration file looks like:

# coding=utf-8

"""Config controlling hyperparameters for pre-training."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

class PretrainingConfig(object):
  """Defines pre-training hyperparameters."""

  def __init__(self, model_name, data_dir, **kwargs):
    self.model_name = model_name
    self.debug = False  # debug mode
    self.do_train = True  # pre-train
    self.do_eval = False  # evaluate generator/discriminator on unlabeled data

    # loss functions
    self.electra_objective = True  # if False, use the BERT objective instead
    self.gen_weight = 1.0  # masked language modeling / generator loss
    self.disc_weight = 50.0  # discriminator loss
    self.mask_prob = 0.15  # percent of input tokens to mask out / replace

    # optimization
    self.learning_rate = 5e-4
    self.lr_decay_power = 1.0  # linear weight decay by default
    self.weight_decay_rate = 0.01
    self.num_warmup_steps = 10000

    # training settings
    self.iterations_per_loop = 200
    self.save_checkpoints_steps = 100000
    self.num_train_steps = 1000000
    self.num_eval_steps = 100
    self.keep_checkpoint_max = 0

    # model settings
    self.model_size = "base"  # one of "small", "medium-small", or "base"
    # override the default transformer hparams for the provided model size; see
    # modeling.BertConfig for the possible hparams and util.training_utils for
    # the defaults
    self.model_hparam_overrides = (
        kwargs["model_hparam_overrides"]
        if "model_hparam_overrides" in kwargs else {})
    self.embedding_size = None  # bert hidden size by default
    self.vocab_size = 32000  # number of tokens in the vocabulary
    self.do_lower_case = False  # lowercase the input?

    # ConvBERT additional config
    self.conv_kernel_size = 9
    self.linear_groups = 2
    self.head_ratio = 2
    self.conv_type = "sdconv"

    # generator settings
    self.uniform_generator = False  # generator is uniform at random
    self.untied_generator_embeddings = False  # tie generator/discriminator
                                              # token embeddings?
    self.untied_generator = True  # tie all generator/discriminator weights?
    self.generator_layers = 1.0  # frac of discriminator layers for generator
    self.generator_hidden_size = 0.25  # frac of discrim hidden size for gen
    self.disallow_correct = False  # force the generator to sample incorrect
                                   # tokens (so 15% of tokens are always
                                   # fake)
    self.temperature = 1.0  # temperature for sampling from generator

    # batch sizes
    self.max_seq_length = 512
    self.train_batch_size = 128
    self.eval_batch_size = 128

    # TPU settings
    self.use_tpu = True
    self.tpu_job_name = None
    self.num_tpu_cores = 32
    self.tpu_name = "convbert"  # cloud TPU to use for training
    self.tpu_zone = None  # GCE zone where the Cloud TPU is located in
    self.gcp_project = None  # project name for the Cloud TPU-enabled project

    # default locations of data files
    self.pretrain_tfrecords = os.path.join(
        data_dir, "output-512/pretrain_data.tfrecord*")
    self.vocab_file = os.path.join(data_dir, "vocab.txt")
    self.model_dir = os.path.join(data_dir, "models", model_name)
    results_dir = os.path.join(self.model_dir, "results")
    self.results_txt = os.path.join(results_dir, "unsup_results.txt")
    self.results_pkl = os.path.join(results_dir, "unsup_results.pkl")

    # update defaults with passed-in hyperparameters
    self.update(kwargs)

    self.max_predictions_per_seq = int((self.mask_prob + 0.005) *
                                       self.max_seq_length)

    # debug-mode settings
    if self.debug:
      self.train_batch_size = 8
      self.num_train_steps = 20
      self.eval_batch_size = 4
      self.iterations_per_loop = 1
      self.num_eval_steps = 2

    # defaults for different model sizes
    if self.model_size in ["medium-small"]:
      self.embedding_size = 128
      self.conv_kernel_size = 9
      self.linear_groups = 2
      self.head_ratio = 2
    elif self.model_size in ["small"]:
      self.embedding_size = 128
      self.conv_kernel_size = 9
      self.linear_groups = 1
      self.head_ratio = 2
      self.learning_rate = 3e-4
    elif self.model_size in ["base"]:
      self.generator_hidden_size = 1/3
      self.learning_rate = 2e-4
      self.train_batch_size = 256
      self.eval_batch_size = 256
      self.conv_kernel_size = 9
      self.linear_groups = 1
      self.head_ratio = 2

    # passed-in-arguments override (for example) debug-mode defaults
    self.update(kwargs)

  def update(self, kwargs):
    for k, v in kwargs.items():
      if k not in self.__dict__:
        raise ValueError("Unknown hparam " + k)
      self.__dict__[k] = v
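
For illustration, here is a minimal usage sketch of the class above (the model name and data directory are placeholders, not the exact values used for the Turkish model). Unknown keyword arguments raise a ValueError in update(), and because update(kwargs) runs again after the model-size branch, passed-in values win over the size-specific defaults:

# Minimal usage sketch with placeholder names and paths.
config = PretrainingConfig(
    model_name="convbert-base-turkish",      # placeholder model name
    data_dir="gs://my-bucket/turkish-data",  # placeholder data location
    num_train_steps=1000000,
    save_checkpoints_steps=100000,
)

print(config.learning_rate)            # 2e-4, set by the "base" branch
print(config.train_batch_size)         # 256, set by the "base" branch
print(config.max_predictions_per_seq)  # int((0.15 + 0.005) * 512) = 79
print(config.pretrain_tfrecords)       # gs://my-bucket/turkish-data/output-512/pretrain_data.tfrecord*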

I used the same pre-training data as for training the ELECTRA model.

The cheatsheet is coming in a few days. I'm currently preparing the release of a new ELECTRA model that was trained on the Turkish part of the recently released mC4 corpus, which has a total size of ~220GB of crawled text. Training has finished; I'm currently spending some time updating the readme, and the repo is also getting a new logo :hugs:

stefan-it commented 3 years ago

I've added the ConvBERT cheatsheet, including the configuration file, now :)
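
For anyone who just wants to use the released checkpoint rather than pre-train it, here is a quick sketch of loading it with Hugging Face transformers; the model ID below is assumed from the usual dbmdz naming scheme, so check the readme for the exact identifier:

from transformers import AutoModel, AutoTokenizer

# Assumed model ID (dbmdz naming scheme); see the repo readme for the exact name.
model_id = "dbmdz/convbert-base-turkish-cased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, hidden_size])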

finardi commented 3 years ago

Thank you so much Stefan. This file will help me a lot!!!