google-research / multilingual-t5


pre-training data sampler #102

Closed StephennFernandes closed 2 years ago

StephennFernandes commented 2 years ago

@craffel , hey there

In the paper, on page 3 in Section 3.2 (mT5), you mention a data sampling technique that keeps a good balance between low- and high-resource languages (preventing the model from overfitting on low-resource languages and underfitting on high-resource ones) by sampling examples from each language L with probability p(L) ∝ |L|^α, where |L| is the number of examples in that language and α controls how much the distribution is boosted toward low-resource languages.

Could you please link me to the specific data sampler code in the mT5 codebase? If you could also point me to the pre-processing (for pre-training) and pre-training scripts, it would be a great help, as I am trying to pre-train mT5 using the T5 pre-training script from Hugging Face: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py
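
For concreteness, a toy illustration of that sampling rule (the corpus sizes below are made up):

# Toy illustration of p(L) ∝ |L|**alpha with alpha = 0.3 (made-up sizes).
sizes = {"english": 3_000_000, "marathi": 30_000}
alpha = 0.3
weights = {lang: n ** alpha for lang, n in sizes.items()}
total = sum(weights.values())
probs = {lang: w / total for lang, w in weights.items()}
# Proportional sampling would give English ~99% of the examples;
# with alpha = 0.3 its share drops to roughly 80% and Marathi's rises to ~20%.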

craffel commented 2 years ago

Hi, it uses t5.data.rate_num_examples (which itself calls seqio.mixing_rate_num_examples) with a temperature argument: https://github.com/google-research/multilingual-t5/blob/master/multilingual_t5/tasks.py#L33
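
Roughly, that rate function gives each task a mixing rate equal to its example count (optionally scaled and capped) raised to 1/temperature. A minimal sketch of the idea, not the library code verbatim:

def rate_sketch(num_examples, scale=1.0, maxnum=None, temperature=1.0):
    # Scale and optionally cap the task's example count, then raise it
    # to 1/temperature to get the task's mixing rate.
    rate = num_examples * scale
    if maxnum:
        rate = min(rate, maxnum)
    if temperature != 1.0:
        rate = rate ** (1.0 / temperature)
    return rate

# temperature = 1.0 mixes tasks in proportion to corpus size; larger
# temperatures flatten the mixture toward low-resource languages.
# The mT5 tasks use temperature = 1 / 0.3, matching alpha = 0.3 in the paper.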

StephennFernandes commented 2 years ago

@craffel thanks for the reply and the help, Colin. By any chance, would you have a simpler implementation that works with Hugging Face Transformers? I am actually planning to pretrain using Hugging Face, following this script: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py. To be honest, I tried wiring seqio into the Hugging Face training script but couldn't get it to work.

craffel commented 2 years ago

No, not that I know of.

StephennFernandes commented 2 years ago

Okay, but would seqio work with Hugging Face?

craffel commented 2 years ago

Yep, you can just get seqio to return the tf.data.Dataset and call as_numpy_iterator to get an iterator over examples as normal Python dicts, which can be fed into HF. I don't know of any examples of this being done, though. Maybe @adarob does.
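
A rough sketch of how that could be wired up (this is an assumption, not an existing example: it uses a PyTorch-style iterable dataset, a task name like the one registered later in this thread, and leaves the HF-specific key renaming and collation to the reader):

import seqio
from torch.utils.data import IterableDataset

class SeqioIterableDataset(IterableDataset):
    """Wraps a seqio task/mixture as an iterable of plain Python dicts."""

    def __init__(self, task_name, split, sequence_length):
        self.task_name = task_name
        self.split = split
        self.sequence_length = sequence_length

    def __iter__(self):
        ds = seqio.get_mixture_or_task(self.task_name).get_dataset(
            sequence_length=self.sequence_length,
            split=self.split,
            shuffle=True)
        # as_numpy_iterator() yields dicts of numpy arrays.
        for ex in ds.as_numpy_iterator():
            # Note: HF T5 models expect keys such as "input_ids"/"labels",
            # so seqio's "inputs"/"targets" would need renaming and padding
            # before being fed to a Trainer or DataLoader.
            yield ex

# Hypothetical usage:
# train_data = SeqioIterableDataset("oscar_marathi_corpus", "train",
#                                   {"inputs": 512, "targets": 512})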

StephennFernandes commented 2 years ago

@adarob could you please show me how exactly you used seqio with Hugging Face?

StephennFernandes commented 2 years ago

@craffel, I believe @adarob is unavailable at the moment. I didn't clearly get the part about calling as_numpy_iterator and how to feed the result into HF. Could you please show me some examples of how this could be done?

craffel commented 2 years ago

No, sorry, this isn't something I can help with.

adarob commented 2 years ago

You should be able to call dataset.as_numpy_iterator() and get an iterator of python dicts. I don't know enough about HF to help from there.

StephennFernandes commented 2 years ago

@adarob, thanks for replying. I have actually built a hacky way of returning the output of seqio.get_mixture_or_task().get_dataset() through .as_numpy_iterator(), which gives me numpy values.

The following is the code:

import functools

import seqio
import tensorflow as tf
import t5.data
from datasets import load_dataset
from t5.data import postprocessors
from t5.data import preprocessors
from t5.evaluation import metrics
from seqio import FunctionDataSource, utils

TaskRegistry = seqio.TaskRegistry

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(
        vocabulary=t5.data.get_default_vocabulary(), add_eos=True,
        required=False),
    "targets": seqio.Feature(
        vocabulary=t5.data.get_default_vocabulary(), add_eos=True)
}

def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]

def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )

@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map"""
    return {**key_map, target_key: x}

dataset_name = 'oscar-corpus/OSCAR-2109'
subset= 'mr'
dataset_params = {"path": dataset_name, "language":subset, "use_auth_token":True}
dataset_shapes = None

TaskRegistry.add(
    "oscar_marathi_corpus",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
        splits=("train", "validation"),
        caching_permitted=False,
        num_input_examples=dataset_shapes,
    ),
    preprocessors=[
        functools.partial(
            target_to_key, key_map={
                "inputs": None,
                "targets": None,
            }, target_key="targets"),
        seqio.preprocessors.tokenize,
        # seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={"targets": seqio.Feature(vocabulary=t5.data.get_default_vocabulary(), add_eos=True)},
    metric_fns=[]
)

dataset = seqio.get_mixture_or_task("oscar_marathi_corpus").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    shuffle=True,
    num_epochs=1,
    use_cached=False,
    seed=42
)
for _, ex in zip(range(5), dataset.as_numpy_iterator()):
  print(ex) 

But the thing is, it returns the values as input IDs after the preprocessing has already been applied to the dataset, whereas the Hugging Face T5 training script takes care of all the preprocessing and related steps itself.

I actually need the output as raw text strings, which I could then preprocess in the Hugging Face training script. I only want to use the mixture functionality from seqio and avoid all the preprocessing, tokenization, etc.

In summary, I only need a way to feed in raw text samples from multiple languages, use the mixture from seqio, and get back an iterator that outputs samples drawn from a mixture of all the languages (in raw text form).

Is there a way of actually achieving that?

If not, do you know of any way I could get the mixture functionality without using seqio?

@patrickvonplaten gently pinging you here. Do you know of any solution to this issue?

patrickvonplaten commented 2 years ago

I don't really know here; maybe you could try to get some help on the forum: https://discuss.huggingface.co/ ?

cyk1337 commented 2 years ago

Hi, it uses t5.data.rate_num_examples (which itself calls seqio.mixing_rate_num_examples) with a temperature argument: https://github.com/google-research/multilingual-t5/blob/master/multilingual_t5/tasks.py#L33

Hi, so multiple languages are first fed to separate data loaders and sampled according to the rescaled rates during training, so that each batch contains only one language. Is that correct? And when doing T5 masking, does it first concatenate adjacent samples and split them into equal lengths, and then apply the masking without padding?

cyk1337 commented 2 years ago

Hi @StephennFernandes, have you found a solution for the data processing? How do you handle the sampling for multiple languages?

StephennFernandes commented 2 years ago

@cyk1337 yeah, I was able to find a solution, but I haven't fully implemented it yet, as I went on to pretrain mT5 with the T5X trainer rather than the Hugging Face T5 trainer.

However, I'll get back to the Hugging Face trainer.

The following is the implementation:

import functools

import seqio
import tensorflow as tf
import t5.data
from datasets import load_dataset
from t5.data import postprocessors
from t5.data import preprocessors
from t5.evaluation import metrics
from seqio import FunctionDataSource, utils

TaskRegistry = seqio.TaskRegistry

def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]

def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )

@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map"""
    return {**key_map, target_key: x}

dataset_name = 'oscar-corpus/OSCAR-2109'
subset= 'mr'
dataset_params = {"path": dataset_name, "language":subset, "use_auth_token":True}
dataset_shapes = None

TaskRegistry.add(
    "oscar_marathi_corpus",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
        splits=("train", "validation"),
        caching_permitted=False,
        num_input_examples=dataset_shapes,
    ),
    preprocessors=[
        functools.partial(
            target_to_key, key_map={
                "targets": None,
            }, target_key="targets")],
    output_features={"targets": seqio.Feature(vocabulary=seqio.PassThroughVocabulary(size=0), add_eos=False, dtype=tf.string, rank=0)},
    metric_fns=[]
)

dataset = seqio.get_mixture_or_task("oscar_marathi_corpus").get_dataset(
    sequence_length=None,
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)

To print/iterate through samples from the dataset, use this:

for _, ex in zip(range(5), dataset):
  print(ex["targets"].numpy().decode())

The above is only for one dataset (one language). You can also register a seqio mixture that mixes together multiple languages (HF datasets), with the balance between them controlled by the rate assigned to each task.

Refer to this code:

seqio.MixtureRegistry.add(
  "multilingual_mix_3",
  ["assamese_span_curruption", "bengali_span_curruption", 
  "bhisnupuriya_span_curruption", "bodo_span_curruption", 
  "divehi_span_curruption", "dogri_span_curruption", 
  "english_span_curruption", "gujarati_span_curruption",
  "hindi_span_curruption", "kannada_span_curruption", 
  "kashmiri_span_curruption", "konkani_span_curruption", 
  "maithili_span_curruption", "malayalam_span_curruption",
  "manipuri_span_curruption", "marathi_span_curruption",
  "nepali_span_curruption", "odia_span_curruption",
  "panjabi_span_curruption", "sanskrit_span_curruption",
  "tamil_span_curruption", "telugu_span_curruption",
   "urdu_span_curruption" ],
  default_rate=3
)
# load the mixture as a dataset. 
dataset = seqio.get_mixture_or_task("multilingual_mix_3").get_dataset(
    sequence_length=None,
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)

# use split="validation" or "test" to load a val/test mixture 

You could play around with the rate values to see what works well for you. Note that default_rate=3, as written above, simply gives every task the same constant rate; the temperature-scaled sampling from the paper is slightly different, and the mT5 authors found α = 0.3 (roughly a temperature of 1/0.3 ≈ 3.33) to be a good compromise when pretraining on 101 languages with very different amounts of data. A sketch of temperature-based rates is shown below.
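
Here is a minimal sketch of temperature-scaled mixing (an untested assumption on my part; it requires each task to be registered with num_input_examples so the rate function can read the example counts):

import functools
import seqio

# Sketch: instead of a constant rate, give each task a rate equal to its
# example count raised to 1/temperature, via seqio.mixing_rate_num_examples.
# alpha = 0.3 from the mT5 paper corresponds to temperature = 1 / 0.3.
seqio.MixtureRegistry.add(
  "multilingual_mix_temperature",
  ["assamese_span_curruption", "bengali_span_curruption"],  # ...and the rest
  default_rate=functools.partial(
      seqio.mixing_rate_num_examples, temperature=1.0 / 0.3),
)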

For any additional information, kindly refer to the seqio repo; it has great examples of defining tasks and mixtures that work with the T5X scripts, and they can be tweaked to be useful for your own implementations too.

cyk1337 commented 2 years ago

Hey @StephennFernandes, thank you so much for the detailed reply!

Actually, I reimplemented it myself rather than using the t5 library directly. Did you mix and pre-sample the different languages and use only one dataloader during training? If so, different languages can end up concatenated into the same example, and a mask span can also cross multiple languages. How did you deal with that?

StephennFernandes commented 2 years ago

@cyk1337 hey, I hadn't actually thought of this. As I was using the T5X library, I assumed T5X would take care of that, but I would be interested to know how this actually works.

Does the paper say anything about this?

Gently pinging the mT5 authors @adarob @craffel, could you please help clear up this doubt?

adarob commented 2 years ago

Each batch contains multiple languages but each concatenated/masked example should be a single language since the concatenation/masking happens before the languages are mixed.