keras-team / keras-hub

Pretrained model hub for Keras 3
Apache License 2.0

Sentence Shuffle Layer - Data Augmentation #216

Closed aflah02 closed 2 years ago

aflah02 commented 2 years ago

This issue proposes a Sentence Shuffle layer for data augmentation. For long paragraphs, the technique reorders the sentences in the input to generate new data samples.

I can't think of any args for this, though. I first spotted this technique here, and the original author also just did a split followed by a random shuffle and a rejoin.
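For illustration, here is a minimal plain-Python sketch of that split, shuffle and rejoin idea. It is not the proposed layer: the function name `shuffle_sentences`, the `seed` argument and the regex used to find sentence boundaries are all assumptions made for this example.

```python
import random
import re


def shuffle_sentences(text, seed=None):
    """Return `text` with its sentences in a random order.

    Hypothetical helper, not part of KerasNLP. Sentence boundaries are
    guessed naively: '.', '!' or '?' followed by whitespace.
    """
    # Split after sentence-ending punctuation, keeping the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # A local Random instance keeps the shuffle reproducible per seed.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return " ".join(sentences)


augmented = shuffle_sentences("I like cats. He likes dogs. We like birds.", seed=0)
print(augmented)
```

The output always contains the same sentences as the input, only in a possibly different order, and an input with a single sentence comes back unchanged.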

Open to any further thoughts on this, @chenmoneygithub @mattdangerw.

Expected API design:

import keras


class SentenceShuffle(keras.layers.Layer):
    """Augments input by randomly shuffling sentences.

    Examples:

    Basic usage.
    >>> augmenter = keras_nlp.layers.SentenceShuffle()
    >>> augmenter(["I like cats. He likes dogs."])
    <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'He likes dogs. I like cats.'], dtype=object)>
    """
    # Placeholder: the layer body is not designed yet.
    pass

mattdangerw commented 2 years ago

My suspicion is that we should not do this right now. Currently, sentence boundaries are not something we work with in KerasNLP. The assumption is that users will split a dataset into sentences with something like nltk before loading it into a training job. Our BERT example actually shows this.

The first thing to add, if anything, would probably be a sentence splitting utility: something that lets you load an entire document with tf.data and split it into sentences on the fly. But I'm not totally sure we would want that either. Sentence boundaries would be another place that could get language-specific. We would need to investigate and see whether we could come up with a compelling demo that would not be overly hard to maintain.

So anyway, overall I would say let's still focus on things like EDA for now!
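To make the language-specific concern concrete, here is a small sketch of a naive splitter. The function `naive_split_sentences` and its regex are hypothetical, written only for this example. It is plain Python and does not use nltk or tf.data.

```python
import re


def naive_split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Works for simple English text.
print(naive_split_sentences("I like cats. He likes dogs."))
# ['I like cats.', 'He likes dogs.']

# Abbreviations are wrongly treated as sentence ends.
print(naive_split_sentences("Dr. Smith likes cats. He likes dogs."))
# ['Dr.', 'Smith likes cats.', 'He likes dogs.']

# Japanese uses a different full stop and no spaces, so nothing is split.
print(naive_split_sentences("猫が好きです。犬も好きです。"))
# ['猫が好きです。犬も好きです。']
```

Handling cases like these is where a general-purpose utility would start to need per-language rules.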

aflah02 commented 2 years ago

This makes sense. Thanks, @mattdangerw! I'll focus on EDA and related layers for the time being.