This is also partly inspired by the ideas mentioned in the GSOC Document
@mattdangerw and the rest of the Keras team, it would be great to hear your thoughts on this.
As a starting point, I've implemented EDA while also fixing some of the bugs present in the original EDA code, such as not excluding stop words in some cases, as a few issues point out: https://colab.research.google.com/drive/192mGhABi1n51cg8SFLvuUCwvsYIMNvx1?usp=sharing. My next step is to show that this achieves the gains mentioned in the paper by training a similar model on one of the datasets used, with and without these methods.
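For reference, here is a minimal sketch of one of the EDA operations (random deletion); the whitespace split and the p=0.1 default are simplifications, not the exact code from the notebook above:

```python
import random

def random_deletion(sentence, p=0.1):
    """EDA random deletion: drop each word with probability p, keeping at least one word."""
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if random.random() > p]
    # Never return an empty sentence; fall back to one randomly chosen word.
    return " ".join(kept) if kept else random.choice(words)

print(random_deletion("the quick brown fox jumps over the lazy dog", p=0.2))
```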
I've implemented it, and it seems I'm getting almost the 3% gains mentioned in the paper: https://github.com/aflah02/Easy-Data-Augmentation-Implementation/blob/main/EDA.ipynb. What should the next step be? @mattdangerw or anyone else from the Keras team.
Thank you very much for digging into this!
Bear with us for a bit here as we figure out the approach we would like to take with data augmentation. We are taking a look and will reply more soon. There are a lot of questions this brings up (how to handle static assets, how to make layerized versions of these components, multilingual support). But overall, data augmentation is something we would like to explore.
@mattdangerw Sure! I had quite a bit of fun implementing this. While you figure out the approach you'd prefer, I'll try implementing other techniques too.
Backtranslation on a smaller sample size also seems to give pretty good results: https://github.com/aflah02/BackTranslation-Based-Data-Augmentation. Maybe we could parallelize it to make it faster; it's painfully slow on large datasets right now, so I'll look into that too.
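A rough sketch of how the round trip could be parallelized; `translate(text, src=..., tgt=...)` is a hypothetical stand-in for whatever translation model or API is used, not a function from the repo above:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def back_translate(text, translate, pivot="de"):
    """Round-trip text through a pivot language to get a paraphrase."""
    return translate(translate(text, src="en", tgt=pivot), src=pivot, tgt="en")

def back_translate_batch(texts, translate, pivot="de", workers=8):
    # Threads are enough when `translate` is an I/O-bound API call; for a local
    # model, batched inference or a process pool would likely be a better fit.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fn = partial(back_translate, translate=translate, pivot=pivot)
        return list(pool.map(fn, texts))
```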
If you are looking for something to pick up in the meantime, we've opened a couple of issues tagged "good first issue" where our design is fully defined. We will try to keep expanding that list.
I recently came across the paper SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness, which uses a corruption and a reconstruction function to create new samples from the real data. It's an interesting one: although it's computationally more expensive than rule-based techniques, it gives substantial gains on out-of-domain (OOD) samples across a couple of datasets. This could be one of the techniques we implement natively so users can get gains on OOD samples.
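For illustration, a minimal sketch of just the corruption half; the 15% mask rate and the [MASK] token are assumptions, and the reconstruction step (sampling replacements from a pretrained masked LM) is not shown:

```python
import random

def ssmba_corrupt(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Corruption step: randomly mask a fraction of the tokens.

    Reconstruction (not shown) would feed this corrupted sequence to a masked
    language model and sample replacements for the masked positions.
    """
    return [mask_token if random.random() < mask_prob else t for t in tokens]

print(ssmba_corrupt("the movie was surprisingly good".split()))
```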
Another great paper is Synthetic and Natural Noise Both Break Neural Machine Translation, which aims to make NMT models more robust to typos and other corruptions that humans can easily overcome. It uses four techniques, including some interesting ones such as mimicking keyboard typos by swapping in characters from the keyboard neighbourhood of the original character.
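A tiny sketch of that keyboard-typo idea; the adjacency map below is deliberately partial and the 5% replacement rate is just an illustrative choice:

```python
import random

# A small, partial QWERTY adjacency map; a real one would cover the full keyboard.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "n": "bhjm", "t": "rfgy", "i": "ujko",
}

def keyboard_typo(text, p=0.05):
    """Replace each character with a keyboard neighbour with probability p."""
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        if neighbours and random.random() < p:
            out.append(random.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_typo("this sentence contains no typos at all"))
```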
Hi @mattdangerw, while I'm working on the other issue, are there any updates on the plan for incorporating these DA techniques?
Yes, we have been discussing this. I've been trying to capture some design requirements we have. Here's what I have so far...
1) Augmentation functionality should be exposed as layers, not utilities.
2) Layers should take raw untokenized strings as input, and output augmented raw untokenized strings.
3) Layer computations should be representable as a TensorFlow graph, but this graph can include calls to tf.numpy_function or tf.py_function.
4) We do not want to add new dependencies (no using nltk).
5) Layers should be fully usable with user-provided data.
For 5) and EDA, that means we would need some way to represent a synonym lookup table that a user could provide. I'm unsure what this should look like. Is WordNet the prevalent data source here? Do they have data available in a simple file format (JSON, text lines)?
It would also be helpful to get a bit of the lay of the land here. For the papers mentioned in the survey you linked (https://github.com/styfeng/DataAug4NLP is the continuously updated GitHub version), we should try to get a sense of which techniques are most commonly used. Citations are not a perfect metric, but they might be the best place to start.
Hey @mattdangerw, thanks for sharing this. Just to confirm, this means these augmentation layers will always be applied before any other operations, right, since they take in and return untokenized inputs? I haven't seen an example so far, but wouldn't it get tricky if some future work introduced augmenting data during training rather than before it, which is what most current works do? WordNet is quite commonly used for synonym replacement tasks from what I could find, and the original papers also used WordNet. As for having users provide their own synonym set, we could accept a parsed dictionary with words as keys and lists of synonyms as values; someone did parse WordNet and release it as JSON here. Also, one of the papers I listed below used the English thesaurus from the MyThes component of the LibreOffice project, which is in turn created from WordNet, but I haven't used that component, so I'll have to research that bit.
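For concreteness, the user-provided table I have in mind would be nothing more than a mapping from a word to its synonyms, along these lines (illustrative entries only, not actual WordNet data):

```python
# Illustrative only; a real table would be parsed from WordNet or supplied by the user.
synonyms = {
    "quick": ["fast", "speedy", "rapid"],
    "happy": ["glad", "joyful", "content"],
}
```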
I did try to search for the most cited ones. Among the rule-based techniques, EDA and Synonym Replacement (Character-Level Convolutional Networks for Text Classification) seem to be the most cited, with 684 and 4079 citations respectively. I think these are a reasonable place to start, and I'll keep looking for highly cited non-rule-based techniques too. There are also a ton of small rule-based techniques that are used depending on the use case and are provided by other libraries like nlpaug, such as simulating keyboard or OCR typos; for OCR errors, for instance, there is a file listing common errors.
There is also the official WordNet download (https://wordnet.princeton.edu/download/current-version), and the file format is documented at https://wordnet.princeton.edu/documentation/wndb5wn, so we could write our own parser for it.
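A rough sketch of how one of the data.* files could be parsed into a synonym dictionary, assuming the field layout described in wndb(5WN) (header lines starting with two spaces, a two-digit hexadecimal word count, then word/lex_id pairs); this would need checking against the real files:

```python
from collections import defaultdict

def parse_wordnet_data(path):
    """Build a word -> synonyms mapping from one WordNet data.* file (e.g. data.noun)."""
    synonyms = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("  "):  # copyright / license header block
                continue
            fields = line.split(" ")
            w_cnt = int(fields[3], 16)  # word count is two hexadecimal digits
            # Words sit at positions 4, 6, 8, ... as (word, lex_id) pairs.
            words = [fields[4 + 2 * i].replace("_", " ") for i in range(w_cnt)]
            for w in words:
                synonyms[w].update(x for x in words if x != w)
    return synonyms
```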
@aflah02 the augmentations should be applied as operations, but applied before tokenization. A lot of the discussion we've had has been around 3) above. The layer transformations need to be expressible as a graph of TensorFlow ops to work with tf.data, but we believe that doing transformations with purely tf.strings operations would be too restrictive, so using tf.numpy_function will allow writing pure Python transformations of strings (at a performance hit).
Might be a little simpler to frame this in terms of workflows. A common flow we would expect is something like this...
```python
def preprocess(x):
    x = keras_nlp.layers.SomeAugmentation()(x)
    x = keras_nlp.tokenizers.SomeTokenizer()(x)
    return x

dataset = ...  # load text dataset from disk
dataset = dataset.map(preprocess).batch(32)

model = ...
model.fit(dataset)  # Each epoch will now apply a different augmentation.
```
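To make 3) concrete, here is a rough, hypothetical sketch of what the inside of such an augmentation layer could look like; the RandomWordDeletion name, the rate parameter, and the specific augmentation are placeholders, not a settled design:

```python
import random

import tensorflow as tf

class RandomWordDeletion(tf.keras.layers.Layer):
    """Hypothetical layer: drops words from an untokenized string with probability `rate`."""

    def __init__(self, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate

    def _augment(self, x):
        # Pure Python; `x` arrives as UTF-8 bytes via tf.numpy_function.
        words = x.decode("utf-8").split()
        kept = [w for w in words if random.random() > self.rate] or words[:1]
        return " ".join(kept).encode("utf-8")

    def call(self, inputs):
        # Wrapping the Python transformation keeps the layer usable inside a
        # tf.data pipeline, at the cost of dropping out of the TF graph.
        return tf.numpy_function(self._augment, [inputs], tf.string)
```

Mapped per example (before .batch), this slots straight into the preprocess function above; handling batched string tensors would need a bit more work.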
Thanks! Should I get started now with some of the basic, easy-to-implement ones such as EDA, following the above scheme?
@aflah02 yeah, rather than EDA as a whole, maybe we should start by designing a layer for synonym replacement?
It's a strict subset of EDA, we would want it as a standalone layer anyway, and it will start answering a lot of the questions we have.
That sounds good. I'll also get started on parsing the WordNet data and try out the data in the GitHub release I shared.
I'm interested in contributing scripts that allow users to apply data augmentation techniques directly, without using external libraries. I can start with synonym replacement, random insertion, random swap, and random deletion from the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. Over time this can be extended to incorporate more techniques, such as the additional ones mentioned here. Any hints or tips on how I can get started?
Edit: I also found this survey paper which seems pretty useful: A Survey of Data Augmentation Approaches for NLP