keras-team / keras-hub

Modular Natural Language Processing workflows with Keras
Apache License 2.0

Synonym Replacement Layer - Data Augmentation #94

Closed. aflah02 closed this issue 1 month ago.

aflah02 commented 2 years ago

Since issue #39 is very broad, I've created this issue to specifically discuss the Synonym Replacement Layer.

So far I've made a rough parser here. I'll hopefully finalize it in a day or two, after which we can have a look at the performance and implementation specifics.

I'll post all relevant updates here

mattdangerw commented 2 years ago

@aflah02 thanks! And yes +1 to opening issues for specific layers like this, rather than the "catch all" issues.

aflah02 commented 2 years ago

@mattdangerw Sorry for the delay; it's now ready here (running instructions are in the README), and I think it's suitable for our task, as we can extract synonyms easily. Do let me know if I'm missing something. I didn't focus on POS tags much since, at the end of the day, we'll only need the synonyms to choose randomly from, and the literature on synonym replacement doesn't seem to use any information from POS tags.

aflah02 commented 2 years ago

@mattdangerw Any reviews on this? If it looks fine, I can get started with the layer.

mattdangerw commented 2 years ago

Will take a look next Monday!

mattdangerw commented 2 years ago

@aflah02 a few questions...

aflah02 commented 2 years ago

@mattdangerw

mattdangerw commented 2 years ago

This is something I've been thinking about too, and the best I can think of right now is that we could support JSON dumps: essentially two dumps, one telling which word is in which synset and the other telling which synset has which words.

A list of lists of synsets would be enough to specify everything, right? Couldn't you build up the map of individual words to synsets inside the layer by iterating over each synset during layer initialization?

class SynonymReplacement(keras.layers.Layer):
    """Augments input by replacing words with synonyms.

    Args:
    synonym_sets: Either a list of lists of words that can be considered
            replacements for each other, or a filepath to a json file
            containing a list of lists of string words.
        replacement_rate: The desired rate at which to replace words.
        max_replacements: The maximum number of words to replace.

    Examples:

    Basic usage.
    >>> augmenter = keras_nlp.layers.SynonymReplacement(
    ...     synonym_sets=[["dog", "cat", "ox", "moose"]],
    ...     replacement_rate=0.4,
    ... )
    >>> augmenter(["dog dog dog dog dog"])
    <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'cat dog ox dog dog'], dtype=object)>

    Using word net.
    >>> augmenter = keras_nlp.layers.SynonymReplacement(
    ...     word_net_lang="en",
    ...     replacement_rate=0.4,
    ... )
    >>> augmenter(["they have a big hat"])
    <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'they own a large hat'], dtype=object)>
    """
    pass

Made a rough sketch of how I'm imagining the API could look. What do you think?
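
For the lookup question above, a minimal sketch of the init-time pass I have in mind (the name build_synonym_index is just illustrative, not a settled API):

import collections

def build_synonym_index(synonym_sets):
    """Map each word to the indices of the synonym sets containing it."""
    word_to_set_ids = collections.defaultdict(list)
    for set_id, synonym_set in enumerate(synonym_sets):
        for word in synonym_set:
            word_to_set_ids[word].append(set_id)
    return word_to_set_ids

index = build_synonym_index([["dog", "cat", "ox", "moose"]])
# index["dog"] -> [0]: one pass over the data at init, then O(1) lookup
# per word at call time.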

mattdangerw commented 2 years ago

Another key question: how do the EDA implementation and others handle stemming and lemmatization? E.g. finding a synonym for "hats" vs "hat" or "runs" vs "running".

aflah02 commented 2 years ago

@mattdangerw Thanks for the review!

For the first part, we can do this, but won't it be more expensive time-complexity wise? If I understand this correctly, for each word we first go over the entire list of lists and find the list which has the word? In the worst case that means checking every single element, as our target word could be the last word in the last list, or am I missing something here?

The structure of the API looks pretty good and intuitive!

For the second question, this is something which I totally overlooked, and we'll need one. However, implementing a lemmatizer/stemmer from scratch would be very tedious. I'll have a look into it, though I feel using NLTK would ease off a lot of these issues; currently, when I look for "runs", there is no such word in the dictionary.
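
For reference, a rough sketch of what the NLTK route would look like (assuming NLTK is installed and the wordnet corpus has been fetched with nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("hats"))           # -> "hat" (default POS is noun)
print(lemmatizer.lemmatize("runs", pos="v"))  # -> "run"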

aflah02 commented 2 years ago

@mattdangerw After some investigation, it turns out it's not as difficult as it seemed initially. I can try making a lemmatizer that uses WordNet's morphological processing, described here.

From the brief look I've had at the docs so far, I feel it will be pretty similar to the Porter Stemmer (implemented here), which is just a bunch of rules.
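
To give an idea of what I mean, a rough sketch of the suffix-detachment part (rule lists abridged from the WordNet docs; the names here are placeholders, and a full implementation would also consult WordNet's exception lists for irregular forms like "ran" -> "run"):

# Ordered suffix-detachment rules, longer suffixes first so e.g.
# "churches" matches ("ches", "ch") before the bare ("s", "") rule.
DETACHMENT_RULES = {
    "noun": [("ches", "ch"), ("shes", "sh"), ("ses", "s"), ("xes", "x"),
             ("zes", "z"), ("ies", "y"), ("s", "")],
    "verb": [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"),
             ("ed", ""), ("ing", "e"), ("ing", ""), ("s", "")],
}

def lemmatize(word, pos, vocabulary):
    """Return the first candidate base form found in the vocabulary."""
    if word in vocabulary:
        return word
    for suffix, replacement in DETACHMENT_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return None

print(lemmatize("churches", "noun", {"church"}))  # -> "church"
print(lemmatize("baked", "verb", {"bake"}))       # -> "bake"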

aflah02 commented 2 years ago

After some further tinkering and reading of the docs, I have an implementation of the lemmatizer + synonym finder here. It's not yet complete, as it fails to handle phrases as opposed to single-word input, but I'm working on it, and it might miss a few edge cases which need some further work. Once this is done, though, it seems we can do this without any external libraries.

Edit: I've handled the edge cases. However, I'm not too sure how correct the implementation is; I feel the paper tries to attempt a recursive implementation, so I'll continue to work on that.

mattdangerw commented 2 years ago

Lemmatization and stemming are things we may need to add eventually, but they are kind of a whole can of worms. Ideally we could consider them separately. I also worry about multilingual support there.

Is it possible to prepare a wordnet that contains all the morphological forms in each synset? Then we could sidestep these questions here and still provide a simple layer as described above.
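
As a sketch of the offline step I'm picturing (using NLTK's WordNet corpus purely for one-time preparation, so the layer itself stays dependency-free; any morphological expansion would also happen at this stage rather than in the layer):

import json
from nltk.corpus import wordnet as wn

synonym_sets = []
for synset in wn.all_synsets():
    # lemma_names() gives the base-form words in the synset; adding
    # inflected forms ("hats" alongside "hat") would slot in here.
    words = [name.replace("_", " ") for name in synset.lemma_names()]
    if len(words) > 1:  # singleton synsets offer no replacements
        synonym_sets.append(words)

with open("wordnet_synonyms.json", "w") as f:
    json.dump(synonym_sets, f)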

aflah02 commented 2 years ago

Right, multilingual support is also something which we need to take care of. I doubt there is such a way without implementing the parser ourselves, as there can be tons of words which have the same form after the processing is done on them. I feel we should maybe first work on adding a WordNet API, which we can then use for all sorts of tasks. For now we just need a lemmatizer and a synonym finder. Also, if a stemmer is needed, the Porter Stemmer is fairly easy to implement (I have one here), and we can use that to see how the API will function, as I think the stemmer and lemmatizer will have very similar if not identical interfaces, just different processing.

aflah02 commented 2 years ago

@mattdangerw While we figure this layer out, I think I could work on a different EDA operation which does not require WordNet, so that we can get an API design in place for how these layers will work. I'll open that as a separate issue.

mattdangerw commented 2 years ago

Yeah, that's a good call re starting on deletion.

I am worried about adding a ton of language-specific, rules-based code that we directly ship in our library. If we could push that rules-based logic into the preparation of a synonym dataset that's fed into this layer, that's definitely preferable for maintainability and accessibility (we aren't hard-coding for languages).

Let's keep discussing here. I need to dig into the weeds of WordNet's morphological processing you linked above.

aflah02 commented 2 years ago

@mattdangerw Yup, agreed: a good synonym dataset would alleviate all these language-specific issues. I'll continue sharing anything useful that I find here.