rodrigopivi / Chatito

🎯🗯 Dataset generation for AI chatbots, NLP tasks, named entity recognition or text classification models using a simple DSL!
https://rodrigopivi.github.io/Chatito/
MIT License

Weighted probability #72

Closed nimf closed 5 years ago

nimf commented 5 years ago

Hi. Thank you for the Chatito!

Our team has been using Chatito pretty extensively recently, and we like it.

What I noticed is that it is hard for newcomers to get a sense of what chance of appearing in the training data each sentence will get. E. g.:

%[some_intent]
    ~[alias1]
    ~[alias2] ~[alias3]

Here we can't say what the distribution of the examples would be. Most likely ~[alias2] ~[alias3] will get a bigger share if the number of variations in each alias is about the same. But if alias1 has many more variations than alias2 * alias3, then ~[alias1] will get a bigger share. So we have to look up the aliases and walk down the tree of nested aliases to understand how many variations each sentence might get. This is rather error-prone and time-consuming.
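As a rough illustration of that lookup, here is a minimal Python sketch. It assumes the default generation gives each sentence a share proportional to the number of variations it can produce, and that a sentence's variation count is the product of the sizes of its aliases. The alias sizes are made-up numbers and `sentence_shares` is a hypothetical helper, not part of Chatito; nested aliases are not handled.

```python
from math import prod


def sentence_shares(sentences, alias_sizes):
    """Return each sentence's share of the examples.

    Each sentence is a list of alias names. Its variation count is assumed
    to be the product of the sizes of the aliases it uses, and its share is
    that count divided by the total over all sentences.
    """
    counts = [prod(alias_sizes[name] for name in sentence) for sentence in sentences]
    total = sum(counts)
    return [count / total for count in counts]


# Hypothetical alias sizes: all three aliases have 10 variations.
alias_sizes = {"alias1": 10, "alias2": 10, "alias3": 10}
intent = [["alias1"], ["alias2", "alias3"]]
print(sentence_shares(intent, alias_sizes))  # second sentence dominates (10 vs 100)

# If alias1 grows to 900 variations, the first sentence dominates instead (900 vs 100).
alias_sizes["alias1"] = 900
print(sentence_shares(intent, alias_sizes))
```

The point of the sketch is that the answer flips depending on numbers that are not visible in the intent definition itself.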

So we started using the even distribution a lot, specifying probabilities with the probability operator where needed (BTW, it is not that easy to get the probabilities to total 100; again, we have to calculate and re-adjust).

It helped. But I was thinking: how can we improve that? What if we create a flag for the generation command (e.g. --probability=weighted)? If this flag is set, all sentences get the same weight of 1, which can be modified with the probability operator. E.g.:

// here we have 50%/50% probability for the first and second sentence
%[some_intent]
    ~[alias1]
    ~[alias2] ~[alias3]

// here we have 2:1 ratio of first to second sentence. I. e. 66.66% for the first and 33.33% for the second
%[some_other_intent]
    *[2] ~[alias1]
    ~[alias2] ~[alias3]

I suppose the weighted probability might be even easier to grok, because *[2] means you want the number of examples of this kind to be doubled. So with the "weighted probability" we won't have to set even distribution everywhere, and it's easier to modify weights.
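The arithmetic behind the two snippets above can be sketched in a few lines of Python. `weights_to_percentages` is a hypothetical helper used only for illustration; every sentence without an operator is assumed to have weight 1.

```python
def weights_to_percentages(weights):
    """Convert relative weights into percentages, rounded to two decimals."""
    total = sum(weights)
    return [round(100 * weight / total, 2) for weight in weights]


# %[some_intent]: no operators, so both sentences have weight 1.
print(weights_to_percentages([1, 1]))  # [50.0, 50.0]

# %[some_other_intent]: *[2] on the first sentence gives a 2:1 ratio.
print(weights_to_percentages([2, 1]))  # [66.67, 33.33]
```

Unlike percentages, the weights never need to be re-balanced to add up to 100 when a sentence is added or removed.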

What do you think about it? Could this be a valuable addition to Chatito? I'd like to work on a PR for that.

rodrigopivi commented 5 years ago

Hi @nimf,

Thanks for the feedback. This tool is meant to help the people who use it, and all improvement ideas should be considered.

Being able to switch the sentence generation from the default "regular frequency" distribution to an "even" distribution is a great idea. This setting could be declared in the CLI params or the IDE config before generation (e.g.: --defaultDistribution=even or --defaultDistribution=regular), or at the DSL entity arguments level (e.g.: %[intent]("distribution": "even")), or at both levels, CLI and DSL, to have full control over each entity.

Regarding the probability operator: if the limit of 100 on the sum of all probabilities is removed and float values are accepted, then the weighted chances would just behave as documented in the ChanceJs lib (https://chancejs.com/miscellaneous/weighted.html). I think that would behave as you described.
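For readers unfamiliar with weighted selection, here is a small sketch of the idea using Python's standard `random.choices` as a stand-in for the ChanceJs function. This is an analogy, not Chatito's code: the assumption is only that both pick items in proportion to relative weights that may be floats and need not sum to 100. `sample_sentences` and the weights are illustrative.

```python
import random
from collections import Counter


def sample_sentences(sentences, weights, k, seed=0):
    """Draw k sentences with replacement, in proportion to their relative weights."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    return Counter(rng.choices(sentences, weights=weights, k=k))


# Float weights that do not sum to 100: roughly a 2.5:1 ratio is expected.
counts = sample_sentences(["~[alias1]", "~[alias2] ~[alias3]"], [2.5, 1.0], k=10_000)
print(counts)
```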

Yes, these changes would be valuable. You are most welcome to open a PR with these ideas implemented.

nimf commented 5 years ago

--defaultDistribution looks really good!

Regarding the probability operator, yeah, that would be exactly as described. My only concern is whether we should keep the percentage probability for regular distribution, or whether we should also provide some argument to control that.

// As weights with even distribution
%[intent]("distribution": "even")   // Weight    Resulting percentage
    *[2] ~[alias1]                  // 2         66.66%
    ~[alias2] ~[alias3]             // 1         33.33%

// As percents with regular distribution
%[intent2]("distribution": "regular")  // Resulting percentage
    *[66] ~[alias1]                    // 66%
    ~[alias2] ~[alias3]                // 34%

// As weights with regular distribution
%[intent2]("distribution": "regular")  // Max Count  Resulting Weight  Resulting percentage
    *[2] ~[alias1]                     // 100        200               28.57%
    ~[alias2] ~[alias3]                // 500        500               71.43%

rodrigopivi commented 5 years ago

Good catch, relative weights and percentage probabilities are different things. So maybe changing the name to 'chance operator' might be better than 'probability operator' since the idea is to control the relative weights or the percentage probability.

What do you think of treating the value as a relative weight if there is no '%' symbol, and as a percentage probability if it comes with '%'?
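A minimal sketch of how that distinction could be parsed, written in Python for illustration. The regular expression and `parse_chance_operator` are hypothetical and do not reflect Chatito's actual grammar; they only cover the forms used in this thread (`*[2]`, `*[0.5]`, `*[20%]`).

```python
import re

# Hypothetical pattern: "*[" + a non-negative number + optional "%" + "]".
CHANCE_OPERATOR = re.compile(r"^\*\[\s*(\d+(?:\.\d+)?)\s*(%?)\s*\]$")


def parse_chance_operator(token):
    """Return (value, kind), where kind is "percent" or "weight"."""
    match = CHANCE_OPERATOR.match(token)
    if match is None:
        raise ValueError(f"not a chance operator: {token!r}")
    value = float(match.group(1))
    kind = "percent" if match.group(2) else "weight"
    # Assumption: a percentage only makes sense in the range (0, 100].
    if kind == "percent" and not 0 < value <= 100:
        raise ValueError(f"percentage out of range: {token!r}")
    return value, kind


print(parse_chance_operator("*[2]"))    # (2.0, 'weight')
print(parse_chance_operator("*[20%]"))  # (20.0, 'percent')
```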

Following that idea, then regular distribution would behave like:

%[intent]("distribution": "regular")   // Max Count  | Weight |  Prob
    ~[alias1]                          //     100        100      10%
    ~[alias2] ~[alias3]                //     500        500      50%
    ~[alias4]                          //     400        400      40%
// NOTE: operator with '%' defines the actual probability
%[intent]("distribution": "regular")    // Max Count  | Weight/Prob
    *[20%] ~[alias1]                    //   100            20%
    ~[alias2] ~[alias3]                 //   500            44.4444% // (500*80/900)
    ~[alias4]                           //   400            35.5556% // (400*80/900)
// NOTE: an operator without '%' just multiplies the max count to get the weight
%[intent]("distribution": "regular")  // Max Count  |  Weight  |  Prob
    *[2] ~[alias1]                    //     100         200       18.1818%
    ~[alias2] ~[alias3]               //     500         500       45.4545%
    ~[alias4]                         //     400         400       36.3636%

And for even:

%[intent]("distribution": "even")       // Max Count  | Weight |  Prob
    ~[alias1]                           //   100           1       33.3333%
    ~[alias2] ~[alias3]                 //   500           1       33.3333%
    ~[alias4]                           //   400           1       33.3333%
%[intent]("distribution": "even")     // Max Count  | Weight | Prob
    *[2] ~[alias1]                    //   100          2       50%
    ~[alias2] ~[alias3]               //   500          1       25%
    ~[alias4]                         //   400          1       25%
%[intent2]("distribution": "even")     // Max Count  | Weight/Prob
    *[20%] ~[alias1]                   //   100              20%
    ~[alias2] ~[alias3]                //   500              40%
    ~[alias4]                          //   400              40%

Let me know your thoughts on this. Also, for consistency, maybe it should be an input error if an entity defines one sentence with '%' and another sentence with a plain weight.
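The tables above can be reproduced with a short Python sketch. It is an illustration of the rules as stated in this comment, not Chatito's implementation: `sentence_probabilities` and its input format are hypothetical, the base weight is the max count for "regular" and 1 for "even", a plain operator multiplies the base weight, and a '%' operator fixes the probability while the remainder is split among the other sentences by base weight.

```python
def sentence_probabilities(sentences, distribution):
    """Return the probability (in percent) of each sentence of one entity.

    sentences: list of (max_count, operator) tuples, where operator is None,
    ("weight", w) or ("percent", p).
    distribution: "regular" or "even".
    """
    kinds = {op[0] for _, op in sentences if op is not None}
    if len(kinds) > 1:
        # Mixing '%' and plain weights in one entity is treated as an input error.
        raise ValueError("cannot mix '%' and plain weight operators in one entity")

    # Base weight: the max count for "regular", 1 for "even".
    base = [count if distribution == "regular" else 1 for count, _ in sentences]

    if kinds == {"percent"}:
        fixed = sum(op[1] for _, op in sentences if op is not None)
        if fixed > 100:
            raise ValueError("percentages add up to more than 100")
        rest = sum(b for b, (_, op) in zip(base, sentences) if op is None)
        # Sentences without an operator share the remaining percentage by base weight.
        return [
            op[1] if op is not None else b * (100 - fixed) / rest
            for b, (_, op) in zip(base, sentences)
        ]

    weights = [b * op[1] if op is not None else b for b, (_, op) in zip(base, sentences)]
    total = sum(weights)
    return [100 * w / total for w in weights]


# The "*[20%]" example with regular distribution: 20%, 500*80/900 and 400*80/900.
with_percent = [(100, ("percent", 20)), (500, None), (400, None)]
print(sentence_probabilities(with_percent, "regular"))

# The "*[2]" example with even distribution: weights 2, 1, 1.
with_weight = [(100, ("weight", 2)), (500, None), (400, None)]
print(sentence_probabilities(with_weight, "even"))
```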

rodrigopivi commented 5 years ago

I'm also considering that this may add complexity to the DSL that is not that useful. Providing only the even distribution and a weighted operator, instead of percentages, might give better datasets overall and cover the same needs. Maybe the only benefit of the current regular frequency distribution implementation is that it may be faster, because it won't produce as many duplicates.

nimf commented 5 years ago

> What do you think of considering the value as a relative weight if there is no '%' symbol, and percentage probability if it comes with %.

This is awesome! When I was reading the documentation for the probability operator I thought, "oh, maybe a percent sign at the end would make it clearer."

> Let me know your thoughts on this.

I really like this.

I think regular distribution is helpful in many cases, so it should be possible to set it via the distribution argument even when --defaultDistribution=even is used.

Regarding dropping support for the percentage probability operator: personally I like weighted probability more, but I can clearly imagine someone wanting "this sentence to fill 30% of all examples, and I don't care about the other 10 sentences."

rodrigopivi commented 5 years ago

Agreed, keeping both strategies then. I just created a dev branch, hoping to continue this implementation there. On that branch I've updated the spec to reflect these new features. Please let me know your thoughts on this so we can coordinate the implementation, as I'm hoping to help with it too. Thanks @nimf.

nimf commented 5 years ago

I just read the updated spec.md. It looks really good! So, here is what I think we will need:

  1. Add parsing of "%" inside the probability operator.
  2. Allow alias definitions to have entity arguments.
  3. Implement the defaultDistribution cli argument.
  4. Update the calculation of the weights considering the distribution entity argument (if set) and defaultDistribution configuration.
  5. Expose defaultDistribution to the web editor

I feel like I can do 3 and 4. But I'm open to any suggestions.
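For step 4, the precedence between the two settings could look like the following Python sketch. It assumes, as discussed earlier in the thread, that an entity's own "distribution" argument wins over the CLI default. `resolve_distribution` is a hypothetical helper, and the fallback of "regular" mirrors the default named above.

```python
VALID_DISTRIBUTIONS = ("regular", "even")


def resolve_distribution(entity_args, default_distribution="regular"):
    """Pick the distribution for one entity.

    entity_args: dict of the entity's arguments, e.g. {"distribution": "even"}.
    default_distribution: value of the (assumed) --defaultDistribution setting.
    """
    chosen = entity_args.get("distribution", default_distribution)
    if chosen not in VALID_DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {chosen!r}")
    return chosen


# The entity argument overrides the default; entities without it inherit the default.
print(resolve_distribution({"distribution": "regular"}, default_distribution="even"))
print(resolve_distribution({}, default_distribution="even"))
```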

rodrigopivi commented 5 years ago

Hi @nimf ,

1 and 2 are done on the dev branch. I hope you can rebase your PR onto the new changes and continue with 3 and 4. Thanks for your help and collaboration.

nimf commented 5 years ago

Awesome! I'll do a rebase and continue to work on 3 and 4 in that branch.

rodrigopivi commented 5 years ago

Published 2.3.0. It was great sharing the work on this Yuri, thanks.