Closed nimf closed 5 years ago
Hi @nimf,
Thanks for the feedback. This tool is meant to help the people who use it, and all improvement ideas should be considered.
Being able to switch the sentence generation from the default "regular frequency" distribution to an "even" distribution is a great idea. This setting could be declared in the CLI params or the IDE config before generation (e.g. --defaultDistribution=even or --defaultDistribution=regular), at the DSL entity arguments level (e.g. %[intent]("distribution": "even")), or at both levels, CLI and DSL, to have full control over each entity.
Regarding the probability operator: if the limit of 100 on the sum of all probabilities is removed and float values are accepted, then the weighted chances would just behave as documented in the ChanceJs lib (https://chancejs.com/miscellaneous/weighted.html). I think that would behave as you described.
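As a rough illustration of that weighted behaviour (written in Python rather than with ChanceJs, and not taken from Chatito's code), here is a minimal sketch where weights are arbitrary positive floats with no requirement to sum to 100. The names `normalize` and `pick_sentence` are hypothetical.

```python
import random


def normalize(weights):
    """Turn arbitrary positive weights into probabilities that sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


def pick_sentence(sentences, weights, rng=random):
    """Pick one sentence with a chance proportional to its weight."""
    # random.choices takes relative weights; they do not need to sum to 100.
    return rng.choices(sentences, weights=weights, k=1)[0]


sentences = ["~[alias1]", "~[alias2] ~[alias3]"]
weights = [2.5, 1.0]  # floats are accepted, and the sum is not 100
probabilities = normalize(weights)
```

Only the ratio between weights matters here, so `[2.5, 1.0]` and `[5, 2]` describe the same distribution.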
Yes, these changes would be valuable. You are most welcome to open a PR with these ideas implemented.
--defaultDistribution looks really good!
Regarding the probability operator, yeah, that would be exactly as described. My only concern is should we keep the percentage probability for regular distribution? Or should we also provide some argument to control that?
// As weights with even distribution
%[intent]("distribution": "even")        // Weight | Resulting percentage
    *[2] ~[alias1]                       // 2      | 66.66%
    ~[alias2] ~[alias3]                  // 1      | 33.33%

// As percents with regular distribution
%[intent2]("distribution": "regular")    // Resulting percentage
    *[66] ~[alias1]                      // 66%
    ~[alias2] ~[alias3]                  // 34%

// As weights with regular distribution
%[intent2]("distribution": "regular")    // Max Count | Resulting Weight | Resulting percentage
    *[2] ~[alias1]                       // 100       | 200              | 28.57%
    ~[alias2] ~[alias3]                  // 500       | 500              | 71.43%
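The weight-based cases above can be checked with a small calculation. This is only a sketch under the assumption that the resulting weight is the operator value multiplied by the sentence's max count (or by 1 for even distribution); `weighted_shares` is a hypothetical helper, not a Chatito function.

```python
def weighted_shares(operator_weights, max_counts=None):
    """Return each sentence's share in percent.

    operator_weights: the *[n] value of each sentence (1 when omitted).
    max_counts: max number of variations per sentence; None models the
    "even" distribution, where every sentence starts from a base of 1.
    """
    if max_counts is None:
        max_counts = [1] * len(operator_weights)
    # Resulting weight = operator value * base weight of the sentence.
    resulting = [w * c for w, c in zip(operator_weights, max_counts)]
    total = sum(resulting)
    return [100 * r / total for r in resulting]


even = weighted_shares([2, 1])
regular = weighted_shares([2, 1], max_counts=[100, 500])
```

Up to rounding, `even` and `regular` match the percentages in the comments of the first and third examples.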
Good catch, relative weights and percentage probabilities are different things. So maybe changing the name to 'chance operator' might be better than 'probability operator' since the idea is to control the relative weights or the percentage probability.
What do you think of considering the value as a relative weight if there is no '%' symbol, and percentage probability if it comes with %?
Following that idea, then regular distribution would behave like:
%[intent]("distribution": "regular")     // Max Count | Weight | Prob
    ~[alias1]                            // 100       | 100    | 10%
    ~[alias2] ~[alias3]                  // 500       | 500    | 50%
    ~[alias4]                            // 400       | 400    | 40%

// NOTE: operator with '%' defines the actual probability
%[intent]("distribution": "regular")     // Max Count | Weight/Prob
    *[20%] ~[alias1]                     // 100       | 20%
    ~[alias2] ~[alias3]                  // 500       | 44.4444% (500*80/900)
    ~[alias4]                            // 400       | 35.5556% (400*80/900)

// NOTE: operator without '%' just multiplies the max count to get the weight
%[intent]("distribution": "regular")     // Max Count | Weight | Prob
    *[2] ~[alias1]                       // 100       | 200    | 18.1818%
    ~[alias2] ~[alias3]                  // 500       | 500    | 45.4545%
    ~[alias4]                            // 400       | 400    | 36.3636%
And for even:
%[intent]("distribution": "even")        // Max Count | Weight | Prob
    ~[alias1]                            // 100       | 1      | 33.3333%
    ~[alias2] ~[alias3]                  // 500       | 1      | 33.3333%
    ~[alias4]                            // 400       | 1      | 33.3333%

%[intent]("distribution": "even")        // Max Count | Weight | Prob
    *[2] ~[alias1]                       // 100       | 2      | 50%
    ~[alias2] ~[alias3]                  // 500       | 1      | 25%
    ~[alias4]                            // 400       | 1      | 25%

%[intent2]("distribution": "even")       // Max Count | Weight/Prob
    *[20%] ~[alias1]                     // 100       | 20%
    ~[alias2] ~[alias3]                  // 500       | 40%
    ~[alias4]                            // 400       | 40%
Let me know your thoughts on this. Also, it may be worth raising an input error if an entity defines one sentence with '%' and another sentence without it, for consistency.
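A consistency check along those lines could look like the sketch below. The regular expression is a simplified assumption about how the operator is written at the start of a sentence, not Chatito's actual grammar, and `check_chance_operators` is a hypothetical name.

```python
import re

# Simplified pattern for a leading *[n] or *[n%] operator (assumption).
CHANCE_OPERATOR = re.compile(r"^\*\[(\d+(?:\.\d+)?)(%?)\]")


def check_chance_operators(sentences):
    """Raise ValueError if an entity mixes *[n%] and *[n] sentences.

    Returns "percent", "weight", or None when no sentence uses the operator.
    """
    kinds = set()
    for sentence in sentences:
        match = CHANCE_OPERATOR.match(sentence.strip())
        if match:
            kinds.add("percent" if match.group(2) else "weight")
    if len(kinds) > 1:
        raise ValueError("entity mixes percentage and weight operators")
    return kinds.pop() if kinds else None
```

Sentences without any operator are allowed next to either kind; only mixing the two operator styles inside one entity is rejected.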
Another consideration: maybe this adds complexity to the DSL that is not that useful, and providing only the even distribution and a weighted operator instead of percentages gives better datasets overall and covers the same needs. Maybe the only benefit of the current regular frequency distribution implementation is that it may be faster because it won't produce that many duplicates.
> What do you think of considering the value as a relative weight if there is no '%' symbol, and percentage probability if it comes with %.
This is awesome! When I was reading the documentation for the probability operator I thought "oh, maybe a percent sign at the end would make it more clear".
> Let me know your thoughts on this.
I really like this.
I think regular distribution is helpful in many cases, so we can set it via the distribution argument even when --defaultDistribution=even.
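In other words, the entity-level argument would take precedence over the global default. A minimal sketch of that lookup, assuming the entity arguments are available as a dictionary (`resolve_distribution` is a hypothetical name, not part of Chatito):

```python
def resolve_distribution(entity_args, default_distribution="regular"):
    """Entity-level "distribution" wins; otherwise use the global default."""
    allowed = {"regular", "even"}
    chosen = entity_args.get("distribution", default_distribution)
    if chosen not in allowed:
        raise ValueError(f"unknown distribution: {chosen}")
    return chosen


# An entity asks for regular while the run uses --defaultDistribution=even.
per_entity = resolve_distribution({"distribution": "regular"}, "even")
fallback = resolve_distribution({}, "even")
```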
Regarding dropping support for the percentage probability operator: personally I like weighted probability more, but I can clearly imagine someone wanting "this sentence to fill 30% of all examples and I don't care about the other 10 sentences".
Agreed, keeping both strategies then. I just created a dev branch, hoping to continue this implementation there. I've updated the spec on that branch to reflect these new features. Please let me know your thoughts on this, so we can coordinate the implementation, as I'm hoping to help with it too. Thanks @nimf.
I just read the updated spec.md. It looks really good!
So, here is what I think we will need:
- defaultDistribution cli argument.
- distribution entity argument (if set) and defaultDistribution configuration.
- defaultDistribution to the web editor
I feel like I can do 3 and 4. But I'm open to any suggestions.
Hi @nimf,
1 and 2 are done on the dev branch. I hope you can rebase your PR to fit the new changes and continue with 3 and 4. Thanks for your help and collaboration.
Awesome! I'll do a rebase and continue to work on 3 and 4 in that branch.
Published 2.3.0. It was great sharing the work on this, Yuri. Thanks.
Hi. Thank you for Chatito!
Our team has been using Chatito pretty extensively recently, and we like it.
What I noticed is that it is hard for newcomers to get a sense of what chance each sentence has of appearing in the training data. E.g.:
Here we can't say what the distribution of the examples would be. Most likely ~[alias2] ~[alias3] will get a bigger share if the number of variations in the aliases is about the same. But if alias1 has many more variations than alias2 * alias3, then ~[alias1] will get a bigger share. So we have to look up the aliases and go down the nested alias tree to understand how many variations each sentence might get. This is rather error-prone and time-consuming.
So we started to use even distribution a lot and to specify probability with the probability operator where needed (BTW, it is not that easy to get the total of probabilities to 100; again we have to calculate it and re-adjust).
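To make the manual counting described above concrete, here is a rough sketch that walks the nested alias tree: an alias contributes the sum of the variations of its sentences, and a sentence contributes the product over the aliases it references. The alias contents are made up for illustration, the parsing ignores optional aliases, slots, and cyclic references, and none of the names come from Chatito.

```python
from math import prod
import re

# Matches alias references such as ~[alias1] (simplified assumption).
ALIAS_REF = re.compile(r"~\[([^\]]+)\]")


def count_alias(name, aliases):
    """Variations of an alias: the sum over its sentences."""
    return sum(count_sentence(s, aliases) for s in aliases[name])


def count_sentence(sentence, aliases):
    """Variations of a sentence: the product over referenced aliases."""
    # A sentence with no alias references counts as exactly one variation.
    return prod(count_alias(n, aliases) for n in ALIAS_REF.findall(sentence))


aliases = {  # hypothetical example data
    "alias1": ["hi", "hello", "hey ~[name]"],
    "alias2": ["please", "kindly"],
    "alias3": ["help", "assist"],
    "name": ["bot", "assistant", "chatito"],
}

first = count_sentence("~[alias1]", aliases)
second = count_sentence("~[alias2] ~[alias3]", aliases)
```

In this made-up data the nested `name` alias is what pushes `~[alias1]` ahead of `~[alias2] ~[alias3]`, which is hard to see without going down the tree.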
It helped. But I was thinking, how can we improve that? What if we create some flag for the generation command (e.g. --probability=weighted)? If this flag is set, all the sentences will get the same weight of 1, which can be modified with the probability operator. E.g.:
I suppose the weighted probability might be even easier to grok because *[2] means you want the number of examples of this kind to be doubled. So with the "weighted probability" we won't have to set even distribution everywhere and it's easier to modify weights.
What do you think about it? Could this be a valuable addition to Chatito? I'd like to work on a PR for that.