rDany / synthetic-data

Synthetic Data repository and compiler
https://rdany.org
GNU General Public License v3.0

Programmable dataset syntax #1

Open EibrielInv opened 7 years ago

EibrielInv commented 7 years ago

By programmable dataset I mean a conversational dataset augmented with labels and modifiers.

For example, the following code with the modifier [|]:

# Example
H:[¿]qué cosa no [sabes|sabés]?
R:¡hay tantas cosas que no se!
R:no se muchas cosas.

generates the following outputs:

Human:

qué cosa no sabes?
qué cosa no sabes?
qué cosa no sabés?
qué cosa no sabés?
¿qué cosa no sabes?
¿qué cosa no sabes?
¿qué cosa no sabés?
¿qué cosa no sabés?

Robot:

¡hay tantas cosas que no se!
no se muchas cosas.
¡hay tantas cosas que no se!
no se muchas cosas.
¡hay tantas cosas que no se!
no se muchas cosas.
¡hay tantas cosas que no se!
no se muchas cosas.
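
A minimal sketch of how the expansion above could be computed. It assumes that a bracket group with a single option, like [¿], is optional, and that a group with several options separated by | picks exactly one. The names `expand` and `compile_example` are hypothetical, not functions from the repository.

```python
import itertools
import re

# A bracket group such as [¿] or [sabes|sabés]; nesting is not handled.
GROUP = re.compile(r"\[([^\[\]]*)\]")


def expand(line):
    """Return every variant of `line` produced by its [...] modifiers."""
    # re.split with one capturing group alternates literal text (even
    # indexes) and group bodies (odd indexes).
    parts = GROUP.split(line)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])
            continue
        options = part.split("|")
        if len(options) == 1:
            # Assumption: a single option means "absent or present".
            options = [""] + options
        choices.append(options)
    return ["".join(combo) for combo in itertools.product(*choices)]


def compile_example(block):
    """Pair every human variant with every robot variant of one example."""
    humans, robots = [], []
    for raw in block.splitlines():
        line = raw.strip()
        if line.startswith("H:"):
            humans.extend(expand(line[2:]))
        elif line.startswith("R:"):
            robots.extend(expand(line[2:]))
    return [(human, robot) for human in humans for robot in robots]


EXAMPLE = "\n".join([
    "# Example",
    "H:[¿]qué cosa no [sabes|sabés]?",
    "R:¡hay tantas cosas que no se!",
    "R:no se muchas cosas.",
])

for human, robot in compile_example(EXAMPLE):
    print(human, "->", robot)
```

With the example from this comment it yields four human variants and eight human/robot pairs, in the same order as the lists above.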

Scripting

First draft for a two-step seq2seq:

H:_GREETING Hello
R:_SAD Hi. BECOUSE:_GREETING
R:_NORMAL Hi, how are you? BECOUSE:_GREETING
R:_HAPPY Hi there! :D BECOUSE:_GREETING

H:_BORED I'm bored
R:What do you like to do in your free time? BECOUSE:_BORED

H:_PROGRAM_TIME What time is it?
R:Is one o'clock BECOUSE:_PROGRAM_TIME 01:00
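
A rough sketch of how these scripted lines could be parsed. The reading used here is an assumption: an optional `_LABEL` right after `H:` or `R:` tags the message (or the robot's state), `BECOUSE:_LABEL` names the human label that triggers the response, and anything after it is an argument. `parse_line` and `responses_by_cause` are hypothetical names.

```python
import re

# Speaker, optional _LABEL, then the rest of the line.
LINE = re.compile(r"^(?P<who>[HR]):(?:(?P<label>_[A-Z_]+)\s+)?(?P<text>.*)$")
# Trailing "BECOUSE:_LABEL" with an optional argument (spelling as in the draft).
CAUSE = re.compile(r"\s*BECOUSE:(?P<cause>_[A-Z_]+)(?:\s+(?P<arg>.*))?$")


def parse_line(line):
    """Split one scripted line into speaker, label, text, cause and argument."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    entry = {
        "who": match["who"],
        "label": match["label"],
        "text": match["text"],
        "cause": None,
        "arg": None,
    }
    cause = CAUSE.search(entry["text"])
    if cause:
        entry["text"] = entry["text"][:cause.start()]
        entry["cause"] = cause["cause"]
        entry["arg"] = cause["arg"]
    return entry


def responses_by_cause(lines):
    """Second step: map each human label to its candidate robot responses."""
    table = {}
    for line in lines:
        entry = parse_line(line)
        if entry and entry["who"] == "R" and entry["cause"]:
            table.setdefault(entry["cause"], []).append(
                (entry["label"], entry["text"])
            )
    return table


SCRIPT = [
    "H:_GREETING Hello",
    "R:_SAD Hi. BECOUSE:_GREETING",
    "R:_NORMAL Hi, how are you? BECOUSE:_GREETING",
    "R:_HAPPY Hi there! :D BECOUSE:_GREETING",
    "H:_PROGRAM_TIME What time is it?",
    "R:Is one o'clock BECOUSE:_PROGRAM_TIME 01:00",
]

print(responses_by_cause(SCRIPT))
```

The first step (human text to label) would come from the `H:` lines, and the table built here would serve the second step (label plus robot state to response).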

EibrielInv commented 7 years ago

Some ideas

Tags:

- Error correction and normalization
- Language mix
- Global values
- Sentiment
- Mark for retrieval

Example

# Example 
language:en

H:My name is {name}Eibriel{/name}
R:Nice to meet you {name/}

H:How {typo}r{}are{/typo} {typo}u{}you{/typo}
R:I am fine

H:Where are you {typo}form{}from{/typo}?
R:I am from planet Earth

H:What {typo}{}are{/typo} you doing?
R:I am talking with you

H:What is the meaning of {language:es}Hasta luego{/language}
R:Means "See you later"

H:I'm sad {sadness:0.9}
R:{retrieval}Every cloud has a silver lining{/retrieval}
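
As an illustration of the error-correction tag, here is a small sketch that turns one tagged line into its noisy and its normalized version. It assumes the form `{typo}wrong{}right{/typo}` used in the example above; `typo_variants` is a hypothetical name.

```python
import re

# {typo}wrong{}right{/typo}: the text before {} is the typo, the text after
# it is the correction. Either side may be empty.
TYPO = re.compile(r"\{typo\}(.*?)\{\}(.*?)\{/typo\}")


def typo_variants(line):
    """Return (noisy, clean) versions of a line that contains typo tags."""
    noisy = TYPO.sub(lambda match: match.group(1), line)
    clean = TYPO.sub(lambda match: match.group(2), line)
    return noisy, clean


print(typo_variants("H:How {typo}r{}are{/typo} {typo}u{}you{/typo}"))
print(typo_variants("H:Where are you {typo}form{}from{/typo}?"))
```

The noisy version could be used as model input and the clean one as the normalization target, but that pairing is only one possible use.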

AlexDvorak commented 7 years ago

{language}word in different language than default{/language}

message indicating sadness{emotion such as: sadness, anger, happiness, concern:emotion level between 0 and 1}

{retrieval}motivational quote or something to cheer someone up{/retrieval}

{name}name of person or pet{/name}

EibrielInv commented 7 years ago

Exactly! :+1:

I'm also thinking of using indentation (similar to Python) to generate multiple branches in the conversation. Here is the example I posted on Slack:

H:Hi there
    R:How are you?
        H:I'm fine and you?
            R:My systems are working as expected {systems:ok}
        H:I'm sad {sadness:0.9}
            R:What happened?
    R:Long time not see you {days_away:2}
        H:True! Did you miss me?
    R:What do you want? {angry:0.7}
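
A sketch of how the indented syntax could be read into a tree and flattened into one conversation per branch. It assumes four spaces per indentation level, as in the example above; `parse_tree` and `conversations` are hypothetical names.

```python
def parse_tree(text, indent=4):
    """Build a tree of nodes from indentation (assumed: 4 spaces per level)."""
    root = {"line": None, "children": []}
    stack = [root]  # stack[d + 1] is the most recent node at depth d
    for raw in text.splitlines():
        if not raw.strip():
            continue
        depth = (len(raw) - len(raw.lstrip(" "))) // indent
        node = {"line": raw.strip(), "children": []}
        del stack[depth + 1:]  # drop deeper nodes left from earlier branches
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def conversations(node, prefix=()):
    """Yield every root-to-leaf path as a list of lines."""
    path = prefix if node["line"] is None else prefix + (node["line"],)
    if not node["children"]:
        if path:
            yield list(path)
        return
    for child in node["children"]:
        yield from conversations(child, path)


SAMPLE = "\n".join([
    "H:Hi there",
    "    R:How are you?",
    "        H:I'm fine and you?",
    "            R:My systems are working as expected {systems:ok}",
    "        H:I'm sad {sadness:0.9}",
    "            R:What happened?",
    "    R:Long time not see you {days_away:2}",
    "        H:True! Did you miss me?",
    "    R:What do you want? {angry:0.7}",
])

for conversation in conversations(parse_tree(SAMPLE)):
    print(conversation)
```

For the example above this produces four conversations, one per leaf.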

EibrielInv commented 7 years ago

Another question: how to add translations? One way could be:

H:Hi there
H:Buenas
    R:How are you?
    R:¿Cómo estás?
        H:I'm fine and you?
        H:Bien ¿Y tu?
            R:My systems are working as expected {systems:ok}
            R:Mis sistemas están funcionando según lo esperado {systems:ok}
        H:I'm sad {sadness:0.9}
        H:Estoy triste {sadness:0.9}
            R:What happened?
            R:¿Qué sucedió?
    R:Long time not see you {days_away:2}
    R:Hace mucho que no te veo {days_away:2}
        H:True! Did you miss me?
        H:Cierto! Me extrañaste?
    R:What do you want? {angry:0.7}
    R:¿Qué querés? {angry:0.7}

EibrielInv commented 7 years ago

The tag list and definitions are now in the following file in the code: https://github.com/rDany/synthetic-data/blob/master/synth/modules/data_compiler.py#L10. Any tag in the dataset that does not appear in the list will throw an error.
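
To illustrate the kind of check described here, a minimal sketch follows. It is not the code in data_compiler.py: the `ALLOWED_TAGS` set, the regular expression and the `check_tags` function are all assumptions made for the example.

```python
import re

# Hypothetical tag list; the real one lives in data_compiler.py.
ALLOWED_TAGS = {"name", "typo", "language", "retrieval", "sadness"}

# Matches {tag}, {/tag}, {tag/} and {tag:value}; the bare {} separator used
# inside {typo} has no name and is ignored.
TAG_NAME = re.compile(r"\{/?([A-Za-z_]+)[^{}]*\}")


def check_tags(line, allowed=ALLOWED_TAGS):
    """Raise ValueError on the first tag that is not in the allowed list."""
    for name in TAG_NAME.findall(line):
        if name not in allowed:
            raise ValueError(f"unknown tag {{{name}}} in line: {line!r}")


check_tags("H:My name is {name}Eibriel{/name}")
check_tags("H:What is the meaning of {language:es}Hasta luego{/language}")
try:
    check_tags("H:Hi {mood:1}")
except ValueError as error:
    print(error)
```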

EibrielInv commented 7 years ago

I have also been thinking about tokenization (dividing the dataset into small chunks, such as words). I propose that the model only recognize words that have been marked as tokens at least once.

For instance:

Every word/token only needs to be identified once in the entire dataset, which is why we don't keep marking the space as a token. Also, the space is a special token: it is muted, which is why it doesn't appear in the tokenized string array.

This is useful because every language has its own tokenization technique.

For instance, it is useful for emojis. Normally 😃😃😃 would be a single token if we separate tokens by spaces: ["😃😃😃"]. But by marking the emoji as a token in the following way: {token}😃{/token}😃😃 we get ["😃", "😃", "😃"], which is more useful.

On the other hand, we might want ' to be a token, so that he said 'hi' becomes ["he", "said", "'", "hi", "'"], but at the same time it could be useful to have I'm as a single token, so we can mark it as {token}I'm{/token}. Then something like I'm 'Dany' can be tokenized as ["I'm", "'", "Dany", "'"].
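
One way this proposal could work is sketched below: collect every string marked with {token} anywhere in the dataset, then tokenize by greedy longest match, skipping spaces. Several details are assumptions not stated in the thread: the longest-match rule, and emitting unknown characters one at a time. The function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\{token\}(.*?)\{/token\}")


def collect_vocab(lines):
    """Gather every string marked as a token at least once in the dataset."""
    vocab = set()
    for line in lines:
        # Empty marks are ignored so the tokenizer always advances.
        vocab.update(token for token in TOKEN.findall(line) if token)
    return vocab


def strip_marks(line):
    """Remove the {token} markup and keep the marked text."""
    return TOKEN.sub(lambda match: match.group(1), line)


def tokenize(text, vocab):
    """Greedy longest-match tokenization; spaces are muted (dropped)."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, position = [], 0
    while position < len(text):
        if text[position] == " ":
            position += 1
            continue
        # Assumption: a character that starts no known token is emitted alone.
        match = next(
            (token for token in by_length if text.startswith(token, position)),
            text[position],
        )
        tokens.append(match)
        position += len(match)
    return tokens


DATASET = [
    "{token}😃{/token}😃😃",
    "{token}I'm{/token} {token}'{/token}{token}Dany{/token}'",
]
vocab = collect_vocab(DATASET)
for line in DATASET:
    print(tokenize(strip_marks(line), vocab))
```

Note that plain substring matching would also split a known token out of a longer unknown word, so a real implementation would likely need word-boundary rules per language.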

EibrielInv commented 7 years ago

Rethinking the dataset for recurrent neural networks with memory. For each "robot" sentence we need to answer a question that assesses how correct the answer is. We could call that an "anchor".

H:Hello{greeting}
R:Hello!!{greeting}
H:My name is Richard
R:Ok!
H:What is my name?{question}
R:Your name is {data}Richard{/data}{answer}
H:I can't sleep{problem}
R:Counting sheep may help{solution}
H:Please speak in english{request}
R:Ok! I will speak in english{acknowledgment}What language will you speak?{/acknowledgment}
H:Could you please tell me how your programming works. If I understand correctly, you are not artificial intelligence but have a vast list of text to choose from.{question}
R:I have an artificial brain, with a neural network. I can generate text word by word.{answer}

EibrielInv commented 7 years ago

This could be reduced to just "Human message" -> "Questions about the answer that should be answered with Yes"

H:Hello
- Is it a greeting?

H:I can't sleep
- Is it a solution?
- What is the solution for? Is it about insomnia?

H:Please speak in english
- Is it an acknowledgment?
- What language will you speak? Is that language English?

H:Could you please tell me how your programming works. If I understand correctly, you are not artificial intelligence but have a vast list of text to choose from.
- Do you have an artificial brain?
- Does your brain have a neural network?
- Can you generate text word by word?
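
A small sketch of how this reduced format could be loaded: each human message maps to the list of questions that a good robot answer should satisfy with "yes". The layout (an `H:` line followed by `- ` lines) is taken from the example above; `parse_anchors` is a hypothetical name.

```python
def parse_anchors(text):
    """Map each human message to its list of yes/no anchor questions."""
    anchors = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("H:"):
            current = line[2:]
            anchors[current] = []
        elif line.startswith("- ") and current is not None:
            anchors[current].append(line[2:])
    return anchors


SAMPLE = "\n".join([
    "H:Hello",
    "- Is a greeting?",
    "",
    "H:I can't sleep",
    "- Is a solution?",
    "- What is the solution for? Is about insomnia?",
])

for message, questions in parse_anchors(SAMPLE).items():
    print(message, "->", questions)
```

How the questions would then be scored against a generated answer is left open here, as it is in the thread.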