Enforce/fixate the syntax/names of primitive types

DavyLandman commented 4 years ago

As we discussed in Athens, we should fix the types a user can pick from during the creation of the ML model.

In the meeting we ended up with this set of primitive types: (rascal syntax for bnf like grammars)

  | "int" // 32bit 
  | "bigint"  // 64bit
  | "string" "(" Int maxSize ")" // "varchar"
  | "text" // unbounded blob
  | "point" // lat & lon point on sphere
  | "polygon" // an ready of multiple points
  | "bool"
  | "float" // IEEE float
  | "blob" // binary unbounded blob
  | "freetext" "[" {Id ","}+ nlpFeature "]" // field with certain features, 
  | "date"
  | "datetime"

jdirocco commented 4 years ago

Hi @DavyLandman It is a first attempt to fix this issue. The following picture shows an example of typhonML model that has been defined by the textual editor: sample syntax

The possible PrimitiveTypes are:

enum PrimitiveDataType returns PrimitiveDataType:
                int = 'int' | 
                bigint = 'bigint' | 
                string = 'string' | 
                text = 'text'|
                point = "point" |
                polygon = "polygon"|
                bool = "bool"|
                float = "float"|
                blob = "blob"|
                date = "date" |
                datetime = "datetime";

The list of NLP task types are:

ParagraphSegmentation
SentenceSegmentation
Tokenisation
PhraseExtractor
NGramExtractor
POSTagging
Lemmatisation
Stemming
DependencyParsing
Chunking
SentimentAnalysis
TextClassification
TopicModelling
TermExtraction
NamedEntityRecognition
RelationExtraction
EventExtraction
CoreferenceResolution

@Danny2097 can you confirm?

I'll apply the graphical editor changes and merge issue35branch to the master one once you confirm that the changes match your needs.

DavyLandman commented 4 years ago

Looks good 👍

The only thing I'm missing is the size of the string field, or is that a separate parameter? (string is like varchar of sql)

Danny2097 commented 4 years ago

@md2manoppello It is my understanding that the list of NlpTasks are from an old version of the Type system developed by a previous researcher. Since joining the project myself and @DrRaja have modified the type system as the old version has some extensibility issues. Many of the tasks still exist but others have been added, refactored or removed. I will post an issue regarding the updated type system.

I don't know how much the modification to the type system will impact ML or QL but I apologise in advance.

DavyLandman commented 4 years ago

for QL it should be okay, as long as we can freeze it soon.

Danny2097 commented 4 years ago

@md2manoppello @DavyLandman

I have added the revised NlpTaskType following a discussion with @DrRaja. They can be found in this Comment https://github.com/typhon-project/typhonml/issues/37#issuecomment-599497608.

Bitico commented 4 years ago

Hello to all. We are refactoring metamodel and language, trying to respect all your considerations/requests. Could this be a solution presented in these figures?

Specifically, note the integer value (optional) in the definition of the strings (line 3 and 12) and the management of the freetext type (line 5).

In this link you can find our tml example: https://www.dropbox.com/s/n4ukdk7c0of7w6c/TmlNewExample.tml?dl=0

Thx for your feedback.

Danny2097 commented 4 years ago

I assuming eng_pol is a workflow name., from an NLP perspective it looks good 👍 .

Bitico commented 4 years ago

@Danny2097 Yes it's the workflow name.

Danny2097 commented 4 years ago

I think it would not make sense to implement. As I would consider the output of a NLP workflow to enrich the field it is attributed too. Using your example the description field would be populated by the output of [eng_pol, CoreferenceResolution]. If a user required the same text to be processed by another workflow then I think they should specify that as an additional field int the ML model.

entity Product {
    id : int
    name : string(32)
    price : float
    description : freetext[eng_pol, CoreferenceResolution]
    sentiment :  freetext[eng_pol_sa, SentimentAnalysis]
 }
...

But you are right by defining freetext using an array users could create models like you proposed. The only other way I can think of doing this would be to use a tuple for freetext.

entity Product {
    id : int
    name : string(32)
    price : float
    description : freetext(eng_pol, CoreferenceResolution)
        sentiment :  freetext(eng_pol_sa, SentimentAnalysis)
 }
...

This would prevent modelers from incorrectly specifying several workflows or NLP tasks and limit them to only specifying a workflowName and NlpTaskType when using the freetext type. I don't know how possible this is or if it is feasible, it is merely a suggestion.

What do you think @DavyLandman, @Bitico and @md2manoppello ?

DavyLandman commented 4 years ago

shouldn't the workflow name be connected to the kind of nlp task?

so for the same freetext field you can send freetext[Sentiment[eng_pol_sa], CoreferenceResolution[eng_pol]] ?

else you can only do a single kind of nlp action per attribute?

Danny2097 commented 4 years ago

That would make more sense.

Bitico commented 4 years ago

Could this be the solution?

Here the complete example: https://www.dropbox.com/s/ifvgdzwmm9dfy3s/TmlNewExample.tml?dl=0

DavyLandman commented 4 years ago

looks good 👍

How about giving string square brackets as well?

Bitico commented 4 years ago

@DavyLandman Sure! no problem for square brackets on string parameter.

DavyLandman commented 4 years ago

@DavyLandman Sure! no problem for square brackets on string parameter.

thanks, just saw that it was in the original request with round brackets, my bad :)

typhon-project / typhonml

Enforce/fixate the syntax/names of primitive types #35