[Discussion] Remove some global variables in attribute.go

sneaxiy commented 4 years ago

This issue further addresses: https://github.com/sql-machine-learning/playground/issues/42#issuecomment-645666536

Where are the global variables

Exported global map which stores the parameter docs of models, including https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/attribute.go#L219-L226 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/model_parameters.go#L18-L20 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/model_parameters.go#L227-L229 https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/xgboost_objective_params.go#L18-L20
Exported parameter type definitions, including https://github.com/sql-machine-learning/sqlflow/blob/befcd74558f47a4416e3b7d6fd3130645cf5b813/pkg/attribute/attribute.go#L32-L45

How to remove the exported global map which stores the parameter docs of models

Some of these global variables share the same information. For example:

PremadeModelParamsDocs is deserialized from ModelParameterJSON. So we can remove ModelParameterJSON. Furthermore, ModelParameterJSON is a variable which can be auto generated by python extract_docstring.py > model_parameters.go command. We can do this auto generation using go generate to generate PremadeModelParamsDocs automatically too.
OptimizerParameterDocs is deserialized from OptimizerParameterJSON. So we can remove OptimizerParameterJSON. Furthermore, OptimizerParameterJSON can also be auto generated by python extract_docstring.py > model_parameters.go using go generate.
XGBoostObjectiveDocs is deserialized from XGBoostObjectiveJSON . So we can remove XGBoostObjectiveJSON and keep XGBoostObjectiveDocs.

Note that PremadeModelParamsDocs, OptimizerParameterDocs and XGBoostObjectiveDocs are also used in cli prompt suggestion (see 1, 2, 3). So we cannot remove these 3 variables or hide them.

How to remove exported parameter type definitions

https://github.com/sql-machine-learning/playground/issues/42#issue-640787739 suggests to enhance compile time data checking using the following ways.

var distributedTrainingAttributes = attribute.Dictionary{}.
    Int("train.num_ps", 0, "", nil).
    Int("train.num_workers", 1, "", nil).
    Int("train.worker_cpu", 400, "", nil)

In this way, attribute.Description, attribute.Int, attribute.Float, etc can be hidden. The signature of Dictionary.Int would be:

func (d Dictionary) Int(name string, value int, doc string, checker func(int) error)

The only concern of this method is that we cannot support nil default value. Some of the models may have attributes with nil default values. For example, the default value of num_class in XGBoost model is nil(see here), because only multi-class (>2) classification models need num_class while the other models do not need num_class. And once num_class is provided in SQL WITH statement, it must be an integer number, so the data type of num_class attribute should be int. The meaning of nil default value in SQLFlow is:

If the attribute is provided in WITH statement, SQLFlow would check whether its type is right and call Description.Checker() to check whether it is valid. For example, if num_class is provided in SQL WITH statement, SQLFlow would first check whether the value of num_class is an integer, and call Description.Checker() to check whether it is a positive number.
If the attribute is not provided in WITH statement, nothing would be checked.

We can enhance this method to support nil default value.

Add Default method to set default value. The code is something like: https://github.com/sql-machine-learning/playground/issues/42#issuecomment-645853937 .
```
NewDictionary().Int("num_class", "Number of classes").Default(5)
```
Since the Default method can be used both after Int(...) and Float(...), the input parameter type of Default must be interface{}. Therefore, the signature of Default method should be
```
func (d Dictionary) Default(interface{}) Dictionary
```
In this way, we cannot check whether the default value is of the right type in compile time.
Add both Dictionary.Int and Dictionary.IntOrNil method. The signature of Dictionary.IntOrNil method is like:
```
func (d Dictionary) IntOrNil(name string, doc string, checker func(int) error).
```
In this way, we would double the APIs to Dictionary.
Use optional package to make Dictionary.Int method accepts both int default value and nil default value. The signature of Dictionary.Int would be
```
func (d Dictionary) Int(name string, defaultValue optional.Int, 
                    doc string, checker func(int) error) {
...
}
```

var dict = Dictionary{}. Int("num_class", optional.Int{}, "doc1", checker1). // default value is nil Int("attr_with_default_value", optional.NewInt(0), "doc2", checker2) // default value is 0

wangkuiyi commented 4 years ago

@sneaxiy This issue mixed up two topics -- (1) removing global variables and (2) redesign the API. Let us keep (2) at https://github.com/sql-machine-learning/playground/issues/42 and make this issue focusing on (1).

For (1), I don't see a conclusion -- can we move some of the global variables or not? and why?

I guess the lack of a conclusion is (partly) due to an incomplete overview of the source code. For example, ModelParameterJSON contains attribute names and explanations. attribute.Dictionary also contains such information. Why do we need redundant form of the same piece of information in our code? And, what is the relationship between extract_docstring.py, the Go global variables, and cmd/docgen.go?

sneaxiy commented 4 years ago

@wangkuiyi The conclusion is that we should:

keep PremadeModelParamsDocs, OptimizerParameterDocs and XGBoostObjectiveDocs.
remove ModelParameterJSON, OptimizerParameterJSON and XGBoostObjectiveJSON.
(optional) after we have removed *JSON objects, we can generate PremadeModelParamsDocs and OptimizerParameterDocs automatically instead of hard coding in *.go files.

The reasons are as follows:

Why should we keep PremadeModelParamsDocs, OptimizerParameterDocs and XGBoostObjectiveDocs? These attribute.Dictionary objects are not only used in attribute package but also cli prompt suggestion (see 1, 2, 3). So we cannot remove these attribute.Dictionary objects.
Why should we remove ModelParameterJSON, OptimizerParameterJSON and XGBoostObjectiveJSON? As you have mentioned, these JSON strings would be only used to be deserialized to be attribute.Dictionary (PremadeModelParamsDocs, OptimizerParameterDocs and XGBoostObjectiveDocs) inside attribute package. xxxDocs >= xxxJSON, because xxxJSON only contains docs of TensorFlow and XGBoost models, but xxxDocs may contain models from TensorFlow, XGBoost and sqlflow_models. We do not need redundant information.
[Optional] Why should we generate PremadeModelParamsDocs and OptimizerParameterDocs automatically? As discussed above, PremadeModelParamsDocs and OptimizerParameterDocs are deserialized from ModelParameterJSON and OptimizerParameterJSON. ModelParameterJSON and OptimizerParameterJSON are the doc strings of TensorFlow and XGBoost model APIs, which can be extracted by extract_docstring.py . extract_docstring.py would scan the __doc__ of constructor (or callable model method) of each models from Python source code automatically, and print them into a file (we have printed them into model_parameters.go beforehand). Please see the snapshot of model_parameters.go. That is to say, the hard coding ModelParameterJSON and OptimizerParameterJSON (or say, the hard coding PremadeModelParamsDocs and OptimizerParameterDocs if we have removed ModelParameterJSON and OptimizerParameterJSON ) are also redundant codes because they can be generated automatically by extract_docstring.py.

https://github.com/sql-machine-learning/sqlflow/blob/56868d3cfe274519be8b045e76966f943b031dd0/pkg/attribute/model_parameters.go#L14-L20

sql-machine-learning / sqlflow