larray-project / larray

N-dimensional labelled arrays in Python
https://larray.readthedocs.io/
GNU General Public License v3.0

Make it easy to create a class that behaves like a Session #832

Closed alixdamman closed 3 years ago

alixdamman commented 4 years ago

The main idea of the present issue is to benefit both from the auto-completion of PyCharm (by defining user classes) and from all the (nice) features offered by the Session class.

This implies developing a mechanism to:

Another (big) advantage of using user-defined classes is that it significantly reduces the number of global variables and function parameters in the models that are developed in house.

alixdamman commented 4 years ago

@gdementen OK to "try" to implement this for the next release (0.33)? I said "try".

gdementen commented 4 years ago

I had in mind to do a quicker-than-usual release to fix the issues we found during the training (the matplotlib import in the viewer, small tutorial/doc changes and the problem of loading >2d data without axes names), probably as 0.32.1, and then move on to liam2. So I am basically OK with the 0.33 milestone, but that will depend on when you want to release it.

alixdamman commented 4 years ago

If we plan to do a 0.32.1 release, I think there is no reason to release 0.33 soon.

gdementen commented 4 years ago

Or... if you want to work on it before I am able to (probably somewhere in January), I can give you the experimental code I have so far and you do it. So far, I have always linked this "static session" thing to the lazy session feature (see #727), but maybe we could dissociate the two. In that case it's pretty easy. The lazy part is a lot more tricky (when I say tricky, it's not necessarily hard to implement, but rather hard to figure out which behaviour is best for users).

alixdamman commented 4 years ago

Related to issue #727

alixdamman commented 4 years ago

See comment https://github.com/larray-project/larray/pull/840#issuecomment-567452278 for an example of how the StaticSession class should be used.

gdementen commented 4 years ago

This is at least how I saw things. There are probably other options. I just don't think restricting values based on values defined at the class level is a good idea. It is just too weird in the Python world, while there are more usual alternatives: restrict on types defined at the class level, or restrict on values defined at the instance level (i.e. values defined in __init__).
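The "restrict on types defined at the class level" alternative can be sketched in plain Python, without pydantic (class and attribute names here are hypothetical, not larray API): the class declares types via annotations and __setattr__ checks them.

```python
# Plain-Python sketch of type restriction via class-level annotations.
# Names (TypedSession, Model) are illustrative only.
class TypedSession:
    def __setattr__(self, key, value):
        # look up the type declared in the subclass body, if any
        expected = self.__class__.__annotations__.get(key)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}, "
                            f"got {type(value).__name__}")
        object.__setattr__(self, key, value)

class Model(TypedSession):
    first_proj_year: int

m = Model()
m.first_proj_year = 2020        # OK: matches the declared type
try:
    m.first_proj_year = "2020"  # rejected: str is not int
except TypeError as e:
    print(e)
```

This is roughly the "validate on assignment" behavior that pydantic provides out of the box; the sketch only shows why the class-level declaration feels natural in Python.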

alixdamman commented 4 years ago

In the Demo model, they have a Parameters class in which they define all axes, groups and scalars of the model. The other classes contain only arrays but inherit from the Session class, mainly to benefit from its I/O methods. I tried to somewhat "reproduce" what is done in that model (consider all axes and groups defined in a class as frozen) plus to include the idea of ArrayDef, but I see now that implementing a class based on just one model is wrong.

The only thing that bothered me with the proposed "specifications" of the StaticSession is that all axes used in statements like array_var = ArrayDef(axes) must be defined before the definition of the class and they cannot depend on any variable with a value defined at runtime --> imagine a TIME axis defined as TIME = Axis(f'time={first_proj_year}..{last_proj_year}') where first_proj_year and last_proj_year are only defined at runtime.
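The limitation comes from Python itself: a class body is executed exactly once, at class-definition time, so any class-level attribute is fixed before runtime values exist. A small pydantic-free sketch (function and variable names hypothetical):

```python
# Demonstrates that class-level attributes are evaluated at class-definition
# time, not at instantiation time.
events = []

def make_axis_spec(first, last):
    events.append('evaluated')
    return f'time={first}..{last}'

# these values must already exist when the class body runs;
# in a real model they would only be known at runtime
FIRST_PROJ_YEAR, LAST_PROJ_YEAR = 2020, 2070

class Parameters:
    # evaluated immediately, while the class statement executes
    TIME_SPEC = make_axis_spec(FIRST_PROJ_YEAR, LAST_PROJ_YEAR)

print(Parameters.TIME_SPEC)  # time=2020..2070
print(events)                # ['evaluated'] -- ran at class creation
```

This is why axes depending on runtime values have to be moved to instance-level initialization (e.g. __init__ or a __post_init__ hook).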

I wonder if there is an easy way to also "freeze" the axes of Array variables defined in the __init__ method?

gdementen commented 4 years ago

We can discuss this when I get back in January, if you want.


alixdamman commented 4 years ago

I wonder if we couldn't try to use pydantic to fix this issue? It may help us make classes inheriting from ConstrainedSession more compatible with the broader Python ecosystem. FWIW, a PyCharm plugin exists for pydantic.

alixdamman commented 4 years ago

@gdementen I already have some proof of concept code for the implementation of the ConstrainedSession class using pydantic.BaseModel (see below):

import sys
import warnings

TO_SET_LATER = object()

try:
    import pydantic
except ImportError:
    pydantic = None

if pydantic and sys.version_info >= (3, 6):

    class ConstantAxes(ABCArray):
        def __init__(self, axes=None):
            if axes is not None and not isinstance(axes, AxisCollection):
                axes = AxisCollection(axes)
            self.axes = axes

    class ConstrainedSession(Session, pydantic.BaseModel):
        """
        Warnings
        --------
        The :py:meth:`ConstrainedSession.filter`, :py:meth:`ConstrainedSession.compact`
        and :py:meth:`ConstrainedSession.apply` methods return a plain Session object.
        The type of the declared variables (and the value of the declared constants) will
        no longer be checked.

        Examples
        --------

        >>> class ModelVariables(ConstrainedSession):
        ...     # declare variables with default initialization value.
        ...     FIRST_OBS_YEAR: int = 1991
        ...     LAST_PROJ_YEAR: int = 2070
        ...     AGE: Axis = Axis('age=0..120')
        ...     GENDER: Axis = Axis('gender=male,female')
        ...     G_CHILDREN: Group = AGE[:17]
        ...     G_ADULTS: Group = AGE[18:]
        ...     # declare variables without default value
        ...     # Such variables MUST be initialized at instantiation
        ...     # as argument to the constructor or in the __post_init__ method
        ...     variant_name: str
        ...     first_proj_year: int
        ...     obs_years: Axis
        ...     proj_years: Axis
        ...     population_obs: Array
        ...     population_proj: Array
        ...     # declare arrays with constant axes
        ...     mortality_rate: Array = ConstantAxes((AGE, GENDER))
        ...     # declare variables with a value to be defined after instantiation
        ...     # TO BE USED WITH EXTREME CAUTION
        ...     later: str = TO_SET_LATER
        ...
        ...     # method to initialize variables defined at runtime
        ...     def __post_init__(self):
        ...         self.obs_years = Axis(f'time={self.FIRST_OBS_YEAR}..{self.first_proj_year-1}')
        ...         self.proj_years = Axis(f'time={self.first_proj_year}..{self.LAST_PROJ_YEAR}')
        ...         self.population_obs = zeros((self.AGE, self.GENDER, self.obs_years))
        ...         self.population_proj = zeros((self.AGE, self.GENDER, self.proj_years))
        ...         self.mortality_rate = zeros((self.AGE, self.GENDER))

        >>> def run_model(variant_name, first_proj_year):
        ...     # create an instance of the ModelVariables class
        ...     m = ModelVariables(variant_name=variant_name, first_proj_year=first_proj_year)
        ...     m.later = 'later'
        ...     # set arrays
        ...     # ==== model ====
        ...     # some code here
        ...     # ...
        ...     # ==== output ====
        ...     # save all variables in an HDF5 file
        ...     m.save(f"{variant_name}.h5", display=True)

        Content of file 'main.py'

        >>> run_model('projection_2020_2070', first_proj_year=2020)
        dumping FIRST_OBS_YEAR ... done
        dumping LAST_PROJ_YEAR ... done
        dumping AGE ... done
        dumping GENDER ... done
        dumping G_CHILDREN ... done
        dumping G_ADULTS ... done
        dumping variant_name ... done
        dumping first_proj_year ... done
        dumping obs_years ... done
        dumping proj_years ... done
        dumping population_obs ... done
        dumping population_proj ... done
        dumping mortality_rate ... done
        """
        class Config:
            # whether to validate field defaults (default: False)
            validate_all = True
            # whether to perform validation on assignment to attributes
            validate_assignment = True
            # a dict used to override the default error message templates.
            # Pass in a dictionary with keys matching the error messages you want to override.
            error_msg_templates = {}

        def __init__(self, *args, **kwargs):
            meta = kwargs.pop('meta', Metadata())

            # whether to ignore, allow, or forbid extra attributes during model initialization (and after).
            # Accepts the string values of 'ignore', 'allow', or 'forbid',
            # or values of the Extra enum (default in pydantic: Extra.ignore)
            self.__config__.extra = kwargs.pop('extra', 'allow')
            # whether or not models are faux-immutable, i.e. whether __setattr__ is allowed.
            # (default in pydantic: True)
            self.__config__.allow_mutation = kwargs.pop('allow_mutation', True)

            # need to call Session.__init__() to not fall into an infinite loop later
            Session.__init__(self, meta=meta)

            if len(args) == 1:
                assert len(kwargs) == 0
                a0 = args[0]
                if isinstance(a0, str):
                    # assume a0 is a filename
                    self.load(a0)
                else:
                    # iterable of tuple or dict-like
                    self.update(a0)
            else:
                self.add(*args, **kwargs)

            pydantic.BaseModel.__init__(self, **self.items())
            self.__post_init__(**kwargs)

        def __post_init__(self, **kwargs):
            """Override this method to initialize variables for which values are known only at runtime."""
            pass

        def load(self, fname, names=None, engine='auto', display=False, **kwargs):
            # find a way to skip constant and save computation time?
            super().load(fname, names, engine, display, **kwargs)
            # prefix = f"A value for a variable named '{key}' was found in file '{loaded_from_file}' while "
            # warnings.warn(f"{prefix}'{key}' is not declared in '{self.__class__.__name__}'", stacklevel=2)

        def save(self, fname, names=None, engine='auto', overwrite=True, display=False, **kwargs):
            for key, value in self.items():
                if value is TO_SET_LATER:
                    warnings.warn(f"The variable '{key}' is declared in the '{self.__class__.__name__}' "
                                  f"class definition but was not set.")
            super().save(fname, names, engine, overwrite, display, **kwargs)

        def __setitem__(self, key, value, skip_check_value=False, loaded_from_file=None):
            # we need to keep the attribute in sync (initially to mask the class attribute)
            pydantic.BaseModel.__setattr__(self, key, value)
            self._objects[key] = value

        def __setattr__(self, key, value, skip_check_value=False, loaded_from_file=None):
            # update the real attribute
            pydantic.BaseModel.__setattr__(self, key, value)
            # update self._objects
            Session.__setattr__(self, key, value)

        @pydantic.validator('*')
        def validate_variables(cls, v, field):
            key = field.name
            # --- array with constant axes
            if key in cls.__field_defaults__ and isinstance(field.value, ConstantAxes):
                try:
                    field.value.axes.check_compatible(v.axes)
                except ValueError as error:
                    msg = str(error).replace("incompatible axes:", f"incompatible axes for array '{key}':")\
                        .replace("vs", "was declared as")
                    raise ValueError(msg)
            return v
gdementen commented 4 years ago

@gdementen I already have some proof of concept code for the implementation of the ConstrainedSession class using pydantic.BaseModel (see below):

There are a few things I don't like with this approach...


TO_SET_LATER = object()

try:
    import pydantic
except ImportError:
    pydantic = None

Either we really depend on it (make it non-optional) or we allow something like:

class MySession(ConstrainedSession, pydantic.BaseModel):
    ...

but I would rather not make ConstrainedSession rely on an optional dependency.

    >>> class ModelVariables(ConstrainedSession):
    ...     # declare variables with default initialization value.
    ...     FIRST_OBS_YEAR: int = 1991

Can't we avoid specifying int in this case? It feels redundant.

    ...     LAST_PROJ_YEAR: int = 2070
    ...     AGE: Axis = Axis('age=0..120')
    ...     # declare variables without default value
    ...     # Such variables MUST be initialized at instantiation
    ...     # as argument to the constructor or in the __post_init__ method
    ...     first_proj_year: int
    ...     obs_years: Axis
    ...     population_proj: Array
    ...     # declare arrays with constant axes
    ...     mortality_rate: Array = ConstantAxes((AGE, GENDER))

This is the main point where I think the approach fails. pydantic is very similar to the typing module which is itself not flexible enough for us yet (see my comment at https://github.com/larray-project/larray/pull/840#discussion_r364230103), in that you can provide types or constraints to be validated (a type is just a particular kind of constraint). The thing here is that you provide constraints sometimes via the type (using :) and sometimes via the value (after the =). This is too confusing IMO. I think we should either provide all constraints via the value, or via the type, but not a mix of both.

    ...     # declare variables with a value to be defined after instantiation
    ...     # TO BE USED WITH EXTREME CAUTION
    ...     later: str = TO_SET_LATER

Huh? I don't understand the difference between this and the variables with only a type above (variant_name, etc.)?

    ...
    ...     # method to initialize variables defined at runtime
    ...     def __post_init__(self):

Having to define yet an extra method? Isn't this getting too complex for our users?

alixdamman commented 4 years ago

There are a few things I don't like with this approach...

No kidding!

I would rather not make ConstrainedSession rely on an optional dependency

OK. But I wanted to do something similar to open_excel, which raises an error if xlwings is not installed. I don't see why xlwings can be optional (with open_excel raising an error) but pydantic could not be.

Anyway, I proposed to use pydantic because I recently took a look at the code of pandasSDMX, which relies on pydantic to implement SDMX concepts. I guess I will use either pandasSDMX (at least its model.py module) or pydantic for the SDMX project.

Can't we avoid specifying int in this case? It feels redundant.

No, pydantic proposes something similar to @dataclass via its BaseModel class. pydantic also proposes its own version of the @dataclass decorator (see here). pydantic is supposed to work well with mypy (see here).

This is the main point where I think the approach fails. pydantic is very similar to the typing module which is itself not flexible enough for us yet (see my comment at #840 (comment)), in that you can provide types or constraints to be validated (a type is just a particular kind of constraint). The thing here is that you provide constraints sometimes via the type (using :) and sometimes via the value (after the =). This is too confusing IMO. I think we should either provide all constraints via the value, or via the type, but not a mix of both.

In the DEMO model, I encouraged users to use annotations in their functions, especially to allow auto-completion on methods and attributes of passed arguments. Within PyCharm (with Python 3.6+), writing:

def foo(arr: Array):
    arr.

will end up with a dropdown box with all methods and attributes of Array objects. My idea here is to encourage users in general to use annotations everywhere. Was that a bad idea?

With the string syntax of LArray we already allow users to write code less compatible with the broader Python ecosystem. If we choose the = operator to specify types (or pseudo-types), we move further away from the broader Python ecosystem. Besides, using : in functions and = in constrained classes will be even more confusing.

...     # declare variables with a value to be defined after instantiation
...     # TO BE USED WITH EXTREME CAUTION
...     later: str = TO_SET_LATER

Huh? I don't understand the difference between this and the variables with only a type above (variant_name, etc.)?

pydantic raises an error if not all declared attributes are set with a value. After instantiation, an object is supposed to be fully set up with pydantic. TO_SET_LATER is just a trick to bypass this behavior.
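The sentinel trick can be illustrated without pydantic; here is a plain-Python sketch (class and method names hypothetical): a unique object used as default makes a field formally "set" at instantiation while still being detectable as "not really set yet".

```python
# Sentinel object: identity comparison ('is') distinguishes "declared but
# not yet set" from any legitimate value, including None.
TO_SET_LATER = object()

class Variables:
    def __init__(self, later=TO_SET_LATER):
        self.later = later

    def save(self):
        # refuse to dump a variable that was declared but never given a value
        if self.later is TO_SET_LATER:
            raise ValueError("'later' was declared but never set")
        return f"saving later={self.later}"

v = Variables()      # instantiation succeeds even though 'later' is unset
v.later = 'later'    # value provided after instantiation
print(v.save())      # saving later=later
```

In the actual proof of concept, the same check is done in ConstrainedSession.save(), which emits a warning instead of raising.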

   ...     # method to initialize variables defined at runtime
   ...     def __post_init__(self):

Having to define yet an extra method? Isn't this getting too complex for our users?

__post_init__ actually comes from Python 3.7 dataclasses. A dataclass has a __post_init__ method that is called at the end of the generated __init__ (see here).
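For reference, a minimal runnable sketch of the standard dataclasses __post_init__ hook (field names here are illustrative, not larray API):

```python
from dataclasses import dataclass, field

@dataclass
class Years:
    first_obs_year: int = 1991
    first_proj_year: int = 2020
    # excluded from the generated __init__; computed afterwards
    obs_range: range = field(init=False)

    def __post_init__(self):
        # runs at the end of the generated __init__, where values
        # depending on other fields can be derived
        self.obs_range = range(self.first_obs_year, self.first_proj_year)

y = Years()
print(len(y.obs_range))  # 29
```

This is the same pattern the ConstrainedSession proof of concept mimics: fields known at class-definition time get defaults, fields depending on runtime values get computed in __post_init__.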

gdementen commented 4 years ago

I would rather not make ConstrainedSession rely on an optional dependency

OK. But I wanted to do something similar to open_excel, which raises an error if xlwings is not installed. I don't see why xlwings can be optional (with open_excel raising an error) but pydantic could not be.

Anyway, I proposed to use pydantic because I recently took a look at the code of pandasSDMX, which relies on pydantic to implement SDMX concepts. I guess I will use either pandasSDMX (at least its model.py module) or pydantic for the SDMX project.

The PandaSDMX (there is no second s) API is awful. I just hope you will not mimic it.

Can't we avoid specifying int in this case? It feels redundant.

No, pydantic proposes something similar to @dataclass via its BaseModel class. pydantic also proposes its own version of the @dataclass decorator (see here). pydantic is supposed to work well with mypy (see here).

From pydantic front page example:

class User(BaseModel):
    id: int
    name = 'John Doe'

then:

name is inferred as a string from the provided default; because it has a default, it is not required.

So I assume it is possible. If it isn't in our case, please explain why.

This is the main point where I think the approach fails. pydantic is very similar to the typing module which is itself not flexible enough for us yet (see my comment at #840 (comment)), in that you can provide types or constraints to be validated (a type is just a particular kind of constraint). The thing here is that you provide constraints sometimes via the type (using :) and sometimes via the value (after the =). This is too confusing IMO. I think we should either provide all constraints via the value, or via the type, but not a mix of both.

In the DEMO model, I encouraged users to use annotations in their functions, especially to allow auto-completion on methods and attributes of passed arguments. Within PyCharm (with Python 3.6+), writing: [...] Was that a bad idea?

I never said a word against that... But isn't it possible to define a custom data type (https://pydantic-docs.helpmanual.io/usage/types/#custom-data-types ) instead of using ConstantAxes, so that we can provide the constraint via a type (:), instead of via =?

With the string syntax of LArray we already allow users to write code less compatible with the broader Python ecosystem. If we choose the = operator to specify types (or pseudo-types), we move further away from the broader Python ecosystem. Besides, using : in functions and = in constrained classes will be even more confusing.

That's why I have always pushed against using it outside of an interactive console.

...     # declare variables with a value to be defined after instantiation
...     # TO BE USED WITH EXTREME CAUTION
...     later: str = TO_SET_LATER

Huh? I don't understand the difference between this and the variables with only a type above (variant_name, etc.)?

pydantic raises an error if not all declared attributes are set with a value. After instantiation, an object is supposed to be fully set up with pydantic. TO_SET_LATER is just a trick to bypass this behavior.

Could you explain why a user would need to set values after instantiation?

Having to define yet an extra method? Isn't this getting too complex for our users?

__post_init__ comes from the Python 3.7 Dataclass actually. The Dataclass has a __post_init__ method called at the end of the generated __init__ by default (see here).

This does not change my statement. I would rather not explain that to our users if we can avoid it.

alixdamman commented 4 years ago

From pydantic front page example:

class User(BaseModel):
    id: int
    name = 'John Doe'

then:

name is inferred as a string from the provided default; because it has a default, it is not required.

So I assume it is.

Yes it is. You're totally right. I completely missed that.

I never said a word against that... But isn't it possible to define a custom data type (https://pydantic-docs.helpmanual.io/usage/types/#custom-data-types ) instead of using ConstantAxes, so that we can provide the constraint via a type (:), instead of via =?

I made a quick test with pydantic during the weekend. I didn't get that far in the pydantic documentation. I will take a look.

pydantic raises an error if not all declared attributes are set with a value. After instantiation, an object is supposed to be fully set up with pydantic. TO_SET_LATER is just a trick to bypass this behavior.

Could you explain why we need that?

pydantic requires all fields to be set at instantiation. This means our users will have to define all fields at instantiation, which may be a bit too restrictive for them. Besides, I don't know what would happen if a user class is loaded from several files. I just renamed NOT_LOADED to TO_SET_LATER.

This does not change my statement. I would rather not explain that to our users if we can avoid it.

In that case, we need to define the TO_SET_LATER (or NOT_LOADED) value.

That said, I can understand that we want to make the life of our users easier. I see LArray as something between a new "modelling language" and a Python library. But this requires much more effort from us than developing a plain Python lib (all the work you did for the string syntax was amazing). I would like LArray to be used more internally, but that's another discussion.

Maybe the risk of adding pydantic as a new dependency is too high if we just use it for ConstrainedSession (I am thinking of "dependency hell"). It is OK if you think it is. Even so, I would like to see if there is a possibility to use the : notation instead of the = notation, because I would like to encourage users (those having Python 3.6+) to use annotations (in functions and classes).

gdementen commented 4 years ago

I never said a word against that... But isn't it possible to define a custom data type (https://pydantic-docs.helpmanual.io/usage/types/#custom-data-types ) instead of using ConstantAxes, so that we can provide the constraint via a type (:), instead of via =?

I made a quick test with pydantic during the weekend. I didn't get that far in the pydantic documentation. I will take a look.

For me this question is crucial. If pydantic allows us to do "the right thing" (i.e. annotate classes -- and possibly other code -- with arrays with defined axes), it is worth adding it as a dependency. If it does not, I am unsure it is worth the added trouble.

pydantic raises an error if not all declared attributes are set with a value. After instantiation, an object is supposed to be fully set up with pydantic. TO_SET_LATER is just a trick to bypass this behavior.

Could you explain why we need that?

pydantic requires all fields to be set at instantiation. This means our users will have to define all fields at instantiation, which may be a bit too restrictive for them.

I don't think so, as long as the behavior is limited to ConstrainedSession, not to all sessions.

Besides, I don't know what would happen if a user class is loaded from several files.

That's indeed going to be problematic but I think that in that case we would need to provide a solution, not our users. E.g. provide a flag when loading such a file to tell that it is partial/incomplete and, at that point, set all other fields to TO_SET_LATER_OR_WHATEVER_WE_CALL_IT 😉.

This does not change my statement. I would rather not explain that to our users if we can avoid it.

In that case, we need to define the TO_SET_LATER (or NOT_LOADED) value.

That said, I can understand that we want to make the life of our users easier. I see LArray as something between a new "modelling language" and a Python library. But this requires much more effort from us than developing a plain Python lib (all the work you did for the string syntax was amazing). I would like LArray to be used more internally, but that's another discussion.

Maybe the risk of adding pydantic as a new dependency is too high if we just use it for ConstrainedSession (I am thinking of "dependency hell"). It is OK if you think it is. Even so, I would like to see if there is a possibility to use the : notation instead of the = notation, because I would like to encourage users (those having Python 3.6+) to use annotations (in functions and classes).

A new dependency is always a tradeoff. If we can add a custom type which can validate axes AND work in PyCharm et al., that would be awesome and more than worth it. Otherwise, I am unsure it would be worth it.

alixdamman commented 4 years ago

Alright.

Whether we choose to use pydantic or not, I have a few more suggestions and questions:

Suggestions:

  1. I would like to add a boolean keyword argument allow_extra_vars to control whether or not extra variables are allowed during and after model initialization. Defaults to True.
  2. I would like to add a boolean keyword argument allow_mutation to tell whether or not the constrained class is faux-immutable, i.e. whether __setattr__ and __setitem__ are allowed. Defaults to True. This argument could be interesting for creating a "Parameters" class such as the one defined in the DEMO model.

Questions:

  1. The filter(), compact() and apply() methods of ConstrainedSession will return a normal Session object. Is putting a Warning section in the docstring of ConstrainedSession enough, or should I print a warning when one of these methods is called? I guess a Warning section is enough, but do you think our users will read that section?
  2. What to do with binary operations (== and != excluded)?
    • a) Always return a normal Session object silently
    • b) Always return a normal Session object and print a warning
    • c) Always return a ConstrainedSession object and raise an error in case of any incompatibility (extra arrays, ...)
    • d) Return a ConstrainedSession object if possible. If not, return a normal Session object and print a warning.
  3. Same question for unary operations

Personally, I would go for option d) or b) for questions 2 and 3. I think binary operations are rarely used by our users, but IMO returning a normal Session object should not be done silently in this case.
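Suggestion 2 above (the allow_mutation flag) could look roughly like this in plain Python, independently of pydantic (all names hypothetical, a sketch rather than the proposed implementation):

```python
# Sketch of a faux-immutable session: when allow_mutation is False,
# any assignment after __init__ raises.
class FrozenableSession:
    def __init__(self, allow_mutation=True, **kwargs):
        # bypass our own __setattr__ while initializing
        object.__setattr__(self, '_allow_mutation', allow_mutation)
        for key, value in kwargs.items():
            object.__setattr__(self, key, value)

    def __setattr__(self, key, value):
        if not self._allow_mutation:
            raise TypeError(f"cannot set '{key}': session is immutable")
        object.__setattr__(self, key, value)

# a "Parameters"-style object, frozen after creation
params = FrozenableSession(allow_mutation=False, FIRST_OBS_YEAR=1991)
print(params.FIRST_OBS_YEAR)  # 1991
try:
    params.FIRST_OBS_YEAR = 2000
except TypeError as e:
    print(e)
```

pydantic's Config.allow_mutation option (used in the proof of concept above) provides essentially this behavior out of the box.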

alixdamman commented 4 years ago

For me this question is crucial. If pydantic allows us to do "the right thing" (i.e. annotate classes -- and possibly other code -- with arrays with defined axes), it is worth adding it as a dependency. If it does not, I am unsure it is worth the added trouble.

What you are saying is that you want something like:

class MyDemo(ConstrainedSession):
    pop: ArrayDef("age=0..120; gender=female,male; time=1991..2020")

?

It sounds like it is not possible with the typing module nor with pydantic.

But:

  1. There is a little problem with the current implementation of ArrayDef. This class has only an __init__() method taking axes as argument. So when instantiating the MyDemo class as demo = MyDemo(pop=zeros("age=0..120; gender=female,male; time=1991..2020")), writing pop. later will not pop up the methods of the Array. This is because PyCharm thinks that pop is still an ArrayDef object. This is not an argument against the var = type notation, but the problem is still there and needs to be fixed one way or another.
  2. The operator = is commonly used to define a default value and : to define a type. That is what is used in functions (and traditional dataclasses) and that is what PyCharm relies on to propose auto-completion. So, I would prefer if we stick to that (after all).

So, what we may do is introduce an ArrayDef subclass implemented as something like:

class ArrayDef(Array):
    def __init__(self, data=None, axes=None, dtype=None, meta=None):
        if axes is not None and not isinstance(axes, AxisCollection):
            axes = AxisCollection(axes)
        if data is None:
            data = np.empty(axes.shape, dtype)
        Array.__init__(self, data, axes, dtype, meta)

and then:

from pydantic import BaseModel

class ConstrainedSession(Session, BaseModel):
        ...
        @pydantic.validator('*')
        def validate_variables(cls, v, field):
            key = field.name
            # --- array with constant axes
            if isinstance(field.type_, ArrayDef):
                # no axes given in class definition
                if not field.value:
                    # first initialization
                    if key not in self.keys():
                        return v
                    else:
                       # compare with current value
                       try:
                            self[key].axes.check_compatible(v.axes)
                       except ValueError as error:
                            msg = str(error).replace("incompatible axes:", f"incompatible axes for array '{key}':")\
                                .replace("vs", "was declared as")
                            raise ValueError(msg)
                       return v
                # axes given in class definition
                else:
                    try:
                        field.value.axes.check_compatible(v.axes)
                    except ValueError as error:
                        msg = str(error).replace("incompatible axes:", f"incompatible axes for array '{key}':")\
                            .replace("vs", "was declared as")
                        raise ValueError(msg)
                    return v

and then:

class MyDemo(ConstrainedSession):
    # axes given in class definition
    pop = ArrayDef("age=0..120; gender=female,male; time=1991..2020")
    # axes given at instantiation
    births: ArrayDef

demo = MyDemo(pop=zeros("age=0..120; gender=female,male; time=1991..2020"),
              births=zeros("age=0..120; gender=female,male; time=1991..2020"))
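The error-message rewriting done in the validator above can be illustrated without larray. In this sketch, `check_compatible` is a simplified stand-in for `AxisCollection.check_compatible` (assumed to raise a `ValueError` whose message starts with "incompatible axes:", as the `replace()` calls above suggest):

```python
def check_compatible(declared, actual):
    # stand-in for AxisCollection.check_compatible: raises on mismatch
    if declared != actual:
        raise ValueError(f"incompatible axes: {actual} vs {declared}")

def validate(key, declared_axes, actual_axes):
    # rewrite the generic message into one mentioning the array name,
    # exactly as the pydantic validator sketch does
    try:
        check_compatible(declared_axes, actual_axes)
    except ValueError as error:
        msg = str(error).replace("incompatible axes:",
                                 f"incompatible axes for array '{key}':") \
                        .replace("vs", "was declared as")
        raise ValueError(msg)

try:
    validate('pop', "age=0..120", "age=0..100")
except ValueError as error:
    print(error)  # → incompatible axes for array 'pop': age=0..100 was declared as age=0..120
```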
gdementen commented 4 years ago

For me this question is crucial. If pydantic allows us to do "the right thing" (i.e. annotate classes -- and possible other code -- with arrays with defined axes) it is worth adding it as a dependency. If it does not, I am unsure it is worth the added trouble.

What you are saying is that you want something like:

class MyDemo(ConstrainedSession):
    pop: ArrayDef("age=0..120; gender=female,male; time=1991..2020")

Indeed. Or, in fact, I would even prefer this (the spelling does not matter that much but the structure does):

class MyDemo(ConstrainedSession):
     gender = Axis("gender=female,male")
     time: Axis
     pop: ArrayDef([gender, time])

def init_demo():
    time = Axis('time=1991..2020')
    return MyDemo(time=time, pop=zeros([MyDemo.gender, time]))

It sounds like it is not possible with either the typing module or pydantic.

I knew about typing but I had hopes for pydantic... Too bad.

But:

1. There is a little problem with the current implementation of `ArrayDef`. This class only has an `__init__()` method taking axes as argument. So when instantiating the `MyDemo` class as `demo = MyDemo(pop=zeros("age=0..120; gender=female,male; time=1991..2020"))`, writing `pop.` later will not pop up the methods of `Array`, because PyCharm thinks that `pop` is still an `ArrayDef` object. This is not an argument against using the `var = type` notation, but the problem is still there and needs to be fixed one way or another.

2. The operator `=` is commonly used to define a default value and `:` to define a type. That is what is used in functions (and traditional dataclasses) and that is what PyCharm relies on to propose auto-completion. So, I would prefer if we stick to that (after all).

Indeed. And I would prefer that too (that's the whole point of my comments), but then what you show does not respect that, so I am confused 😕 ...

So, what we may do is to introduce a subclass ArrayDef implemented as something like:


class ArrayDef(Array):
    def __init__(self, data=None, axes=None, dtype=None, meta=None):
        if axes is not None and not isinstance(axes, AxisCollection):
            axes = AxisCollection(axes)
        if data is None:
            data = np.empty(axes.shape, dtype)

Making ArrayDef a subclass of Array should be enough. You do not actually need any data. To trick PyCharm/static analyzers, we do not need the object to be functional. We will not use them anyway.

     Array.__init__(self, data, axes, dtype, meta)

This is not needed either AFAICT.
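The "declaration-only subclass" trick described above can be sketched in isolation. Here `Array` is a plain stand-in for `larray.Array` (an assumption for the sake of a self-contained example; the real `ArrayDef` would subclass `larray.Array`): the point is only that an `ArrayDef` instance passes `isinstance` checks and gets `Array`'s auto-completion without holding any data:

```python
class Array:
    """Stand-in for larray.Array."""
    def mean(self):
        return 0.0

class ArrayDef(Array):
    """Declaration-only placeholder: used in class bodies so that static
    analyzers (e.g. PyCharm) see the variable as an Array. It deliberately
    does NOT call Array.__init__ and carries no data; instances are never
    used at runtime, only their declared axes are inspected."""
    def __init__(self, axes=None, dtype=None, meta=None):
        self.axes = axes
        self.dtype = dtype
        self.meta = meta

pop = ArrayDef("age=0..120; gender=female,male")
print(isinstance(pop, Array))  # → True, so "pop." completes with Array's methods
```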

and then:


class MyDemo(ConstrainedSession):
    # axes given in class definition
    pop = ArrayDef("age=0..120; gender=female,male; time=1991..2020")

Here you use = for a type => I am confused...

gdementen commented 4 years ago

Alright.

Whether we choose to use pydantic or not, I have a few more suggestions and questions:

Suggestions:

1. I would like to add a boolean keyword argument `allow_extra_vars` to control whether or not extra variables are allowed during and after model initialization. Defaults to True.

I fear an option nobody will use. The option could come later if we have demand for it.

2. I would like to add a boolean keyword argument `allow_mutation` to tell whether or not the constrained class is faux-immutable, i.e. whether `__setattr__` and `__setitem__` are allowed. Defaults to True. This argument could be interesting to create a "Parameters" class such as the one defined in the DEMO model.

I fear an option nobody will use. The option could come later if we have demand for it OR we could provide a built-in subclass (e.g. Parameters) with that option enabled.

Questions:

1. The `filter()`, `compact()` and `apply()` methods of `ConstrainedSession` will return a normal `Session` object. Is putting a Warning section in the docstring of ConstrainedSession enough, or should I print a warning when one of these methods is called? I guess a Warning section is enough, but do you think our users will read that section?

They will not read that section, but a dynamic warning would be annoying, so I would only warn in the doc and wait for users to discover it (or we might show an example of such a Constrained -> normal session conversion in the tutorial section about ConstrainedSession)

2. What to do with binary operations (`__eq__` and `__ne__` excluded)?

   * a) Always return a normal `Session` object silently
   * b) Always return a normal `Session` object and print a warning
   * c) Always return a `ConstrainedSession` object and raise an error in case of any incompatibility (extra arrays, ...)
   * d) Return a `ConstrainedSession` object if possible. If not, return a normal `Session` object and print a warning.

I don't really know. I don't like option c) because this severely limits what you can do, but I am unsure about the other options.

3. Same question for unary operations

Personally, I would go for option d) or b) for questions 2 and 3. I think binary operations are rarely used by our users, but IMO returning a normal Session object should not be done silently in this case.

alixdamman commented 4 years ago

The PandaSDMX (there is no second s) API is awful. I just hope you will not mimic it.

No, that's not what I have in mind. I only wanted to be able to import the model.py script from pandaSDMX (which implements the so-called SDMX Information Model (https://sdmx.org/?page_id=5008 > SDMX 2.1 > Section 2 - Information Model)), but on second thought I think it is silly to add a dependency just for one script. I will write my own implementation of the SDMX-IM, also using pydantic because it has an added value in this case.

I have started to read the official technical documentation of the SDMX-IM. The document is messy. There are discrepancies between the class diagrams and the definitions sections. Testing the code is gonna be fun...

alixdamman commented 4 years ago

Suggestions:

  1. I would like to add a boolean keyword argument allow_extra_vars to control whether or not extra variables are allowed during and after model initialization. Defaults to True.

I fear an option nobody will use. The option could come later if we have demand for it.

  1. I would like to add a boolean keyword argument allow_mutation to tell whether or not the constrained class is faux-immutable, i.e. whether `__setattr__` and `__setitem__` are allowed. Defaults to True. This argument could be interesting to create a "Parameters" class such as the one defined in the DEMO model.

I fear an option nobody will use. The option could come later if we have demand for it OR we could provide a built-in subclass (e.g. Parameters) with that option enabled.

Providing a built-in subclass Parameters sounds like a good idea to me. In this case, Parameters should simulate a faux-immutable class and also forbid adding undeclared variables (and deleting declared ones) IMO.
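A minimal sketch of what such a built-in `Parameters` class could look like (an assumption for illustration: the real version would derive from `ConstrainedSession`; here a bare class shows only the faux-immutability part via `__setattr__`/`__delattr__` overrides):

```python
class Parameters:
    """Faux-immutable container: variables are set once at initialization,
    after which rebinding or deleting them raises an error."""
    def __init__(self, **variables):
        # bypass our own __setattr__ while filling in the variables
        object.__setattr__(self, '_initialized', False)
        for name, value in variables.items():
            object.__setattr__(self, name, value)
        object.__setattr__(self, '_initialized', True)

    def __setattr__(self, name, value):
        if self._initialized:
            raise TypeError(f"Parameters is faux-immutable; cannot set {name!r}")
        object.__setattr__(self, name, value)

    def __delattr__(self, name):
        raise TypeError(f"Parameters is faux-immutable; cannot delete {name!r}")

params = Parameters(retirement_age=65)
print(params.retirement_age)  # → 65  (reading is fine; assignment raises TypeError)
```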

1. The `filter()`, `compact()` and `apply()` methods of `ConstrainedSession` will return a normal `Session` object. Is putting a Warning section in the docstring of ConstrainedSession enough, or should I print a warning when one of these methods is called? I guess a Warning section is enough, but do you think our users will read that section?

They will not read that section, but a dynamic warning would be annoying, so I would only warn in the doc and wait for users to discover it (or we might show an example of such a Constrained -> normal session conversion in the tutorial section about ConstrainedSession)

OK

  1. What to do with binary operations (`__eq__` and `__ne__` excluded)?

  * a) Always return a normal Session object silently
  * b) Always return a normal Session object and print a warning
  * c) Always return a ConstrainedSession object and raise an error in case of any incompatibility (extra arrays, ...)
  * d) Return a ConstrainedSession object if possible. If not, return a normal Session object and print a warning.

I don't really know. I don't like option c) because this severely limits what you can do, but I am unsure about the other options.

OK to exclude option c) which is indeed too restrictive. I'll go for option d) (but without warnings?).

gdementen commented 4 years ago
  1. What to do with binary operations (`__eq__` and `__ne__` excluded)?
  * a) Always return a normal `Session` object silently
  * b) Always return a normal `Session` object and print a warning
  * c) Always return a `ConstrainedSession` object and raise an error in case of any incompatibility (extra arrays, ...)
  * d) Return a `ConstrainedSession` object if possible. If not, return a normal `Session` object and print a warning.

I don't really know. I don't like option c) because this severely limits what you can do, but I am unsure about the other options.

OK to exclude option c) which is indeed too restrictive. I'll go for option d) (but without warnings?).

In fact, even option c might make sense as long as we have both 1) a way to easily convert a ConstrainedSession to a normal Session and 2) we support binary ops between a ConstrainedSession and a normal Session without error even if they differ (either returning a ConstrainedSession or Session).

In other words, one more option (I am not saying it is the best option -- I don't know) is to behave differently depending on if we are doing ConstrainedSession + Session or ConstrainedSession + ConstrainedSession. We could be more restrictive in the second case.

In short, I still don't know what to do... Sorry, I really don't have a clear mind these days... I guess you will have to pick one and see how it goes in practice. I usually don't like working like that but...

alixdamman commented 4 years ago

In fact, even option c might make sense as long as we have both 1) a way to easily convert a ConstrainedSession to a normal Session and 2) we support binary ops between a ConstrainedSession and a normal Session without error even if they differ (either returning a ConstrainedSession or Session).

In other words, one more option (I am not saying it is the best option -- I don't know) is to behave differently depending on if we are doing ConstrainedSession + Session or ConstrainedSession + ConstrainedSession. We could be more restrictive in the second case.

In short, I still don't know what to do... Sorry, I really don't have a clear mind these days... I guess you will have to pick one and see how it goes in practice. I usually don't like working like that but...

What I have implemented now (but didn't push yet) is:

        def opmethod(self, other):
                    (...)
                    res.append((name, res_item))
            try:
                # XXX: print a warning?
                ses = self.__class__(res)
            except Exception:
                ses = Session(res)
            return ses

in both `_binaryop` and `_unaryop`.
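The fallback behaviour of this `opmethod` can be demonstrated with simplified stand-ins (an assumption for the sake of a runnable example: `Session` and `ConstrainedSession` are reduced to dict subclasses here; larray's real classes do much more):

```python
class Session(dict):
    """Simplified stand-in for larray.Session."""

class ConstrainedSession(Session):
    """Stand-in that only accepts a fixed set of variable names."""
    REQUIRED = {'pop', 'births'}
    def __init__(self, items):
        items = dict(items)
        if set(items) != self.REQUIRED:
            raise ValueError("unexpected or missing variables")
        super().__init__(items)

def opmethod(self, other, op):
    # union of variable names, as Session binary ops do
    names = list(self) + [n for n in other if n not in self]
    res = [(n, op(self.get(n, 0), other.get(n, 0))) for n in names]
    try:
        # XXX: print a warning?
        ses = self.__class__(res)   # keep the constrained type if possible
    except Exception:
        ses = Session(res)          # otherwise fall back to a plain Session
    return ses

cs = ConstrainedSession({'pop': 10, 'births': 2})
same = opmethod(cs, Session({'pop': 1, 'births': 1}), lambda a, b: a + b)
print(type(same).__name__)  # → ConstrainedSession

# an extra variable breaks the constraint, so a plain Session comes back
out = opmethod(cs, Session({'pop': 1, 'extra': 5}), lambda a, b: a + b)
print(type(out).__name__)   # → Session
```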

alixdamman commented 4 years ago

I really don't have a clear mind these days...

Me neither...

gdementen commented 4 years ago

What I have implemented now (but didn't push yet) is:

        def opmethod(self, other):
                    (...)
                    res.append((name, res_item))
            try:
                # XXX: print a warning?
                ses = self.__class__(res)
            except Exception:
                ses = Session(res)
            return ses

in both `_binaryop` and `_unaryop`.

Seems good enough for now. We can always refine this later.

alixdamman commented 4 years ago

Or, in fact, I would even prefer this (the spelling does not matter that much but the structure does):

class MyDemo(ConstrainedSession):
     gender = Axis("gender=female,male")
     time: Axis
     pop: ArrayDef([gender, time])

def init_demo():
    time = Axis('time=1991..2020')
    return MyDemo(time=time, pop=zeros([MyDemo.gender, time]))

It sounds like it is not possible with either the typing module or pydantic.

I knew about typing but I had hopes for pydantic... Too bad.

I was speaking too fast. I finally found some time to take a deeper look at the pydantic library. I made some quick tests and it seems that implementing:

AGE = Axis("age=0..120")
GENDER = Axis("gender=female,male")

class MyDemo(ConstrainedSession):
     pop: ArrayDef((AGE, GENDER))

def init_demo():
    return MyDemo(pop=zeros((AGE, GENDER)))

is actually possible with pydantic.

Maybe implementing something like `pop: ArrayDef((GENDER, 'time'))`, where the axis `time` is defined at runtime, is possible.

gdementen commented 4 years ago

🎉 Yippee! 🎉 Now pydantic looks a lot more interesting to me. How does PyCharm behave with such annotations?

alixdamman commented 4 years ago

How does PyCharm behave with such annotations?

Quite well actually, but I have to say that I recently installed the 2020.1 version of PyCharm.