intellimath / recordclass

Mutable variant of namedtuple -- recordclass, which support assignments, compact dataclasses and other memory saving variants.
Other
14 stars 3 forks source link

Per instance object type default property value #2

Closed char101 closed 10 months ago

char101 commented 10 months ago

Hi,

When creating a class with object type default value, e.g.

class A(dataobject):
    l: list = []

the value of l is actually shared between the instances. I think 99% percent of the use case actually want a per instance l attribute value. When defining a constructor manually we can set None as the default value for l and then assign it in the constructor but in this case it is not clear whether l is shared or not.

Do you think it is possible to create a copy of the default value when it is of list/dict/object type when creating the instance?

intellimath commented 10 months ago

Do I understand correctly that you want each instance to be initialized at creation so that ob.l has its own []?

char101 commented 10 months ago

Yes, because it is not obvious that [] is a default parameter in a function call, so while I might remember it now, I might forget it in the future and it will introduce bug in the program.

If I create a normal class

class A:
    def __init__(self, l=[]):
        self.l = l

it is obvious here that [] is shared because it is a default parameter value. So I will instead write it as

class A:
    def __init__(self, l=None):
        self.l = l or []

But when using dataobject

class A(dataobject):
    l: list = []

it is not obvious whether [] is a default parameter in the constructor or whether it is assigned inside the constructor.

Since it is more likely that not for a data object that the user wants [] not to be a shared value, I think it would be better if possible if recordclass could detect if the default value is a list/dict/object and initialize it inside the constructor instead of making it a default parameter value.

intellimath commented 10 months ago

May be something like this?

   class A(dataobject):
         l: list = Copy([])

where Copy(ob) is a special object of the type Copy so that at the instantiation phase the value of ob.__copy__() will be assigned to the field of the instance?

intellimath commented 10 months ago

Sorry list has no __copy__ method, so a more complex scenario will be required:

tp = type(ob)
if tp is list:
       return ob[:]
if tp is dict:
       return ob.copy()
else:
      return ob.__copy__()
char101 commented 10 months ago

May be something like this?

   class A(dataobject):
         l: list = Copy([])

where Copy(ob) is a special object of the type Copy so that at the instantiation phase the value of ob.__copy__() will be assigned to the field of the instance?

This answers the how question, i.e. how to initialize the instance with a copy of the list?

But there is another problem, let's say I haven't used this library for months, then I might not remember to use CopyOf when assigning the default value.

Of course, I understand if you don't want to change the existing behavior. In that case, probably the best thing that can be done is to add a method to tag the default value to be copied like using CopyOf above and add it in the documentation.

In my case I will probably fork the source and change the behavior to automatically copy list/dict/object if it is possible to address the second problem above.

intellimath commented 10 months ago

The present behavior is consistent with how default values are passed when calling a function in python. This behavior allows you not to lose performance when simple values (not containers) are passed as parameters. Therefore, it seems to me that passing a value wrapped with a special type is a possible mechanism to solve the problem. Another solution that you suggested is to add a flag by which all values of known container types will be copied and re-assigned.

   class A(dataobject, copy=True):
            l: list = []

I could even implement both options.

char101 commented 10 months ago

Sorry list has no __copy__ method, so a more complex scenario will be required:

tp = type(ob)
if tp is list:
       return ob[:]
if tp is dict:
       return ob.copy()
else:
      return ob.__copy__()

For nested values, copy.deepcopy might be required.

import copy

l = [1,2,3]
l2 = copy.deepcopy(l)
l.append(4)
print(l2)

d = {'a': 1}
d2 = copy.deepcopy(d)
d['b'] = 2
print(d2)

l = [[1,2,3]]
l2 = l[:]
l[0].append(4)
print(l2) # shallow copy
intellimath commented 10 months ago

Another more general approach comes to mind

 class A(data object):
       l : list = Convert(list, [])

 def create_class(default):
       class A(data object):
             l : list = Convert(list, default)
       return A
char101 commented 10 months ago

The present behavior is consistent with how default values are passed when calling a function in python.

I think this is exactly the problem. Defining a dataobject does not feel like calling a function. Which is why I or other people might not realize that the default value follows function call default value semantic, i.e. it is only created once.

intellimath commented 10 months ago

Then one need to explicitly specify the change in semantics?

   class A(dataobject, copy=True):
            l: list = []
char101 commented 10 months ago

Then one need to explicitly specify the change in semantics:

   class A(dataobject, copy=True):
            l: list = []

I think this one is good syntax. copy is too general, maybe deepcopyobj?

intellimath commented 10 months ago

I think copyobj and deepcopyobj could be two different options?

char101 commented 10 months ago

I think copyobj and deepcopyobj could be two different options?

I don't have opinion on this, but by the way do you think copyarg and deepcopyarg might be a better name since the one that is copied is the default parameter value? On the other hand copyparam and deepcopyparam feels too long.

intellimath commented 10 months ago

copyarg and deepcopyarg seems to be shorter.

intellimath commented 10 months ago

There is one important circumstance: by default, instances of

class A(dataobject):
   l: list

do not participate in the cyclic garbage collection mechanism (only reference counting), which reduces memory consumption, etc. As soon as complex objects (for example, containers) are used as values, there is a risk of a reference cycle. Therefore, one need to use such classes with a care.

Or explicitly specify:

 class A(dataobject, gc=True):
         l: list
char101 commented 10 months ago

If you decide to implement those options, is there a way to change the default value to True (then I might be able to put it in sitecustomize.py)?

From what I see, the only way to change the default value is to change the default parameter value in dataclass.__new__.

intellimath commented 10 months ago

In principle, it would be possible to set default parameter values in the datatype metaclass, which could be changed after import:

  from recordclass.datatype import datatype

  datatype.config['copyarg'] = True
char101 commented 10 months ago

In principle, it would be possible to set default parameter values in the datatype metaclass, which could be changed after import:

  from recordclass.datatype import datatype

  datatype.config['copyarg'] = True

Yes, I think it would be great if the keyword parameters of dataclass.__new__ can be moved into a static property of the dataclass class, and then the __new__ method would just accept **kwargs. Setting the options then can be done by looping the config property and matching it with the kwargs values.

intellimath commented 10 months ago

There is one important circumstance: by default, instances of

class A(dataobject):
   l: list

do not participate in the cyclic garbage collection mechanism (only reference counting), which reduces memory consumption, etc. As soon as complex objects (for example, containers) are used as values, there is a risk of a reference cycle. Therefore, one need to use such classes with a care. Or explicitly specify:

 class A(dataobject, gc=True):
         l: list

Great point, maybe if copyarg is specified, the default value for gc can be automatically set to True, unless the user explicitly set it to False?

To be honest, I would not like to set gc=True by default. Many complex objects do not necessarily participate in the reference cycles unless they are part of recursive data structures and are used under the control of the developer. If not then you really need to set gc=True.

intellimath commented 10 months ago

In principle, it would be possible to set default parameter values in the datatype metaclass, which could be changed after import:

  from recordclass.datatype import datatype

  datatype.config['copyarg'] = True

Yes, I think it would be great if the keyword parameters of dataclass.__new__ can be moved into a static property of the dataclass class, and then the __new__ method would just accept **kwargs. Setting the options then can be done by looping the config property and matching it with the kwargs values.

If this needs to be set for the class being created, then copyarg=True may be sufficient. If this needs to be made the default behavior for all created dataclasses, then maybe it's better to set it at the metaclass datatype level?

char101 commented 10 months ago

To be honest, I would not like to set gc=True by default. Many complex objects do not necessarily participate in the reference cycles unless they are part of recursive data structures and are used under the control of the developer. If not then you really need to set gc=True.

You are right. I misunderstood the gc option.

By the way, what would happen if a dataobject instances references each other without specifying gc=True? Will python crash or will it results in memory leak?

intellimath commented 10 months ago

i.e. one could be write:

from recordclass import datatype datatype.config['copyarg'] = True

class A(dataobject): l : list = []

class B(dataobject): d : dict = {}

intellimath commented 10 months ago

To be honest, I would not like to set gc=True by default. Many complex objects do not necessarily participate in the reference cycles unless they are part of recursive data structures and are used under the control of the developer. If not then you really need to set gc=True.

You are right. I misunderstood the gc option.

By the way, what would happen if a dataobject instances references each other without specifying gc=True? Will python crash or will it results in memory leak?

There may be a reference cycle. If one not care about reference cycles with his instances then better to set gc=True.

intellimath commented 10 months ago

or

   from recordclass import datatype

   datatype.config['gc'] = True

?

intellimath commented 10 months ago

There may be another solution in order to do not touch the global config:

   class Base(dataobject, copyarg=True, gc=True):
              pass

   class A(Base):
         l : list = []
char101 commented 10 months ago

i.e. one could be write:

from recordclass import datatype datatype.config['copyarg'] = True

class A(dataobject): l : list = []

class B(dataobject): d : dict = {}

I think there are different use cases here

  1. I mainly work on personal projects, so there is no problem with me changing the default behavior, so I might simply put datatype.config['copyarg'] in sitecustomize.py and make it the default behavior for all codes.
  2. Other people that works in a team or in a large codebase might want to be more conservative by changing the option per class.

I can't think of a use case where the user want a list or dict to be shared between instances of a data object, so personally I think a good default might be

  1. automatically (shallow) copy list and dict in default parameter values
  2. if copyobj is specified, also make shallow copy of objects in default parameter values (since copying objects might have side effect)
  3. if deepcopyarg is specified, make deep copy of list, dict, and objects
char101 commented 10 months ago

There may be another solution in order to do not touch the global config:

   class Base(dataobject, copyarg=True, gc=True):
              pass

   class A(Base):
         l : list = []

I didn't know that the parameter can also be passed via inheritance. Yes I can use this, I only need to create a dataobject wrapper in my own module and then change the import.

intellimath commented 10 months ago

There may be another solution in order to do not touch the global config:

   class Base(dataobject, copyarg=True, gc=True):
              pass

   class A(Base):
         l : list = []

I didn't know that the parameter can also be passed via inheritance. Yes I can use this, I only need to create a dataobject wrapper in my own module and then change the import.

This is not current behaviour. It could be implemented for the next release?

intellimath commented 10 months ago

I now see the following plan for inclusion in the next release:

  1. Add a new copyarg parameter (False by default).
  2. Implement inheritance for options (gc, copyarg, readonly, etc.).

Then you can proceed as follows:

 class Base(dataobject, copyarg=True): # gc=True, if it needs
         pass

 class A(Base):
        l: list = []

Is that enough to solve your problem?

char101 commented 10 months ago

Yes, that should be enough, thank you.

intellimath commented 10 months ago

I just realized that the name copy_default would be more appropriate instead of copyarg, since only the default value of the field is copied.

intellimath commented 10 months ago

Corrected plan for inclusion in the next release:

  1. Add a new copy_default parameter (False by default).
  2. Implement inheritance for options (gc, copy_default, readonly, etc.).

Then you can proceed as follows:

 class Base(dataobject, copy_default=True): # gc=True, if it needs
         pass

 class A(Base):
        l: list = []
char101 commented 10 months ago

Another possibility is copyref since only reference type is copied.

intellimath commented 10 months ago

The point is that only the default value for the field of the dataclass instance is copied. For example, after a=A(arg) this is true: a.l is arg.

char101 commented 10 months ago

Hi, thank you for implementing the copy_default option. That was fast. While at it, do you think set also could be copied? Since python has 4 builtins data structures: list, dict, set, and tuples.

class A(dataobject):
    s: set = set()
intellimath commented 10 months ago

Yes, I can add that too. But a shallow copy of the tuple doesn't make sense. The option copy_default support only shallow copy. Next time when deepcopy_default will be added support for tuple will be resonable.

char101 commented 10 months ago

Tuple is immutable so like you said there is no need to copy it.

Regarding deepcopying, do you think a factory method like used by attrs is possible?

https://www.attrs.org/en/stable/init.html#defaults

This would handle more cases including deepcopying and object construction (the constructor is not called when using object __copy__, the user might require the constructor to be run everytime the object is created).

from recordclass import dataobject, Factory
class A:
    o1: Person = Factory(lambda: Person())
    o2: tuple = Factory(lambda: (AnObject(), AnotherObjet())
    o3: Person = Factory(Person)
    created: datetime = Factory(datetime.now)
intellimath commented 10 months ago

It's a good idea.

intellimath commented 10 months ago

Now Factory support is in the repository.

char101 commented 10 months ago

Thank you. I will close the issue now since it has been resolved.