
Recordclass library


Recordclass is an MIT-licensed Python library. It started as a "proof of concept" for the problem of a fast "mutable" alternative to namedtuple (see the question on stackoverflow). It has since evolved to provide additional memory-saving, fast, and flexible types.

The recordclass library provides record/data-like classes that do not participate in the cyclic garbage collection (GC) mechanism by default, relying only on reference counting for garbage collection. Instances of such classes have no PyGC_Head prefix in memory, which decreases their size and gives a slightly faster path for instance creation and deallocation. This makes sense in cases where it is necessary to limit the size of objects as much as possible, provided that they will never be part of reference cycles in the application -- for example, when an object represents a record whose fields, by convention, hold values of simple types (int, float, str, date/time/datetime, timedelta, etc.).

In order to illustrate this, consider a simple class with type hints:

class Point:
    x: int
    y: int

By tacit agreement, instances of the class Point are supposed to have attributes x and y with values of type int. Assigning values of other types, which are not subclasses of int, should be considered a violation of the agreement.

Other examples are non-recursive data structures in which all leaf elements represent values of atomic types. Of course, in Python nothing prevents you from "shooting yourself in the foot" by creating a reference cycle in script or application code. But in many cases this can still be avoided, provided that developers understand what they are doing and use such classes in the codebase with care. Another option is to use static code analyzers along with type annotations to monitor compliance with the type hints.

The library is built on top of the base class dataobject. The type of dataobject is the special metaclass datatype. It controls the creation of subclasses, which by default will not participate in cyclic GC and do not contain a PyGC_Head prefix, __dict__, or __weakref__. As a result, instances of such classes need less memory. Their memory footprint is similar to that of instances of classes with __slots__, but without the PyGC_Head, so the difference in memory size equals the size of PyGC_Head. The metaclass also tunes the basicsize of the instances, creates descriptors for the fields, and so on. All subclasses of dataobject created by a class statement support an attrs/dataclasses-like API. For example:

    from recordclass import dataobject, astuple, asdict
    class Point(dataobject):
        x:int
        y:int

    >>> p = Point(1, 2)
    >>> astuple(p)
    (1, 2)
    >>> asdict(p)
    {'x': 1, 'y': 2}

The recordclass factory creates a dataobject-based subclass with the specified fields and support for the namedtuple-like API. By default it does not participate in cyclic GC either.

    >>> from recordclass import recordclass
    >>> Point = recordclass('Point', 'x y')
    >>> p = Point(1, 2)
    >>> p.y = -1
    >>> print(p._astuple)
    (1, -1)
    >>> x, y = p
    >>> print(p._asdict)
    {'x': 1, 'y': -1}
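For contrast, the standard library's namedtuple rejects attribute assignment; the sketch below (stdlib only) shows the immutable behavior that recordclass relaxes:

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')
p = Point(1, 2)

try:
    p.y = -1                  # namedtuple instances are immutable
except AttributeError as exc:
    print('assignment failed:', exc)

# The namedtuple workaround is _replace, which builds a new instance:
print(p._replace(y=-1))       # Point(x=1, y=-1)
```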

The library also provides a factory function make_dataclass for creating subclasses of dataobject with the specified field names. These subclasses support an attrs/dataclasses-like API. This is equivalent to creating subclasses of dataobject using a class statement. For example:

    >>> Point = make_dataclass('Point', 'x y')
    >>> p = Point(1, 2)
    >>> p.y = -1
    >>> print(p.x, p.y)
    1 -1

If one wants to use a sequence for initialization, then:

    >>> p = Point(*sequence)

There is also a factory function make_arrayclass for creating subclasses of dataobject that can be considered as compact arrays of simple objects. For example:

    >>> Pair = make_arrayclass(2)
    >>> p = Pair(2, 3)
    >>> p[1] = -1
    >>> print(p)
    Pair(2, -1)

In addition, the library provides the classes lightlist and litetuple (immutable), which can be considered as list-like and tuple-like light containers that save memory. They are not supposed to participate in cyclic GC either. The mutable variant of litetuple is called mutabletuple. For example:

    >>> lt = litetuple(1, 2, 3)
    >>> mt = mutabletuple(1, 2, 3)
    >>> lt == mt
    True
    >>> mt[-1] = -3
    >>> lt == mt
    False
    >>> print(sys.getsizeof((1,2,3)), sys.getsizeof(litetuple(1,2,3)))
    64 48

Note that to create a litetuple or mutabletuple from an iterable, unpack it:

    >>> seq = [1,2,3]
    >>> lt = litetuple(*seq)
    >>> mt = mutabletuple(*seq)

Memory footprint

The following table explains the memory footprint of dataobject-based objects and litetuples:

    tuple/namedtuple    class with __slots__    recordclass/dataobject    litetuple/mutabletuple
    g+b+s+n×p           g+b+n×p                 b+n×p                     b+s+n×p

where:

g: size of the PyGC_Head prefix
b: size of the PyObject header (reference count and pointer to the type)
s: size of the ob_size field of variable-sized objects
n: number of fields/items
p: size of a pointer (8 bytes on 64-bit platforms)

This is useful in cases where you are absolutely sure that reference cycles cannot occur -- for example, when all field values are instances of atomic types. As a result, the size of an instance is decreased by 24-32 bytes for CPython 3.4-3.7 and by 16 bytes for CPython >= 3.8.
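Plugging in the component sizes of a 64-bit CPython >= 3.8 build (an assumption; other builds differ), the formulas can be checked against sys.getsizeof for the built-in tuple:

```python
import sys

# Component sizes, assuming 64-bit CPython >= 3.8:
g = 16   # PyGC_Head prefix
b = 16   # PyObject header: reference count + type pointer
s = 8    # ob_size field of variable-sized objects
p = 8    # one pointer-sized slot per field/item
n = 3    # number of items/fields

print('tuple/namedtuple:      ', g + b + s + n * p)   # 64
print('litetuple/mutabletuple:', b + s + n * p)       # 48
print('recordclass/dataobject:', b + n * p)           # 40
print('measured tuple size:   ', sys.getsizeof((1, 2, 3)))
```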

Performance counters

Here are the tables with performance counters, measured using the tools/perfcounts.py script:

    id              size  new   getattr  setattr  getitem  setitem  getkey  setkey  iterate  copy
    litetuple       48    0.18                    0.2                               0.33     0.19
    mutabletuple    48    0.18                    0.21     0.21                     0.33     0.18
    tuple           64    0.24                    0.21                              0.37     0.16
    namedtuple      64    0.75  0.23              0.21                              0.33     0.21
    class+slots     56    0.68  0.29     0.33
    dataobject      40    0.25  0.23     0.29     0.2      0.22                     0.33     0.2
    dataobject+gc   56    0.27  0.22     0.29     0.19     0.21                     0.35     0.22
    dict            232   0.32                                      0.2     0.24    0.35     0.25
    dataobject+map  40    0.25  0.23     0.3                        0.29    0.29    0.32     0.2

    id              size  new   getattr  setattr  getitem  setitem  getkey  setkey  iterate  copy
    litetuple       48    0.11                    0.11                              0.18     0.09
    mutabletuple    48    0.11                    0.11     0.12                     0.18     0.08
    tuple           64    0.1                     0.08                              0.17     0.1
    namedtuple      64    0.49  0.13              0.11                              0.17     0.13
    class+slots     56    0.31  0.06     0.06
    dataobject      40    0.13  0.06     0.06     0.11     0.12                     0.16     0.12
    dataobject+gc   56    0.14  0.06     0.06     0.1      0.12                     0.16     0.14
    dict            184   0.2                                       0.12    0.13    0.19     0.13
    dataobject+map  40    0.12  0.07     0.06                       0.15    0.16    0.16     0.12
    class           56    0.35  0.06     0.06

    id              size  new   getattr  setattr  getitem  setitem  getkey  setkey  iterate  copy
    litetuple       48    0.13                    0.12                              0.19     0.09
    mutabletuple    48    0.13                    0.11     0.12                     0.18     0.09
    tuple           64    0.11                    0.09                              0.16     0.09
    namedtuple      64    0.52  0.13              0.11                              0.16     0.12
    class+slots     56    0.34  0.08     0.07
    dataobject      40    0.14  0.08     0.08     0.11     0.12                     0.17     0.12
    dataobject+gc   56    0.15  0.08     0.07     0.12     0.12                     0.17     0.13
    dict            184   0.19                                      0.11    0.14    0.2      0.12
    dataobject+map  40    0.14  0.08     0.08                       0.16    0.17    0.17     0.12
    class           48    0.41  0.08     0.08

The main repository for recordclass is on github.

More examples can be found in the examples folder.

Quick start

Installation

Installation from directory with sources

Install:

$ python3 setup.py install

Run tests:

$ python3 test_all.py

Installation from PyPI

Install:

$ pip3 install recordclass

Run tests:

$ python3 -c "from recordclass.test import *; test_all()"

Quick start with dataobject

dataobject is the base class for creating data classes with fast instance creation and a small memory footprint. Its subclasses provide a dataclass-like API.

First, import the necessary names:

>>> from recordclass import dataobject, asdict, astuple, as_dataclass, as_record, make_dataclass

Define the class in one of the following ways:

class Point(dataobject):
    x: int
    y: int

or

@as_dataclass()
class Point:
    x: int
    y: int

or

@as_record
def Point(x:int, y:int): pass

or

>>> Point = make_dataclass("Point", [("x",int), ("y",int)])

or

>>> Point = make_dataclass("Point", {"x":int, "y":int})

Annotations of the fields are defined as a dict in __annotations__:

>>> print(Point.__annotations__)
{'x': <class 'int'>, 'y': <class 'int'>}

There is default text representation:

>>> p = Point(1, 2)
>>> print(p)
Point(x=1, y=2)

Instances have the minimum memory footprint possible for a CPython object that contains only Python objects:

>>> sys.getsizeof(p) # the output below for python 3.8+ (64bit)
40
>>> p.__sizeof__() == sys.getsizeof(p) # no additional space for cyclic GC support
True

The instance is mutable by default:

>>> p_id = id(p)
>>> p.x, p.y = 10, 20
>>> id(p) == p_id
True
>>> print(p)
Point(x=10, y=20)

There are functions asdict and astuple for converting to dict and to tuple:

>>> asdict(p)
{'x': 10, 'y': 20}
>>> astuple(p)
(10, 20)

By default, subclasses of dataobject are mutable. If one wants to make them immutable, there is the option readonly=True:

class Point(dataobject, readonly=True):
    x: int
    y: int

>>> p = Point(1,2)
>>> p.x = -1
Traceback (most recent call last):
  ...
TypeError: item is readonly

By default, subclasses of dataobject are not iterable. If one wants to make them iterable, there is the option iterable=True:

class Point(dataobject, iterable=True):
    x: int
    y: int

>>> p = Point(1,2)
>>> for x in p: print(x)
1
2

Default values are also supported:

class CPoint(dataobject):
    x: int
    y: int
    color: str = 'white'

or

>>> CPoint = make_dataclass("CPoint", [("x",int), ("y",int), ("color",str)], defaults=("white",))

>>> p = CPoint(1,2)
>>> print(p)
CPoint(x=1, y=2, color='white')

However,

class PointInvalidDefaults(dataobject):
    x: int = 0
    y: int

is not allowed: a field without a default value may not appear after a field with a default value.
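The same ordering rule exists in the standard dataclasses module, which can be used to illustrate it (a stdlib parallel, not recordclass itself):

```python
from dataclasses import dataclass

try:
    @dataclass
    class PointInvalidDefaults:
        x: int = 0
        y: int          # non-default field after a field with a default
except TypeError as exc:
    print('rejected:', exc)
```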

There is an option copy_default (starting from 0.21) to assign a copy of the default value when creating an instance:

class Polygon(dataobject, copy_default=True):
    points: list = []

>>> pg1 = Polygon()
>>> pg2 = Polygon()
>>> pg1.points == pg2.points
True
>>> pg1.points is pg2.points
False
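The stdlib analogue of copy_default is a default_factory in dataclasses, which likewise gives every instance its own copy of a mutable default:

```python
from dataclasses import dataclass, field

@dataclass
class Polygon:
    # Each instance gets a fresh list instead of sharing one default object.
    points: list = field(default_factory=list)

pg1, pg2 = Polygon(), Polygon()
print(pg1.points == pg2.points)   # True: equal values
print(pg1.points is pg2.points)   # False: distinct list objects
```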

Factory (starting from 0.21) allows one to set up a factory function that computes the default value:

from recordclass import Factory

class A(dataobject, copy_default=True):
    x: tuple = Factory( lambda: (list(), dict()) )

>>> a = A()
>>> b = A()
>>> a.x == b.x
True
>>> a.x[0] is b.x[0]
False
>>> a.x[1] is b.x[1]
False

If someone wants to define a class attribute, then there is the ClassVar trick:

from typing import ClassVar

class Point(dataobject):
    x: int
    y: int
    color: ClassVar[int] = 0

>>> print(Point.__fields__)
('x', 'y')
>>> print(Point.color)
0

If the default value for a ClassVar attribute is not specified, it is simply excluded from __fields__.
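The same ClassVar behavior can be observed with stdlib dataclasses, where ClassVar annotations are likewise excluded from the instance fields:

```python
from dataclasses import dataclass, fields
from typing import ClassVar

@dataclass
class Point:
    x: int
    y: int
    color: ClassVar[int] = 0   # class attribute, not an instance field

print([f.name for f in fields(Point)])   # ['x', 'y']
print(Point.color)                       # 0
```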

Starting with Python 3.10, __match_args__ is specified by default so that __match_args__ == __fields__. Users can define their own during definition:

class User(dataobject):
    first_name: str
    last_name: str
    age: int
    __match_args__ = 'first_name', 'last_name'

or

from recordclass import MATCH
class User(dataobject):
    first_name: str
    last_name: str
    _: MATCH
    age: int

or

User = make_dataclass("User", "first_name last_name * age")

Quick start with recordclass

The recordclass factory function is designed to create classes that support the namedtuple API, can be mutable or immutable, provide fast instance creation, and have a minimal memory footprint.

First, import the factory function:

>>> from recordclass import recordclass

Example with recordclass:

>>> Point = recordclass('Point', 'x y')
>>> p = Point(1,2)
>>> print(p)
Point(1, 2)
>>> print(p.x, p.y)
1 2
>>> p.x, p.y = 1, 2
>>> print(p)
Point(1, 2)
>>> sys.getsizeof(p) # the output below is for 64bit cpython3.8+
32

Example with class statement and typehints:

>>> from recordclass import RecordClass

class Point(RecordClass):
   x: int
   y: int

>>> print(Point.__annotations__)
{'x': <class 'int'>, 'y': <class 'int'>}
>>> p = Point(1, 2)
>>> print(p)
Point(1, 2)
>>> print(p.x, p.y)
1 2
>>> p.x, p.y = 1, 2
>>> print(p)
Point(1, 2)

By default, instances of recordclass-based classes do not participate in cyclic GC and are therefore smaller than namedtuple-based ones. If one wants to use them in scenarios with reference cycles, then the option gc=True must be used (gc=False by default):

>>> Node = recordclass('Node', 'root children', gc=True)

or

class Node(RecordClass, gc=True):
     root: 'Node'
     children: list

The recordclass factory can also specify the types of the fields:

>>> Point = recordclass('Point', [('x',int), ('y',int)])

or

>>> Point = recordclass('Point', {'x':int, 'y':int})

Using dataobject-based classes with mapping protocol

class FastMappingPoint(dataobject, mapping=True):
    x: int
    y: int

or

FastMappingPoint = make_dataclass("FastMappingPoint", [("x", int), ("y", int)], mapping=True)

>>> p = FastMappingPoint(1,2)
>>> print(p['x'], p['y'])
1 2
>>> sys.getsizeof(p) # the output below for python 3.10 (64bit)
32

Using dataobject-based classes for recursive data without reference cycles

There is the option deep_dealloc (default value is False) for deallocation of recursive data structures. Let's consider a simple example:

class LinkedItem(dataobject):
    val: object
    next: 'LinkedItem'

class LinkedList(dataobject, deep_dealloc=True):
    start: LinkedItem = None
    end: LinkedItem = None

    def append(self, val):
        link = LinkedItem(val, None)
        if self.start is None:
            self.start = link
        else:
            self.end.next = link
        self.end = link

Without deep_dealloc=True, deallocation of a LinkedList instance may fail if the linked list is too long. This can be worked around with a __del__ method that clears the linked list:

def __del__(self):
    curr = self.start
    while curr is not None:
        next = curr.next
        curr.next = None
        curr = next

When deep_dealloc=True, there is a built-in, faster deallocation method that uses the finalization mechanism. In that case a __del__ method for clearing the linked list is not needed.

Note that for classes with gc=True this method is disabled: Python's cyclic GC is used in those cases.
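The iterative clearing idea behind that __del__ can be exercised with a plain-Python sketch (this Node class is a stand-in for illustration, not part of recordclass):

```python
class Node:
    __slots__ = ('val', 'next')
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

# Build a linked list of 10_000 nodes.
head = None
for i in range(10_000):
    head = Node(i, head)

# Unlink nodes one by one, so dropping `head` never has to tear down
# a deeply nested chain in a single recursive deallocation.
count = 0
curr = head
while curr is not None:
    nxt = curr.next
    curr.next = None
    curr = nxt
    count += 1
print(count)
```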

For more details see notebook example_datatypes.
