qiboteam / qibocal

Quantum calibration, characterization and validation module for Qibo.
https://qibo.science
Apache License 2.0

Release issues #249

Closed · scarrazza closed this 1 year ago

scarrazza commented 1 year ago

Let me collect some issues which are blocking the first release and should be addressed as soon as possible:

The first and last points are critical for the moment.

alecandido commented 1 year ago

We cannot have a PyPI package which includes pint-pandas master as a dependency:

The latest commit was on November 24th, and the latest release, v0.3, on November 15th; the release is also on PyPI:

https://github.com/hgrecco/pint-pandas/commits/master https://pypi.org/project/Pint-Pandas/

Since the tag, the source has not been touched (judging from the commit messages), so v0.3 should be perfectly equivalent:

https://github.com/hgrecco/pint-pandas/commits/master/pint_pandas/pint_array.py

alecandido commented 1 year ago

However, given the popularity and maintenance status of that project (https://github.com/hgrecco/pint-pandas/graphs/contributors), I would trim this dependency later on.

Software reuse is a great thing, but for a long-term project it is worth keeping only very reliable dependencies.

Concerning pandas itself: it is a very reliable project, but I would keep it only in the user-facing part, since I don't believe you gain much here from DataFrame manipulations, e.g. inside the calibrations subpackage. I would just replace it with dedicated objects containing NumPy arrays.
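
For concreteness, the kind of "dedicated object" I have in mind could be as simple as a plain container of named NumPy arrays; a minimal sketch, with hypothetical field names:

from dataclasses import dataclass

import numpy as np

@dataclass
class SweepData:
    # Hypothetical fields for a spectroscopy sweep: the point is just
    # "named NumPy arrays" instead of a DataFrame.
    freq: np.ndarray
    msr: np.ndarray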

However, all of this will naturally come as part of #233 and the auto-calibration implementation anyhow. So, I would not do it immediately, and would limit this release to switching to the Pint-Pandas v0.3 dependency.

alecandido commented 1 year ago

Tests are still failing in the main branch, see here

Regarding the failing test: maybe there is an easier solution for that test in particular, but a general one when NumPy arrays are involved is to take the binary dump of the array and encode it in a base64 string (all doable with NumPy and the standard library).

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tobytes.html https://docs.python.org/3/library/base64.html#base64.b64encode

import base64
import numpy as np

a = np.arange(10, dtype=np.complex128)
# array([0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j, 7.+0.j,
#        8.+0.j, 9.+0.j])

s = base64.b64encode(a.tobytes()).decode()
# 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAPA/AAAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAACEAAAAAAAAAAAAAAAAAAABBAAAAAAAAAAAAAAAAAAAAUQAAAAAAAAAAAAAAAAAAAGEAAAAAAAAAAAAAAAAAAABxAAAAAAAAAAAAAAAAAAAAgQAAAAAAAAAAAAAAAAAAAIkAAAAAAAAAAAA=='

b = np.frombuffer(base64.b64decode(s.encode()), dtype=np.complex128)
# array([0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j, 7.+0.j,
#       8.+0.j, 9.+0.j])

assert np.allclose(a, b)
alecandido commented 1 year ago

The only problem with the former solution is that shape and dtype information is lost. A more complete solution replaces the simplified binary dump of .tobytes() with np.save into an in-memory buffer:

from io import BytesIO

buf = BytesIO()
np.save(buf, a)
buf.seek(0)
s = base64.b64encode(buf.read()).decode()
#  'k05VTVBZAQB2AHsnZGVzY3InOiAnPGMxNicsICdmb3J0cmFuX29yZGVyJzogRmFsc2UsICdzaGFwZSc6ICgxMCwpLCB9ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIAoAAAAAAAAAAAAAAAAAAAAAAAAAAAAA8D8AAAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAIQAAAAAAAAAAAAAAAAAAAEEAAAAAAAAAAAAAAAAAAABRAAAAAAAAAAAAAAAAAAAAYQAAAAAAAAAAAAAAAAAAAHEAAAAAAAAAAAAAAAAAAACBAAAAAAAAAAAAAAAAAAAAiQAAAAAAAAAAA'
# in bytes it would look like:
#  b'\x93NUMPY\x01\x00v\x00{\'descr\': \'<c16\', \'fortran_order\': False, \'shape\': (10,), }                                                          \n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x1c@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 @\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"@\x00\x00\x00\x00\x00\x00\x00\x00'

b = np.load(BytesIO(base64.b64decode(s.encode())))
assert np.allclose(a, b)

https://docs.python.org/3/library/io.html#binary-i-o https://numpy.org/doc/stable/reference/generated/numpy.save.html#numpy.save

Still using only the standard library and NumPy, with no explicit information on shape or dtype required.

And a bit of metrics: base64 encodes every 3 input bytes as 4 output characters, so the payload grows by about +33%, and the .npy format adds a constant-size header on top (on the order of a hundred bytes).

At this price (essentially only the +33% for bigger arrays) you can encode any array, preserving all its information, in a safe string. The constant overhead becomes critical if you start encoding many scalars separately; at that point it is much more convenient to use .tobytes() directly and store the dtype information somewhere else (e.g. as a convention between dumper and loader).

If space is critical, base85 limits the increase to +25%, and it is also available in the standard library: https://docs.python.org/3/library/base64.html#base64.b85encode
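
For reference, a minimal sketch measuring those overheads (the array size is arbitrary):

import base64
from io import BytesIO

import numpy as np

a = np.arange(1000, dtype=np.complex128)
raw = a.tobytes()

buf = BytesIO()
np.save(buf, a)
npy = buf.getvalue()

print(len(raw))                    # 16000 raw bytes
print(len(base64.b64encode(raw)))  # ~ +33%: 4 characters per 3 bytes
print(len(base64.b85encode(raw)))  # ~ +25%: 5 characters per 4 bytes
print(len(npy) - len(raw))         # the constant .npy header overhead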

stavros11 commented 1 year ago

However, given the popularity and maintenance of that project (https://github.com/hgrecco/pint-pandas/graphs/contributors), I would trim this dependency later on.

Concerning pint, let me add that from a usability perspective I found it adds some complexity which I do not think is outweighed by the benefits it provides. My experience with qibocal as a user is limited, but I believe the people in the lab share the same opinion. I would agree to consider removing pint at some point, though certainly not for the first release. Maybe we should open another issue about this.
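
For context, this is not qibocal code, just plain pint, but it shows the wrapping and unwrapping every value goes through:

import pint

ureg = pint.UnitRegistry()

freq = 7.5 * ureg.GHz     # every value becomes a Quantity
print(freq.to(ureg.MHz))  # 7500.0 megahertz
print(freq.magnitude)     # 7.5 -- the bare value has to be unwrapped again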

Concerning pandas itself: this is a very reliable project, but I would keep in the user facing part, since I don't believe here you gain much from Dataframe manipulations, e.g. inside the calibrations subpackage. I would just replace with dedicated objects containing Numpy arrays.

As you say, the main place where pandas is used is the plotting functions and maybe the fitting, which is post-processing, not the data acquisition (which is the calibrations). While I am not very familiar with the plots and fitting, when I played a bit with the custom numpy dtype that you proposed for the ExecutionResults in qibolab, @AleCandido, I thought it could be a good candidate for the data acquisition here. For example in #195, it would make the data saving much simpler in some cases, such as https://github.com/qiboteam/qibocal/blob/374109e87b5b32f392222110471227b88141a534/src/qibocal/calibrations/characterization/resonator_spectroscopy.py#L551

where in columns like the qubit or frequency/bias, the same value needs to be cast across multiple rows. Deciding about pandas is probably less trivial than pint, though.
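
To illustrate the idea, a hedged sketch of such a structured dtype (field names are hypothetical, not the actual qibocal schema):

import numpy as np

# One record per acquired point; scalars like the qubit id are broadcast
# over the whole column by NumPy itself, no DataFrame needed.
point = np.dtype([("qubit", np.int64), ("freq", np.float64), ("msr", np.float64)])

data = np.zeros(3, dtype=point)
data["qubit"] = 2                     # scalar broadcast to all rows
data["freq"] = [7.1e9, 7.2e9, 7.3e9]
data["msr"] = [0.1, 0.4, 0.2]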

scarrazza commented 1 year ago

@andrea-pasquale could you please update poetry, including the latest pint tag?

andrea-pasquale commented 1 year ago

@andrea-pasquale could you please update poetry, including the latest pint tag?

I've opened #250 to address this point. I was playing around with the code on Saturday in order to fix the tests. To get a clearer picture I tried to run the workflows multiple times.

[ ] Tests are still failing in the main branch, see here

For this one I had a look, and I think a possible solution is to make sure that we are plotting only real-valued variables.
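
As a sketch of what I mean (hypothetical data, not the actual plotting code), complex results can be reduced to real quantities before they reach the plots:

import numpy as np

signal = np.array([1 + 1j, 2 - 1j, 3 + 0.5j])  # hypothetical complex results

magnitude = np.abs(signal)  # real-valued, safe to plot
phase = np.angle(signal)    # real-valued, safe to plot
i, q = signal.real, signal.imag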

I think there are still some assertions which are a bit weird and tend to fail randomly. I will have a look, but if the coverage does not decrease we can simply remove them.

I will open a draft PR to debug tests so everyone can have a look.

andrea-pasquale commented 1 year ago

Regarding pint, I also think that it is not super useful, given that we are not making heavy use of the features offered by the library. At the beginning I thought it would be useful mainly to store units of measure and to be able to convert between them.

Pint has also generated several issues in the code, particularly when dealing with data saving and loading. It also has quite a rigid structure.

Having said that, I'm also in favor of removing pint as soon as possible. A custom dtype should be perfectly able to do the job.
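
For instance, a hedged sketch, assuming we only need to remember the unit rather than convert it: NumPy dtypes can carry an arbitrary metadata dict.

import numpy as np

# The metadata dict is attached to the dtype but never interpreted by NumPy.
freq_t = np.dtype(np.float64, metadata={"unit": "GHz"})
freq = np.array([7.1, 7.2, 7.3], dtype=freq_t)
print(freq.dtype.metadata["unit"])  # 'GHz'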

alecandido commented 1 year ago

Having said that, I'm also in favor of removing pint as soon as possible.

I agree, but I'd just try to release the code as is right now, and strip pint in a later release.

andrea-pasquale commented 1 year ago

Can we close this issue, @scarrazza, or is there something else that needs to be done before the first release?

scarrazza commented 1 year ago

Yes, we can close; later today I will prepare the new release. Thank you all.