Could not find heap data for the \'MATRIX\' variable-length array column.

volodymyrss commented 1 month ago

An issue observed in isgri-expert workflow by @ferrigno . Might have something to do with new python? Or not.

https://cdci.sentry.io/issues/5783401903/?alert_rule_id=723253&alert_timestamp=1725201418390&alert_type=email&environment=production&notification_uuid=18b8fc99-5748-4013-b594-1cf172b92a6c&project=1467382&referrer=alert_email

ferrigno commented 1 month ago

I have not this issue in my virtual environment. Which is the docker running the service?

volodymyrss commented 1 month ago

I have not this issue in my virtual environment. Which is the docker running the service?

This is not happening in your service.

burnout87 commented 1 month ago

Could it be something related to the astropy version?

volodymyrss commented 1 month ago

I would guess possibly related to astropy, numpy, and python versions.

burnout87 commented 2 weeks ago

I reproduced it locally, and it happens here https://github.com/oda-hub/oda_api/blob/02e47e3670542e763f381364c4a87de99a2a9de9/oda_api/data_products.py#L310

burnout87 commented 2 days ago

After some more debugging with @ferrigno , we think we figured out where the problem is.

It seems to be related to the decode functionality of the class NumpyDataUnit, when called inside the decode function of the NumpyDataProduct class.

We tried to go through each step, and it seems that there is a problem the moment we try to pickle.loads the binary data of the encoded object.

https://github.com/oda-hub/oda_api/blob/ec280c31678356a6144ea4250f0fdbc9021b90a6/oda_api/data_products.py#L474

While researching online, it seems that "pickle isn't supported for HDUs ..." which might explains our issue.

https://github.com/astropy/astropy/issues/15017

What do you think? Perhaps an options is to rewrite those functions, or use a different approach

volodymyrss commented 2 days ago

It is not supported in a sense that nobody guarantees it works. Since we really need serialization of fits objects, it's our responsibility to make sure it does. In the past we made changes and contributions to public packages to make sure this functionality works.

I propose you first of all make a narrow test in oda-api to demonstrate exactly the serialization problem.

Then we see how to best fix it. Actually maybe in our case we can avoid using BufferedReader since we need the entire content at once anyway. May be a way of opening/creating the FITS file/HDU.

andreatramacere commented 2 days ago

This morning I had a zoom meeting with @burnout87, I made a test with ISGRI rmf files, with the same structure as in the IBIS one, at least for the column storing the RMF matrix. The encoding/decoding workflow works as expected, indeed, if that would not work, the entire workflow tested in the oda_api would fail. This means that the issue could be in the IBIS rmf file or in the astropy.fits module. @ferrigno did you ever test IBIS rmf against oda_api? Is it possible that software writing the IBIS rmf uses a different approach or a different cfitsio version compared to the one used in ISGRI? Another possibility could be related to the size of the matrix exceed some buffer. By the way I guess IBIS and ISGRI are different instrument

Probably, the best approach, would be to write fake RMF files, increasing incrementally the size of the matrix column (both in number of rows, and element per row) and check if the problem shows up above a given size

volodymyrss commented 2 days ago

This morning I had a zoom meeting with @burnout87, I made a test with ISGRI rmf files, with the same structure as in the IBIS one, at least for the column storing the RMF matrix. The encoding/decoding workflow works as expected, indeed, if that would not work, the entire workflow tested in the oda_api would fail. This means that the issue could be in the IBIS rmf file or in the astropy.fits module. @ferrigno did you ever test IBIS rmf against oda_api? Is it possible that software writing the IBIS rmf uses a different approach or a different cfitsio version compared to the one used in ISGRI? Another possibility could be related to the size of the matrix exceed some buffer. By the way I guess IBIS and ISGRI are different instrument

Probably, the best approach, would be to write fake RMF files, increasing incrementally the size of the matrix column (both in number of rows, and element per row) and check if the problem shows up above a given size

ISGRI is part of IBIS. ISGRI is the only part of IBIS we use in ODA. For us IBIS==ISGRI.

There can be differences, especially in what concerns variable size arrays. Which is why I ask @burnout87 to please add a test for the file @ferrigno had trouble with.

I agree that there must be something peculiar or problematic about this file. And indeed we were writing and passing this kind of files RMF since the very beginning.

ferrigno commented 1 day ago

All concurs towards the fact that one needs to re-engineer the oda_api data_product handling with numpy>=2.0. Now astropy is compatible with numpy>=2.0.0, but oda_api not anymore. One needs to make a completely clean installation to find it out.

volodymyrss commented 1 day ago

All concurs towards the fact that one needs to re-engineer the oda_api data_product handling with numpy>=2.0. Now astropy is compatible with numpy>=2.0.0, but oda_api not anymore. One needs to make a completely clean installation to find it out.

Completely clean installation is in the CI in the actions. Did you already try to add the test?

ferrigno commented 1 day ago

Not finished to make the test, maybe later.

volodymyrss commented 1 day ago

It remains unclear to me why it's hard to explain that starting the the test is the best way to achieve clean result which can be discussed and communicated easily.

ferrigno commented 1 day ago

test added https://github.com/oda-hub/oda_api/blob/d2036162e47cc31c71cc3bfa04e9ad4b25771bed/tests/test_data_products.py#L40 it fails with : astropy version 6.1.3 numpy version 1.26.4 oda_api Version: 1.2.28

ferrigno commented 22 hours ago

See the related PR https://github.com/oda-hub/oda_api/pull/286 Reproduced the failure in the test on CI

oda-hub / oda_api

Could not find heap data for the \'MATRIX\' variable-length array column. #288