ledatelescope / bifrost_tcc_wrapper

A wrapper around the Tensor Core Correlator for the Bifrost framework
BSD 3-Clause "New" or "Revised" License
1 stars 2 forks source link

plugin issue `expected LP_struct_BFarray_ instance instead of struct_BFarray_` #1

Open telegraphic opened 2 years ago

telegraphic commented 2 years ago

(note: this is 'solved', but I'm writing up here for prosperity and discussion)

@ldryan0 and I have been working on getting the Bifrost TCC plugin working on the topaz supercomputer.

To get it to compile in py 3.6, with the autoconf branch required some makefile edits (we will PR these once we've tidied it up).

Once compiled, we ran into an 'interesting' error when running it:

ctypes.ArgumentError: argument 2: <class 'TypeError'>: 
expected LP_struct_BFarray_ instance instead of struct_BFarray_

This turns out to be due to the definition of class struct_BFarray_(Structure) in btcc_generated.py, which even though it looks identical the definition of class struct_BFarray_ in bifrost.libbifrost_generated, is being considered a different class by ctypes.

Our fix was straightforward in the end: remove this from btcc_generated.py:

# ../src/bifrost/array.h: 128
class struct_BFarray_(Structure):
    pass

struct_BFarray_.__slots__ = [
    'data',
    'space',
    'dtype',
    'ndim',
    'shape',
    'strides',
    'immutable',
    'big_endian',
    'conjugated',
]
struct_BFarray_._fields_ = [
    ('data', POINTER(None)),
    ('space', BFspace),
    ('dtype', BFdtype),
    ('ndim', c_int),
    ('shape', c_long * int(BF_MAX_DIMS)),
    ('strides', c_long * int(BF_MAX_DIMS)),
    ('immutable', BFbool),
    ('big_endian', BFbool),
    ('conjugated', BFbool),
]

BFarray = struct_BFarray_# ../src/bifrost/array.h: 128

and replace with:

from bifrost.libbifrost_generated import struct_BFarray_
BFarray = struct_BFarray_

@jaycedowell: I'm assuming you didn't have to do this?

Background debug info

For some more debug information, the error was being raised here in btcc_genereatd.py:

    BTccExecute = _lib.BTccExecute
    # print("HERE", type(BFarray), type(POINTER(BFarray)))
    BTccExecute.argtypes = [btcc, POINTER(BFarray), POINTER(BFarray), BFbool]
    BTccExecute.restype = BFstatus

The LP_ prefix in LP_struct_BFarray_ is ctypes telling us it is a pointer to the struct_BFarray,and we are indeed passing POINTER(BFarray) to BTccExecute.argtypes. For example, changing the code to POINTER(POINTER(BFarray)) will result in LP_LP_struct_BFarray_.

The expected LP_struct_BFarray_ instance instead of struct_BFarray_ error message is misleading here, removing the POINTER() and running the code will result in a segfault (we do want to pass a pointer after all). The cryptic hint that helped me solve it was when after some trial and error the traceback said:

expected struct_BFarray_ instance instead of struct_BFarray_

!

jaycedowell commented 2 years ago

I did not but I've also only tested it under Python2.Dinosaur so this is useful feedback. It will also help with the development of plugin-wrapper since BTCC is based on that.

jaycedowell commented 2 years ago

Actually, could you transfer this issue over to BTCC so I can keep track of it?

jaycedowell commented 2 years ago

That seems to have worked.

telegraphic commented 2 years ago

Yup, looks like it! (I can't see the transfer button so guessing I don't have right permissions)

jaycedowell commented 2 years ago

I started looking at this today since I'm working on moving ADP over to Py3 but I'm just not hitting this problem. The strange thing is that my btcc_generated.py doesn't have any reference to a # ../src/bifrost/array.h: 128 in it or any of those struct_BFarray_ declarations. What does your call to make_bifrost_plugin.py look like?

telegraphic commented 2 years ago

In btcc_generated.py, the header is:

'''Wrapper for btcc.h

Generated with:
-c -lbtcc -I. -Igroup/director2183/dancpr/src/bifrost_tcc_wrapper/src btcc.h -o btcc_generated.py

Do not modify this file.
'''

libbtcc was generated with:

nvcc -Xcompiler=-Wall,-Winvalid-pch,-Wnon-virtual-dtor -g -G -g -Xcompiler=-fPIC -I/pawsey/centos7.6/devel/binary/cuda/11.1/include -I/group/director2183/dancpr/software/centos7.6/apps/cascadelake/gcc/8.3.0/bifrost/0.0.2/include -I/group/director2183/dancpr/src/tensor-core-correlator/libtc    c -I/group/director2183/dancpr/src/tensor-core-correlator/libtcc -I/group/director2183/dancpr/src/tensor-core-correlator/util -I/group/director2183/dancpr/src/tensor-core-correlator/util -I/group/director2183/dancpr/src/tensor-core-correlator -I/group/director2183/dancpr/src/tensor-core-correlator -I.. -I    . -Ilibbtcc.so.0.0.2.p -Ilibbtcc.so.0.0.2.p -o libbtcc.so.0.0.2.p/btcc.cu.o -c ../btcc.cu

It is a pain to install things on the supercomputer, so my most recent setup is using the meson build system (which I quite like). bifrost itself is built using a strange Pawsey-specific build system called MAALI. However I recall hitting this error when compiling on a more regular GPU system. I'll try and find a minimal example to recreate