Complex numbers (in ak.from_iter and elsewhere)

sjperkins commented 4 years ago

Probably low priority, as awkward1.from_numpy does.

>>> import awkward1 as ak
>>> import numpy as np

>>> ak.from_iter([0j])

~/venv/awkward/lib/python3.6/site-packages/awkward1/operations/convert.py in from_iter(iterable, highlevel, behavior, allow_record, initial, resize)
    359     out = awkward1.layout.ArrayBuilder(initial=initial, resize=resize)
    360     for x in iterable:
--> 361         out.fromiter(x)
    362     layout = out.snapshot()
    363     if highlevel:

ValueError: cannot convert 0j (type complex) to an array element

>>> ak.from_numpy(np.asarray([0j]))
 <Array [0j] type='1 * complex128'>

jpivarski commented 4 years ago

The unhandled dtypes are

float16: need to find a C++ library for it
float128: same problem
complex64, complex128: C++'s built-in std::complex can do this; it's just a matter of filling it in everywhere
complex256: complex and non-standard floating-point type
datetime64, timedelta64: these require a little more work than the above, since we need to work around the buffer protocol, which doesn't recognize these, and they also have special rules in reducers (can't add datetimes to each other, but can add timedeltas; multiplication doesn't make sense for either). Nevertheless, this is a somewhat higher priority, since dates and times are ubiquitous in Pandas data.
fixed-width bytestring and string types: variable-length string objects are more common in Pandas data, and they're handled as non-primitive lists with parameter["__array__"] == "bytestring" or "string". The fixed-width bytestrings would come up if someone were analyzing 6-byte MAC addresses or something.

Other than these dtypes, NumPy arrays can contain Python objects (which we won't ever do), record stuctures, which we convert to Awkward records, masked-out data, which we convert to option-type, and non-trivial shapes, which we convert to regular-length lists.

You say that complex numbers are a low priority. But if there's ever a strong need for them, they can be added by following the pattern set by the other types (i.e. search for all "float32" and add cases for the complex numbers). It might even be a "good first issue."

jpivarski commented 4 years ago

I totally missed the fact that this had been on the Awkward 0 repository. New development is in Awkward 1 only.

(All the things I said about handling NumPy dtypes only applies to Awkward 1; Awkward 0 isn't anywhere near as advanced.)

sjperkins commented 4 years ago

Thanks for the detailed response @jpivarski and all the work that you've put in!

You say that complex numbers are a low priority. But if there's ever a strong need for them, they can be added by following the pattern set by the other types (i.e. search for all "float32" and add cases for the complex numbers). It might even be a "good first issue."

I initially regarded it as low priority because in the example I posted it seemed possible to construct an awkward array from a numpy array of complex numbers. Thus, a workaround seemed possible:

>>> A = ak.from_numpy(np.asarray([0j, 1j]))
<Array [0j, 1j] type='2 * complex128'>
>>> ak.to_arrayset(A)
({
     "class": "NumpyArray",
     "itemsize": 16,
     "format": "Zd",
     "primitive": "complex128",
     "form_key": "node0"
 },
 {'node0': array([0.+0.j, 0.+1.j])},
 None)

However, from your response I understand that you're saying that the C++ layer needs to change to handle complex dtype's? If so, I'd probably be keen to make that change. I'll take a look at the code.

I'm coming from a Radio Astronomy context where the observational data is complex. It's ingested as dense arrays, but more recent averaging algorithms can make it ... awkward ;-).

jpivarski commented 4 years ago

In that case, I can give you some pointers, so that you can evaluate how big of a project it will be.

For the complex types, there are fortunately stubs everywhere: they throw std::runtime_errors, but that makes it convenient to find all the places where something needs to be inserted.

% fgrep -r complex128 include src tests
include/awkward/util.h:        complex128,
src/python/forms.cpp:        case ak::util::dtype::complex128:
src/awkward1/operations/structure.py:    numpy.dtype(numpy.complex128): "complex128",
src/libawkward/Content.cpp:                 util::dtype_to_format(util::dtype::complex128),
src/libawkward/Content.cpp:                 util::dtype::complex128);
src/libawkward/util.cpp:      else if (name == "complex128") {
src/libawkward/util.cpp:        return util::dtype::complex128;
src/libawkward/util.cpp:      case util::dtype::complex128:
src/libawkward/util.cpp:        return "complex128";
src/libawkward/util.cpp:        return dtype::complex128;
src/libawkward/util.cpp:      case dtype::complex128:
src/libawkward/util.cpp:      case dtype::complex128:
src/libawkward/util.cpp:      case dtype::complex128:
src/libawkward/array/NumpyArray.cpp:        case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:          throw std::runtime_error("FIXME: complex128 to JSON");
src/libawkward/array/NumpyArray.cpp:            dtype_ == util::dtype::complex128  ||
src/libawkward/array/NumpyArray.cpp:            rawother->dtype() == util::dtype::complex128  ||
src/libawkward/array/NumpyArray.cpp:      else if (dtype_ == util::dtype::complex128  ||
src/libawkward/array/NumpyArray.cpp:               rawother->dtype() == util::dtype::complex128) {
src/libawkward/array/NumpyArray.cpp:        dtype = util::dtype::complex128;
src/libawkward/array/NumpyArray.cpp:        dtype = util::dtype::complex128;
src/libawkward/array/NumpyArray.cpp:      // to complex128
src/libawkward/array/NumpyArray.cpp:      case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:        throw std::runtime_error("FIXME: merge to complex128 not implemented");
src/libawkward/array/NumpyArray.cpp:      case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:        throw std::runtime_error("FIXME: reducers on complex128");
src/libawkward/array/NumpyArray.cpp:      case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:        throw std::runtime_error("FIXME: sort for complex128 not implemented");
src/libawkward/array/NumpyArray.cpp:      case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:        throw std::runtime_error("FIXME: argsort for complex128 not implemented");
src/libawkward/array/NumpyArray.cpp:      case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:        throw std::runtime_error("FIXME: numbers_to_type for complex128 not implemented");
src/libawkward/array/NumpyArray.cpp:    case util::dtype::complex128:
src/libawkward/array/NumpyArray.cpp:      throw std::runtime_error("FIXME: as_type for complex128 not implemented");

I just looked at these search results in context and none of them need any work except the ones in NumpyArray.cpp. NumpyArray.cpp needs:

a conversion of complex numbers to and from JSON: what would that be? Is there a standard, like {"real": #, "imag": #} or maybe {"r": #, "i": #}? Do other libraries converge on some convention?
both NumpyArray::merge and NumpyArray::numbers_to_type call kernel::NumpyArray_fill<X, Y> (the latter does so indirectly), and this needs to be specialized to include complex cases in the X and Y. That goes into kernel functions named awkward_NumpyArray_fill_toY_fromX, which are described in more detail below.
the reducers need complex cases, which is a straightforward extrapolation of the floating point ones: count, count_nonzero (implementations don't depend on the type; it's just a function signature), sum, prod (add the function signatures and the compiler does the work), any, all (have to determine whether a complex number is zero or not, but otherwise formulaic), min, max, argmin, argmax (can raise std::invalid_argument("don't do that!") because complex numbers are unordered).
there's a stub for sorting and arg-sorting complex numbers, but this can raise std::invalid_argument("don't do that!") because complex numbers are unordered.

Kernel functions separate code that actually touches the arrays from all the rest of the codebase. They're a separate layer in the architecture:

and are, in fact, a separate shared library that can be dynamically loaded (libawkward-cpu-kernels.so and libawkward-cuda-kernels.so). The actual implementations of these functions (filling arrays and counting, summing, multiplying) are very simple and are templated by numerical type:

https://github.com/scikit-hep/awkward-1.0/blob/029ae3991849d391733043db5e87389e333638fc/src/cpu-kernels/operations.cpp#L1136-L1146

However, that extern "C" interface between the kernels layer and the C++ layer means that all of the kernel cases have to be given explicit names that dynamic library loaders can use to identify them. That means there's a lot of boilerplate, linking named functions to template specializations:

https://github.com/scikit-hep/awkward-1.0/blob/029ae3991849d391733043db5e87389e333638fc/src/cpu-kernels/operations.cpp#L1147-L1162

On the other side of the extern "C" interface, we curb this insanity by gathering them back up again into template instantiations that are easier to use in the C++ codebase:

https://github.com/scikit-hep/awkward-1.0/blob/029ae3991849d391733043db5e87389e333638fc/src/libawkward/kernel.cpp#L7073-L7095

(so the number of differently named functions explodes only at the interface). Why do we need an extern "C" interface if it's causing all this boilerplate? First of all, because of that ptr_lib == kernel::lib::cuda case in the above code; it lets us swap CPU-bound implementations for CUDA implementations at runtime, based on whether we get the array from NumPy or from CuPy. (The project we're looking at here is just the CPU-bound implementation—we're trying to auto-generate CUDA kernels from the CPU ones, and whatever you write would be part of that automatic translation.) Beyond swapping out implementations below the extern "C" interface, it also lets us swap out implementations above it, if someone's interested in building a Julia/Rust/Swift/whatever implementation of Awkward Array.

The only extra issue that complex numbers introduces is that std::complex is not extern "C" compatible, so you'd have to reinterpret_cast a std::complex<double> array of length n as a double array of length n * 2 before passing it through the interface. C++ types like std::complex can be used on both sides of the extern "C" interface, just not through it. Thus, you don't need to define type conversion or complex addition/multiplication inside the cpu-kernels, you can reinterpret_cast that double array of length n * 2 as a std::complex<double> array of length n and then the compiler takes over.

But I apologize in advance about the boilerplate. I've been generating code with Emacs keyboard macros...

sjperkins commented 4 years ago

xref complex64 + complex128 support in Arrow: https://issues.apache.org/jira/browse/ARROW-638

jpivarski commented 4 years ago

My reading of that is that it's legal in the format, but hasn't necessarily been implemented yet. The discussion in the comments are just ideas about how to do it in C++; maybe this is followed up on in another issue.

Nevertheless, I need to get complex numbers and datetimes into Awkward anyway. When C++ Arrow-Parquet is ready (maybe already), then we can take advantage of it, but these types would be useful in Awkward just for computation, regardless of whether they can be saved into Parquet files.

jpivarski commented 3 years ago

Hi @sjperkins! Let me know if you're still working on this (complex numbers in Awkward Array).

If not, I'll move it out of "in progress" and see when I can find some time to work on it.

Thanks!

sjperkins commented 3 years ago

Hi @jpivarski.

Briefly, I'm not working on this currently (I'm away from my laptop), but would like to reconnect and discuss what I'll be working in in the new year and it's relation to awkward.

I should be back from 4th.

sjperkins commented 3 years ago

Here's the python file I was using to stamp out the headers, implementations and kernels. It may be useful.

import argparse
from itertools import product
from string import Template

np_to_cpp_map = {
    'bool': 'bool',
    'int8': 'int8_t',
    'int16': 'int16_t',
    'int32': 'int32_t',
    'int64': 'int64_t',

    'uint8': 'uint8_t',
    'uint16': 'uint16_t',
    'uint32': 'uint32_t',
    'uint64': 'uint64_t',

    'float32': 'float',
    'float64': 'double',
    'complex64': 'std::complex<float>',
    'complex128': 'std::complex<double>'
}

kernel_fill_template = Template('''template <>
    ERROR NumpyArray_fill<${FROM}, ${TO}>(
        kernel::lib ptr_lib,
        ${TO} *toptr,
        int64_t tooffset,
        const ${FROM} *fromptr,
        int64_t length) {
        if (ptr_lib == kernel::lib::cpu) {
            return awkward_NumpyArray_fill_to${NPTO}_from${NPFROM}(
                toptr,
                tooffset,
                fromptr,
                length);
        }
        else if (ptr_lib == kernel::lib::cuda) {
            throw std::runtime_error(
                std::string("not implemented: ptr_lib == cuda_kernels "
                            "for NumpyArray_fill<${TO}, ${FROM}>"
                            + FILENAME(__LINE__)));
        }
        else {
            throw std::runtime_error(
                std::string("unrecognized ptr_lib "
                            "for NumpyArray_fill<${TO}, ${FROM}>"
                            + FILENAME(__LINE__)));
        }
    }''')

oph_fill_template = Template('''/// @param toptr outparam
  /// @param tooffset inparam role: IndexedArray-index-offset
  /// @param fromptr inparam role: NumpyArray-ptr
  /// @param length inparam
  EXPORT_SYMBOL struct Error
  awkward_NumpyArray_fill_to${NPTO}_from${NPFROM}(
    ${TO} * toptr,
    int64_t tooffset,
    const ${FROM} * fromptr,
    int64_t length);''')

opcpp_file_template = Template('''    ERROR
    awkward_NumpyArray_fill_to${NPTO}_from${NPFROM}(${TO}* toptr,
                                            int64_t tooffset,
                                            const ${FROM}* fromptr,
                                            int64_t length) {
    return awkward_NumpyArray_fill(toptr, tooffset, fromptr, length);
    }''')

def create_parser():
    p = argparse.ArgumentParser()
    p.add_argument("file", choices=["oph", "opcpp", "kernel"])
    return p

if __name__ == "__main__":
    args = create_parser().parse_args()

    from_to = list(product(np_to_cpp_map.keys(), np_to_cpp_map.keys()))

    for np_from_type, np_to_type in from_to:
        cpp_from_type = np_to_cpp_map[np_from_type]
        cpp_to_type = np_to_cpp_map[np_to_type]

        if args.file == "oph":
            template = oph_fill_template
        elif args.file == "opcpp":
            template = opcpp_file_template
        elif args.file == "kernel":
            template = kernel_fill_template
        else:
            raise ValueError(f"Invalid output file {args.file}")

        fn = template.substitute(TO=cpp_to_type, FROM=cpp_from_type,
                                 NPTO=np_to_type, NPFROM=np_from_type)

        print(fn)

jpivarski commented 3 years ago

This should have been linked to #652: it's done now!

scikit-hep / awkward

Complex numbers (in ak.from_iter and elsewhere) #392