rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support imaginary numbers in cuDF dataframes #11983

Open arpan-das-astrophysics opened 2 years ago

arpan-das-astrophysics commented 2 years ago

Hello, I am trying to store some imaginary numbers as a cuDF dataframe column. Each column cell is a list of imaginary numbers. What would be the most efficient way to do this, given that cuDF dataframes don't support imaginary numbers? Separating the real and imaginary parts is what I am doing now, but this is a huge dataset and it is taking a lot of time.

shwina commented 2 years ago

> Separating the real and imaginary parts is what I am doing now, but this is a huge dataset and it is taking a lot of time.

As we don't support a complex type in cuDF, unfortunately that's the best approach I can think of as well. I understand that things are further complicated because what you want is a List[complex] data type.
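For readers following along, the split approach discussed above can be sketched roughly as follows. This is a minimal NumPy-only illustration, not cuDF code: the helper names `split_complex` and `join_complex` are hypothetical, and the assumption is that the two resulting float arrays would each be stored as an ordinary float column.

```python
import numpy as np


def split_complex(values):
    """Split a complex array into two contiguous float arrays (real, imag).

    Hypothetical helper for illustration; each output could be stored
    as a plain float column.
    """
    arr = np.asarray(values, dtype=np.complex128)
    return np.ascontiguousarray(arr.real), np.ascontiguousarray(arr.imag)


def join_complex(real, imag):
    """Rebuild a complex128 array from separate real and imaginary parts."""
    real = np.asarray(real, dtype=np.float64)
    imag = np.asarray(imag, dtype=np.float64)
    out = np.empty(real.shape, dtype=np.complex128)
    out.real = real
    out.imag = imag
    return out


data = np.array([1 + 2j, 3 - 4j, 0.5j], dtype=np.complex128)
real, imag = split_complex(data)
restored = join_complex(real, imag)
```

Both directions copy the data, which is consistent with the cost described in the original report for a large dataset.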

Can you share an example of the kind of operations you wish to perform? Perhaps we can suggest adequate workarounds.

shwina commented 2 years ago

I imagine this is yet another use-case for something like Awkward Arrays on the GPU. FYI @gmarkall, and also I hope you don't mind the cc @jpivarski :)

arpan-das-astrophysics commented 2 years ago

I have this data format, where the "DATA" column values are lists of imaginary numbers. I was trying to store them in a similar way in a cuDF dataframe. I think I found another workaround: convert the complex128 values to strings so that cuDF reads them as a list of strings rather than complex numbers, then convert them back to complex numbers when I read them for an operation.

[Screenshot: Screen Shot 2022-10-25 at 4 21 28 PM]

GregoryKimball commented 1 year ago

Hello @arpan-das-astrophysics you mentioned storing imaginary numbers as floats in #12104 and as strings here in #11983. Would you please share a bit more about the processing steps you would like to apply to the List[complex] data?

arpan-das-astrophysics commented 1 year ago

> Hello @arpan-das-astrophysics you mentioned storing imaginary numbers as floats in #12104 and as strings here in #11983. Would you please share a bit more about the processing steps you would like to apply to the List[complex] data?

Hi Gregory, thank you for looking into this. Initially I was using List[complex]; however, that is still a memory efficient conversion. The best way I found is to cast the complex array into adjacent (real, imag) floats, which in principle shouldn't take any additional memory. I used np.view(float32) to cast the array into adjacent floats and then tolist() to store it in the dataframe. However, we read this column multiple times for several operations, and it is not optimal to cast and recast every time. So it would be great if we could store the whole complex array directly, without any conversion.
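The cast described above might look roughly like the sketch below. It uses NumPy only and assumes the data is complex64, so that each element maps onto two float32 values; for complex128 the matching component type would be float64. The variable names are illustrative and nothing here is a cuDF API.

```python
import numpy as np

# Assumption: complex64 input, whose components are float32.
data = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)

# Reinterpret the same buffer as interleaved (real, imag) floats.
# ndarray.view does not copy, so this step needs no extra memory.
flat = data.view(np.float32)

# tolist() does copy: it builds Python floats for storage in a list cell.
stored = flat.tolist()

# Reading back: rebuild the float32 array, then view it as complex64 again.
restored = np.asarray(stored, dtype=np.float32).view(np.complex64)
```

The view itself is free, but the `tolist()` step and the rebuild on every read are copies, which matches the repeated cast and recast cost mentioned in the comment.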

arpan-das-astrophysics commented 1 year ago

Hi @GregoryKimball any update on this?

As an example, I have to run a big for loop to extract columns from a dataframe where some columns are multidimensional arrays and some even contain complex numbers. As you can see, extracting those columns makes the for loop significantly slower, and this defeats the whole purpose of using a cuDF dataframe:

[Screenshots: Screen Shot 2022-12-01 at 5 47 37 PM, Screen Shot 2022-12-01 at 5 48 43 PM]