NickCrews opened 7 months ago
> Are you interested in these abstractions

Quite possibly. :) But I'm trying to figure out...

> should we try to use these more throughout this lib?

How would we use them, in practice? They're currently defined on map or array types, but most of the transformations in Ibis-ML as-is assume data is separated by column.
As a practical example, I'm currently POCing PCA. For the transform bit (fit is out of scope, handled by scikit-learn or something), you need matrix multiplication. Let's say we do it on the numeric columns of the penguins dataset, so `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`.

If I have 3 components, I'll have a 4x3 ndarray to multiply against. There are (at least) two possibilities:

1. Do the multiplication manually, column by column, producing three new columns `c1`, `c2`, and `c3`. The SQL will look semi-ugly, but any database should handle the 3 summed multiplications with ease. Perhaps it's a minor annoyance that I have 3 new columns in the output, or is it a feature? :)
2. If the backend has a builtin `list_dot_product` (as in the case of DuckDB), this could be nice? I'm not 100% sure if it would be more performant, but I can benchmark it. But what do we do for all the backends that don't support this? Maybe we implement a UDF, but that would almost certainly be slower than option 1 (right? 😅).

Obviously, your code opens up the possibility of multiplying unknown vectors, and vectors of different lengths, but I need to figure out where this would be used directly. I suppose support for normalization would be an obvious one. Also, I didn't touch on the use cases where sparse support would be nice.
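For concreteness, here's what option 1 computes, as a plain-Python sketch. The loadings matrix `W` and its values are made up for illustration; a real fit would come from scikit-learn.

```python
# Hypothetical fitted PCA loadings: 4 input columns x 3 components.
# These values are made up purely for illustration.
W = [
    [0.5, 0.1, 0.2],   # bill_length_mm
    [0.3, 0.4, 0.1],   # bill_depth_mm
    [0.1, 0.3, 0.6],   # flipper_length_mm
    [0.1, 0.2, 0.1],   # body_mass_g
]

def pca_transform(row):
    # Option 1: each output column c_j is a sum of 4 multiplications,
    # exactly the arithmetic the generated SQL would spell out.
    return [sum(row[i] * W[i][j] for i in range(4)) for j in range(3)]

c1, c2, c3 = pca_transform([39.1, 18.7, 181.0, 3750.0])
```

Each of `c1`, `c2`, `c3` becomes one new column in the output table.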
Do you have thoughts on where this would best be leveraged? Any use cases you're facing that it would help address?
> 3 new columns in the output
I think from a UX perspective I would prefer everything to be nicely packaged in a column of arrays. For 3 columns it feels like a toss-up. I don't have a whole lot of experience, but isn't it more common for PCA to go on the order of 1000s of dimensions -> 10s or 100s of dimensions? I would be annoyed with 50 columns in my table just from a preview perspective, especially because after PCA those columns are in some uninterpretable latent space so I'm not really going to want to inspect the values very often.
> or is it a feature?

Accessing the end results as `components[4]` instead of `component_4` feels like a tossup to me. I might prefer `components[4]` since it is easier to do programmatically.
> I'm not 100% sure if it would be more performant

My first thought is that if there is a builtin `list_dot_product(a, b)`, it is going to be at least as performant as `a_0 * b_0 + a_1 * b_1 ...`.
> backends that don't support this

Could we just fall back to `a[0] * b[0] + a[1] * b[1] ...`? Basically the same manual SQL you have now, but instead of pulling from different columns, we pull from array elements?
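A sketch of what that fallback expansion could look like. The function name is illustrative, and it assumes 0-based array indexing; some SQL dialects index arrays from 1.

```python
def dot_sql(a, b, n):
    # Emit the manual SQL a backend without list_dot_product would run,
    # pulling from array elements instead of separate columns.
    # Assumes the vector length n is known at expression-build time.
    return " + ".join(f"{a}[{i}] * {b}[{i}]" for i in range(n))

dot_sql("a", "b", 3)  # 'a[0] * b[0] + a[1] * b[1] + a[2] * b[2]'
```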
> Any use cases you're facing that it would help address?
In https://github.com/NickCrews/mismo/commit/df57a37642edeb41d945a000b1f5a6228b4d72c1 I am working towards a TF-IDF transformer for text (e.g. to compare two documents based on their token overlap). That requires a sparse vector representation because the vocabulary of possible words/tokens/ngrams is huge.
Followup if we want really optimized string support in Ibis-ML: it's interesting that spacy optimizes this with an extra layer of indirection, so that `{"dog": 3, "cat": 2}` is actually stored as `{745: 3, 934: 2}`, and then there is a separate global vocab lookup that looks like `{745: "dog", 934: "cat"}`. This means that the string "dog" only needs to get stored once, and then each vector uses much cheaper integer keys.
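A minimal sketch of that indirection, using the ids from the example above (a real system like spacy hashes or interns tokens rather than maintaining a hand-built table):

```python
# One global vocab, stored once; every sparse vector shares it.
vocab = {745: "dog", 934: "cat"}
ids = {tok: i for i, tok in vocab.items()}  # reverse lookup for encoding

def encode(counts):
    # Swap expensive string keys for cheap integer keys.
    return {ids[tok]: n for tok, n in counts.items()}

encode({"dog": 3, "cat": 2})  # {745: 3, 934: 2}
```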
**performance**

Let's assume there are features `a` to `z`, and we have documents 0 to N. I'm not positive here, but my understanding is that for all columnar-store backends (which is what we are targeting here; IDK, do we want people to be doing analytics in postgres? It would be much better to encourage users to use duckdb to read the postgres data and then do the calculations in duckdb), the memory layout for the different implementations would be:

- Separate columns: `a0 a1 ... aN` in one contiguous range, then `b0 b1 ... bN` in a different place in memory, and so on.
- A single array column: one contiguous range `a0 b0 ... z0 a1 b1 ... z1 ... aN bN ... zN`, and then another contiguous range holding the index offsets.

I'm guessing this difference in memory layout would be the main cause for any performance differences, but I'm not sure. This would be worth asking some duckdb engineer, and then benchmarking to confirm.
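To make the second layout concrete, here's a pure-Python stand-in for the values-plus-offsets representation. This mirrors Arrow's List layout; whether DuckDB does exactly this internally is the thing to ask that engineer.

```python
# Three "documents" with feature vectors of varying length.
rows = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

# Array-column layout: one contiguous values buffer, plus a second
# contiguous buffer of index offsets (Arrow-style List layout).
values = [v for row in rows for v in row]
offsets = [0]
for row in rows:
    offsets.append(offsets[-1] + len(row))

# Row i lives at values[offsets[i]:offsets[i+1]]
row1 = values[offsets[1]:offsets[2]]  # [3.0, 4.0, 5.0]
```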
> I don't have a whole lot of experience, but isn't it more common for PCA to go on the order of 1000s of dimensions -> 10s or 100s of dimensions?
Fair point. I think for visualization you may use a small number of components, but more generally for ML you'd use a high number (until it stops yielding value).
> My first thought is that if there is a builtin `list_dot_product(a, b)`, it is going to be at least as performant as `a_0 * b_0 + a_1 * b_1 ...`
FYI, the reason I'm not 100% sure is because `list_dot_product` is more flexible; one row can have `a` and `b` of length 4, another can have `a` and `b` of length 400.
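That flexibility looks like this in a Python sketch (the builtin would do the same per-row work natively, without needing the length at expression-build time):

```python
def list_dot_product(a, b):
    # Per-row dot product: unlike the fixed-column expansion, the
    # vector length may differ from row to row.
    if len(a) != len(b):
        raise ValueError("length mismatch within a row")
    return sum(x * y for x, y in zip(a, b))

rows = [([1, 2, 3, 4], [1, 0, 1, 0]),   # this row holds length-4 vectors
        ([2] * 400, [1] * 400)]          # this one, length-400
[list_dot_product(a, b) for a, b in rows]  # [4, 800]
```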
> Any use cases you're facing that it would help address?

> In https://github.com/NickCrews/mismo/commit/df57a37642edeb41d945a000b1f5a6228b4d72c1 I am working towards a TF-IDF transformer for text (e.g. to compare two documents based on their token overlap). That requires a sparse vector representation because the vocabulary of possible words/tokens/ngrams is huge.
IMO this is something that could be particularly interesting to eventually add in Ibis-ML! :)
My ideas for conclusions:

- If we want to support text features, we need some sort of sparse vector representation. Maps seem like the obvious way?
- For dense vectors, perhaps we port over more sklearn features and see how they feel API-wise, as well as benchmark them, before making a decision.
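A sketch of how the map representation would support the key operation: TF-IDF similarity boils down to a sparse dot product, where only keys present in both maps contribute.

```python
def sparse_dot(u, v):
    # Dot product of two sparse vectors stored as maps (token -> weight).
    if len(v) < len(u):
        u, v = v, u          # iterate over the smaller map
    return sum(w * v[k] for k, w in u.items() if k in v)

doc1 = {"dog": 0.5, "cat": 0.2}
doc2 = {"dog": 0.1, "fish": 0.9}
sparse_dot(doc1, doc2)  # 0.05: only "dog" overlaps
```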
I think these conclusions make sense! Let me also see if I can scrounge up a couple more opinions.
@tonysun9 would love to get some feedback from you since we previously discussed some NLP ideas. @NickCrews @deepyaman how do you both envision Ibis abstracting vector/embedding storage, either as a dependency/requirement for this, or how we intend to work on use cases that are bigger than memory limitations?
Looks like a good start! Putting down some quick thoughts.
The `dot` function looks good to me. Quick notes (just food for thought):

- There's the `@` operator for matrix multiplication; when applied to vectors, this returns the dot product.
- I'm used to seeing the `norm` function return a scalar when applied to a 1D vector, e.g. in numpy, torch.
> I'm used to seeing the norm function return a scalar

I agree, I think I picked a bad name. My function above should be called `normalize`. There should probably be a new function called `norm()` (or `magnitude` etc. to avoid confusing these two?) that would return a scalar.
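In plain Python terms, the split would look like this, with the names `norm`/`normalize` as proposed above:

```python
import math

def norm(v):
    # Scalar magnitude, matching the numpy/torch meaning of "norm".
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    # Unit-length vector: what the earlier, misnamed function computed.
    m = norm(v)
    return [x / m for x in v]

norm([3.0, 4.0])       # 5.0
normalize([3.0, 4.0])  # [0.6, 0.8]
```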
**matrices**

No idea how to implement this in Ibis. Matrices sort of want there to be some symmetry between rows and columns. At best I see matrices as being awkward to represent, and at worst (and I see this as likely) being very inefficient to implement in SQL.

IDK, maybe this is where arrow UDFs come to the rescue? Matrices are represented as `ArrayColumn`s in ibis, which turn into arrow arrays of Lists, then the computation happens (I assume this memory layout would be reasonable to manipulate?), then it gets passed back to ibis where it is restored to an `ArrayColumn`?
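Sketching just the computation such a UDF would perform; the plumbing through ibis/arrow is the open question, and this is only the per-batch kernel once the data arrives as plain lists-of-lists.

```python
def matvec(matrix_rows, vec):
    # Multiply each row of the matrix by the vector: the kernel an
    # arrow UDF would apply before handing results back to ibis.
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix_rows]

matvec([[1, 0], [0, 2]], [3, 4])  # [3, 8]
```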
I have these existing vector abstractions in ibis:
I also have tests. Are you interested in these abstractions? Should we try to use these more throughout this lib? I can submit a PR if so.