ibis-project / ibis-ml

IbisML is a library for building scalable ML pipelines using Ibis.
https://ibis-project.github.io/ibis-ml/
Apache License 2.0
88 stars 13 forks source link

interest in vector abstractions? #26

Open NickCrews opened 7 months ago

NickCrews commented 7 months ago

I have these existing vector abstractions in ibis:

```python """Vector operations""" from __future__ import annotations from typing import Literal, TypeVar import ibis import ibis.expr.types as ir @ibis.udf.scalar.builtin(name="array_sum") def _array_sum(a) -> float: ... # for duckdb @ibis.udf.scalar.builtin(name="list_dot_product") def _array_dot_product(a, b) -> float: ... T = TypeVar("T", ir.MapValue, ir.ArrayValue) def dot(a: T, b: T) -> ir.FloatingValue: """Compute the dot product of two vectors The vectors can either be dense vectors, represented as array, or sparse vectors, represented as map. Both vectors must be of the same type though. Parameters ---------- a : The first vector. b : The second vector. Returns ------- FloatingValue The dot product of the two vectors. Examples -------- >>> import ibis >>> from mismo._compute import dot >>> v1 = ibis.array([1, 2]) >>> v2 = ibis.array([4, 5]) >>> dot(v1, v2) 14.0 # 1*4 + 2*5 >>> m1 = ibis.map({"a": 1, "b": 2}) >>> m2 = ibis.map({"b": 3, "c": 4}) >>> dot(m1, m2) 6.0 # 2*3 """ if isinstance(a, ir.ArrayValue) and isinstance(b, ir.ArrayValue): a_vals = a b_vals = b elif isinstance(a, ir.MapValue) and isinstance(b, ir.MapValue): keys = a.keys().intersect(b.keys()) a_vals = keys.map(lambda k: a[k]) b_vals = keys.map(lambda k: b[k]) else: raise ValueError(f"Unsupported types {type(a)} and {type(b)}") return _array_dot_product(a_vals, b_vals) def norm(arr: T, metric: Literal["l1", "l2"] = "l2") -> T: """Normalize a vector to have unit length. The vector can either be a dense vector, represented as array, or a sparse vector, represented as map. The returned vector will have the same type as the input vector. Parameters ---------- arr : The vector to normalize. metric : {"l1", "l2"}, default "l2" The metric to use. "l1" for Manhattan distance, "l2" for Euclidean distance. Returns ------- ArrayValue The normalized vector. Examples -------- >>> import ibis >>> ibis.options.interactive = True >>> from mismo._compute import norm >>> norm(ibis.array([1, 2])) [0.4472135954999579, 0.8944271909999159] >>> norm(ibis.array([1, 2]), metric="l1") [0.3333333333333333, 0.6666666666666666] >>> norm(ibis.map({"a": 1, "b": 2})) {"a": 0.4472135954999579, "b": 0.8944271909999159} """ if isinstance(arr, ir.ArrayValue): vals = arr elif isinstance(arr, ir.MapValue): vals = arr.values() else: raise ValueError(f"Unsupported type {type(arr)}") if metric == "l1": denom = _array_sum(vals) elif metric == "l2": denom = _array_sum(vals.map(lambda x: x**2)).sqrt() else: raise ValueError(f"Unsupported norm {metric}") normed_vals = vals.map(lambda x: x / denom) if isinstance(arr, ir.ArrayValue): return normed_vals else: return ibis.map(arr.keys(), normed_vals) ```

Also have tests. Are you interested in these abstractions, should we try to use these more throughout this lib? I have some tests too. I can submit a PR if so.

deepyaman commented 7 months ago

Are you interested in these abstractions

Quite possibly. :) But I'm trying to figure out...

should we try to use these more throughout this lib?

How would we use them, in practice? They're currently defined on map or array types, but most of the transformations in Ibis-ML as-is assume data is separated by column.

As a practical example, I'm currently POCing PCA. For the transform bit (fit is out of scope, handled by scikit-learn or something), you need matrix multiplication. Let's say we do it on the numeric columns of the penguins dataset, so bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

If I have 3 components, I'll have a 4x3 ndarray to multiply against. There are (at least) two possibilities:

  1. What I am currently doing: Generate SQL for the three resulting columns, c1, c2, and c3. The SQL will look semi-ugly, but any database should handle the 3 summed multiplications with ease. Perhaps it's a minor annoyance that I have 3 new columns in the output, or is it a feature? :)
  2. What could be done with the code you shared: Condense the 4 input columns into an array (or map), and multiply by an array/map literal constructed from the components. If the backend supports list_dot_product (as in the case of DuckDB), this could be nice? I'm not 100% sure if it would be more performant, but can benchmark it. But what do we do for all the backends that don't support this? Maybe we implement a UDF, but that would almost certainly be slower than option 1 (right? 😅).

Obviously, your code opens up the possibility of multiplying unknown vectors, and vectors of different lengths, but need to figure out where this would be used directly. I suppose support for normalization would be an obvious one. Also, didn't touch on the use cases where sparse support would be nice.

Do you have thoughts on where this would best be leveraged? Any use cases you're facing that it would help address?

NickCrews commented 6 months ago

3 new columns in the output

I think from a UX perspective I would prefer everything to be nicely packaged in a column of arrays. For 3 columns it feels like a toss-up. I don't have a whole lot of experience, but isn't it more common for PCA to go on the order of 1000s of dimensions -> 10s or 100s of dimensions? I would be annoyed with 50 columns in my table just from a preview perspective, especially because after PCA those columns are in some uninterpretable latent space so I'm not really going to want to inspect the values very often.

or is it a feature?

Accessing the end results as components[4] instead of component_4 feels like a tossup to me. I might prefer the components[4] since it is easier to do programatically.

I'm not 100% sure if it would be more performant

My first thought is if there is a builtin list_dot_product(a, b), that is going to be more/same performant than a_0 * b_0 + a_1 * b_1 ...

backends that don't support this

Could we just fall back to a[0] * b[0] + a[1] * b[1] ...? Basically the same manual SQL you have now, but instead of pulling from different columns, we pull from array elements?

Any use cases you're facing that it would help address?

In https://github.com/NickCrews/mismo/commit/df57a37642edeb41d945a000b1f5a6228b4d72c1 I am working towards a TF-IDF transformer for text (eg to compare two documents based on their token overlap). That requires a sparse vector representation because the vocabulary of possible words/tokens/ngrams is huge.

Followup if we want really optimized string support in ibisml: It's interesting that spacy optimizes this, and has an extra layer of indirection, so that {"dog": 3, "cat": 2"} is actually stored as {745: 3, 934: 2}, and then there is a separate global vocab lookup that looks like {745: "dog", 934: "cat"}. This means that the string "dog" only needs to get stored once, and then each vector uses much cheaper integer keys.

performance

Lets assume there are features a to z, and we have documents 0 to N. I'm not positive here, but my understanding that for all columnar-store backends (which is what we are targeting here, IDK do we want people do be doing analytics in postgres? would be much better to encourage users to use duckdb to read the postgres and then do the calculations in duckdb), the memory layout for these different implementations would be

I'm guessing this difference in memory layout would be the main cause for any performance differences, but I'm not sure. This would be worth asking some duckdb engineer, and then benchmarking to confirm.

NickCrews commented 6 months ago

My ideas for conclusions:

deepyaman commented 6 months ago

I don't have a whole lot of experience, but isn't it more common for PCA to go on the order of 1000s of dimensions -> 10s or 100s of dimensions?

Fair point. I think for visualization you may use a small number of components, but more generally for ML you'd use a high number (until it stops yielding value).

My first thought is if there is a builtin list_dot_product(a, b), that is going to be more/same performant than a_0 * b_0 + a_1 * b_1 ...

FYI the reason I'm not 100% sure, is because list_dot_product is more flexible; one row can have a and b of length 4, another can have a and b of length 400.

Any use cases you're facing that it would help address?

In https://github.com/NickCrews/mismo/commit/df57a37642edeb41d945a000b1f5a6228b4d72c1 I am working towards a TF-IDF transformer for text (eg to compare two documents based on their token overlap). That requires a sparse vector representation because the vocabulary of possible words/tokens/ngrams is huge.

IMO this is something that could be particularly interesting to eventually add in Ibis-ML! :)

deepyaman commented 6 months ago

My ideas for conclusions:

  • if we want to support text features, we need some sort of sparse vector representation. Maps seem like the obvious way?
  • For dense vectors, perhaps we port over more sklearn features and see how they feel API wise, as well as benchmark them, before making a decision.

I think these conclusions make sense! Let me also see if I can scrounge up a couple more opinions.

zhenzhongxu commented 6 months ago

@tonysun9 would love to get some feedback from you since we previously discussed some NLP ideas. @NickCrews @deepyaman how do you both envision Ibis abstracting vector/embedding storage either as a dependency/requirement for this or how we intend to work on use cases that's bigger than memory limitations?

tonysun9 commented 6 months ago

Looks like a good start! Putting down some quick thoughts.

The dot function looks good to me. Quick note (just food for thought):

I'm used to seeing the norm function return a scalar when applied to a 1D vector. e.g. numpy, torch

NickCrews commented 6 months ago

I'm used to seeing the norm function return a scalar

I agree I think a picked a bad name. My function above should be called normalize. There should probably be a new function called norm() (or magnitude etc to avoid confusing these two?) that would return a scalar.

matrices

No idea how to implement this in Ibis. matrices sort of want there to be some symmetry between rows and columns. At best I see matrices as being awkward to represent, and at worst (and I see this as likely) being very inefficient to implement in SQL.

IDK, maybe this is where arrow UDFs come to the rescue? matrices are represented as ArrayColumns in ibis, which turn into arrow arrays of Lists, then the computation happens (I assume this memory layout would be reasonable to manipulate?), then it gets passed back to ibis where it is restored to an ArrayColumn?