nomic-ai / nomic

Interact, analyze and structure massive text, image, embedding, audio and video datasets
https://atlas.nomic.ai

Matryoshka and Binary Options for Sagemaker #296

Closed rguo123 closed 5 months ago

rguo123 commented 5 months ago

Adds options to return binary or reduced-dimension (Matryoshka) embeddings from SageMaker. Updates the example notebook to show both the binary and Matryoshka options.
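For readers unfamiliar with the two options, here is a minimal sketch of what reducing dimensions and binarizing an embedding can look like on the client side. It is not the code from this PR: the function names `truncate_embedding` and `binarize_embedding` are hypothetical, and the choices made here (re-normalize after truncation, threshold at zero, pack 8 bits per byte) are assumptions for illustration only.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result.

    Hypothetical helper: assumes the model was trained so that leading
    components carry most of the information (Matryoshka-style).
    """
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


def binarize_embedding(vec: List[float]) -> bytes:
    """Map each component to one bit (1 if > 0, else 0), 8 bits per byte.

    Bits are packed most-significant first; a final partial byte is
    padded with zeros on the right.
    """
    packed = bytearray()
    for start in range(0, len(vec), 8):
        byte = 0
        for offset, value in enumerate(vec[start:start + 8]):
            if value > 0:
                byte |= 1 << (7 - offset)
        packed.append(byte)
    return bytes(packed)
```

Under these assumptions, a 768-dimensional float embedding truncated to 256 dimensions and then binarized would occupy 32 bytes.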

Note: I did not add this for batch transform, as it would require users to know how to format their CSVs to pass the parameters.

rguo123 commented 5 months ago

@zanussbaum we can add! There is no CSV standard from what I can tell. We just need to be able to handle the expected format. Our format is already somewhat different from old SageMaker, given that a user has to know to upload a single-column CSV of texts in our case.

What I can add is an option to have the first row be columns of parameter values. We'll check for this condition and, if true, we'll attempt to parse the dimensionality and binary option. If that fails, we can catch the exception and treat it as a text row.
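A rough sketch of that parsing logic, to make the proposal concrete. Everything specific here is an assumption rather than the agreed format: the function names `parse_batch_csv` and `_parse_params` are hypothetical, and the parameter row is assumed to be exactly two columns, a positive integer dimensionality followed by `true` or `false` for the binary option.

```python
import csv
import io
from typing import List, Optional, Tuple


def _parse_params(row: List[str]) -> Tuple[int, bool]:
    """Parse a hypothetical parameter row: [dimensionality, binary flag].

    Raises ValueError if the row does not look like parameters.
    """
    if len(row) != 2:
        raise ValueError("parameter row must have exactly two columns")
    dimensionality = int(row[0])  # raises ValueError for non-integers
    if dimensionality <= 0:
        raise ValueError("dimensionality must be positive")
    flag = row[1].strip().lower()
    if flag not in ("true", "false"):
        raise ValueError("binary flag must be 'true' or 'false'")
    return dimensionality, flag == "true"


def parse_batch_csv(payload: str) -> Tuple[Optional[int], bool, List[str]]:
    """Return (dimensionality, binary, texts) from a single-column CSV.

    If the first row parses as parameters, it is consumed; otherwise it
    is treated as an ordinary text row and the defaults are kept.
    """
    rows = [row for row in csv.reader(io.StringIO(payload)) if row]
    dimensionality: Optional[int] = None
    binary = False
    if rows:
        try:
            dimensionality, binary = _parse_params(rows[0])
            rows = rows[1:]
        except ValueError:
            pass  # Not a parameter row: fall through and keep it as text.
    # Texts live in the first column; texts containing commas must be quoted.
    texts = [row[0] for row in rows]
    return dimensionality, binary, texts
```

One caveat with this approach: a genuine text row that happens to look like `256,true` would be consumed as parameters, which is one more reason to document the expected format.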

How does that sound to you?

zanussbaum commented 5 months ago

yeah that sounds great! my only request is that we document this somewhere either in the doc strings or in an example readme somewhere so it's clear :D