nomic-ai / nomic

Interact, analyze and structure massive text, image, embedding, audio and video datasets
https://atlas.nomic.ai

added connector folder and HF file #313

Open abhisomala opened 4 months ago

abhisomala commented 4 months ago

HF file takes in any Huggingface identifier and then returns an AtlasDataset

Updates:

Testing:

Limitations:

:rocket: This description was created by Ellipsis for commit 9ae14f42ed2de9aad1386f93bf75883349eb6b6c

Summary:

Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.

Key points:


Generated with :heart: by ellipsis.dev

apage43 commented 4 months ago

Loads the dataset using load_dataset and assigns a unique ULID to each entry

(andriy) These should be ULIDs, not SHA-256 hashes or UUIDs. @apage43 has a nice function for these

The primary reason to avoid purely random or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so IDs created around the same time have some locality to each other, but, like UUIDs, they are still fairly large.

Big (semi-)random IDs like ULIDs are best used when you need uniqueness while also avoiding coordination, e.g. when multiple processes insert data into the same store and it would add a lot of complexity to make them cooperate to assign non-overlapping IDs. In situations where you can use purely sequential IDs, it is usually better to do so, as smaller IDs are cheaper to store and look up.
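
For illustration, here is a minimal sketch of a ULID-like ID (not the helper @apage43 mentions): a big-endian millisecond timestamp prefix gives IDs created around the same time locality in ordered indexes, while the random suffix keeps them unique without coordination.

```python
import os
import time

def ulid_like_id() -> str:
    # Hypothetical sketch, not the actual helper referenced above.
    # 48-bit big-endian millisecond timestamp: IDs minted around the same
    # time sort near each other; 80 random bits preserve uniqueness
    # without any coordination between writers.
    ts = int(time.time() * 1000).to_bytes(6, "big")
    rand = os.urandom(10)
    return (ts + rand).hex()

print(ulid_like_id())  # hex string whose prefix is ordered by creation time
```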

When using map_data, the nomic client already has functionality to create a sequential ID field (note that it is still required to be a string, so it base64-encodes the binary representation); it may make sense to copy that behavior. See here: https://github.com/nomic-ai/nomic/blob/1f042befc53892271bd0a0877070d47b2d3cb631/nomic/atlas.py#L77
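
As a rough sketch of the behavior described there (not the client's actual code), a sequential integer can be packed to bytes and base64-encoded to satisfy the string requirement:

```python
import base64
import struct

def sequential_string_id(n: int) -> str:
    # Sketch only: pack a sequential integer as 8 big-endian bytes and
    # base64-encode it, since Atlas ID fields must be strings.
    return base64.b64encode(struct.pack(">Q", n)).decode("ascii")

print([sequential_string_id(i) for i in range(3)])
# ['AAAAAAAAAAA=', 'AAAAAAAAAAE=', 'AAAAAAAAAAI=']
```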


Tested with datasets smaller than 10k rows for speed, but it can work with larger datasets

I believe this will not currently work when the dataset size exceeds available RAM on the machine running it. HF datasets understands slice syntax when specifying a split, so you can test against a portion of a very large dataset with load_dataset("really-big-dataset", split="train[:100000]") to get only the first 100k rows.
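
For example (using the placeholder identifier from above):

```python
from datasets import load_dataset

# Slice syntax on the split returns only the first 100k rows, which keeps
# the connector's in-memory processing small while testing.
sample = load_dataset("really-big-dataset", split="train[:100000]")
print(sample.num_rows)  # 100000
```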

Making it work should be possible by processing in chunks and using IterableDatasets: https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming

Here is a notebook where I upload from an IterableDataset in chunks (note, though, that because I call load_dataset and then to_iterable_dataset, this still downloads the entire dataset; you can also pass streaming=True to load_dataset to get an IterableDataset that only downloads as much as you actually read, which may be desirable if you're only working with a subset of a large dataset): https://gist.github.com/apage43/9e80b0f4378ed466ec5d1c0a4042c398
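
A minimal sketch of that chunked approach, assuming the AtlasDataset/add_data interface the client exposes elsewhere (the identifier, unique_id_field argument, and 10k chunk size are placeholders; check the current signatures):

```python
from datasets import load_dataset
from nomic import AtlasDataset

# streaming=True returns an IterableDataset that only downloads what is read.
ds = load_dataset("really-big-dataset", split="train", streaming=True)

# Hypothetical Atlas identifier; unique_id_field is assumed to be settable here.
atlas_dataset = AtlasDataset("my-org/really-big-dataset", unique_id_field="id")

buffer = []
for i, row in enumerate(ds):
    row["id"] = str(i)          # simple sequential string ID (see discussion above)
    buffer.append(row)
    if len(buffer) >= 10_000:   # upload in chunks to bound memory use
        atlas_dataset.add_data(data=buffer)
        buffer = []
if buffer:
    atlas_dataset.add_data(data=buffer)
```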

RLesser commented 4 months ago

Going off @apage43's comment, I feel strongly that we should take advantage of Hugging Face datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also take advantage of batching or chunking for arbitrarily large datasets. Using base Python iterators means this will break for larger datasets.
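
A sketch of what that could look like, assuming the dataset's Arrow backing is exposed via .data and that add_data accepts a pyarrow Table (if it does not, batch.to_pylist() is a fallback); identifiers are placeholders and the ID column is omitted for brevity:

```python
import pyarrow as pa
from datasets import load_dataset
from nomic import AtlasDataset

ds = load_dataset("really-big-dataset", split="train")        # placeholder identifier
atlas_dataset = AtlasDataset("my-org/really-big-dataset")     # placeholder identifier

# ds.data wraps the dataset's underlying Arrow table; iterate it as
# fixed-size record batches instead of converting rows to Python objects.
for batch in ds.data.to_batches(max_chunksize=10_000):
    table = pa.Table.from_batches([batch])
    atlas_dataset.add_data(data=table)   # assumed to accept an Arrow table
```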

apage43 commented 4 months ago

The current version of this has no create_index calls, so it will only create an AtlasDataset with data in it but no map - is that intended?
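
If building a map is intended, the connector would also need something along these lines after the data is added (a sketch assuming create_index still accepts an indexed_field argument; check the current client signature):

```python
from nomic import AtlasDataset

dataset = AtlasDataset("my-org/hf-connector-test")   # placeholder identifier
# ... add_data(...) calls as in the connector ...

# Without this, the AtlasDataset holds data but no map is ever built.
atlas_map = dataset.create_index(indexed_field="text")   # "text" is a placeholder field
```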