Open abhisomala opened 4 months ago
Loads the dataset with the datasets library's load_dataset function and assigns a unique ULID to each entry
(andriy) These should be ULIDs, not SHA-256 hashes or UUIDs. @apage43 has a nice function for these
The primary reason to avoid purely random or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so IDs created around the same time have some locality to each other; but, like UUIDs, they are still fairly big.
Big (semi)random IDs like ULIDs are best used when you need uniqueness while also avoiding coordination, e.g. when multiple processes insert data into something and it would add a lot of complexity to make them cooperate to assign non-overlapping IDs. In situations where you can use purely sequential IDs, it is usually better to do so, as smaller IDs are cheaper to store and look up.
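For context, a ULID is a 48-bit millisecond timestamp followed by 80 bits of randomness, encoded as 26 Crockford base32 characters; the shared timestamp prefix is what gives nearby IDs their locality. A minimal sketch of that layout (in practice a dedicated ULID library is the better choice; this is only to show the structure):

```python
import os
import time

# Crockford base32 alphabet used by ULIDs (omits I, L, O, U)
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def _b32(value: int, length: int) -> str:
    # Encode an integer as a fixed-width run of Crockford base32 characters.
    chars = []
    for _ in range(length):
        chars.append(ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def new_ulid() -> str:
    timestamp_ms = int(time.time() * 1000)              # 48-bit timestamp prefix
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 bits of randomness
    return _b32(timestamp_ms, 10) + _b32(randomness, 16)

# IDs minted around the same time share a prefix, so they cluster in ordered indexes.
print(new_ulid(), new_ulid())
```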
When using map_data, the nomic client already has functionality to create a sequential ID field (note that it is still required to be a string, so it base64-encodes the binary representation); it may make sense to copy that behavior. See here: https://github.com/nomic-ai/nomic/blob/1f042befc53892271bd0a0877070d47b2d3cb631/nomic/atlas.py#L77
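For illustration only (the linked nomic code is the source of truth), a sequential-integer-to-string scheme along those lines might look like:

```python
import base64
import struct

def sequential_string_ids(start: int, count: int):
    # Pack each sequential integer into 8 big-endian bytes, then base64-encode it
    # so the ID satisfies the "must be a string" requirement while staying compact.
    for i in range(start, start + count):
        yield base64.b64encode(struct.pack(">Q", i)).decode("ascii")

print(list(sequential_string_ids(0, 3)))
# ['AAAAAAAAAAA=', 'AAAAAAAAAAE=', 'AAAAAAAAAAI=']
```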
Tested with datasets smaller than 10k rows for speed, but it can work with larger datasets
I believe this will not currently work when the dataset size exceeds the available RAM on the machine running it. HF datasets understands slice syntax when specifying a split, so you can test with a portion of a very large dataset: load_dataset("really-big-dataset", split="train[:100000]") gets only the first 100k rows.
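For example, with the import included (the dataset name is a placeholder):

```python
from datasets import load_dataset

# Slice syntax in the split string limits how many rows are materialized,
# which keeps local testing fast; "really-big-dataset" is a placeholder.
subset = load_dataset("really-big-dataset", split="train[:100000]")
print(len(subset))
```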
Making it work should be possible by working in chunks and using IterableDatasets: https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming
Here is a notebook where I'm uploading from an IterableDataset in chunks (note, though, that because I call load_dataset and then to_iterable_dataset, this still downloads the entire dataset; you can also pass streaming=True to load_dataset to get an IterableDataset that only downloads as much as you actually read, which may be desirable if you're only working with a subset of a large dataset): https://gist.github.com/apage43/9e80b0f4378ed466ec5d1c0a4042c398
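A minimal sketch of the chunked-upload pattern with streaming=True (the gist above is the working reference; the dataset name, chunk size, and upload helper here are placeholders):

```python
from datasets import load_dataset

def upload_chunk(rows):
    # Placeholder for the real upload step, e.g. add_data on an existing AtlasDataset.
    print(f"uploading {len(rows)} rows")

# streaming=True returns an IterableDataset that only downloads rows as you read them.
stream = load_dataset("really-big-dataset", split="train", streaming=True)

batch, batch_size = [], 10_000
for row in stream:
    batch.append(row)
    if len(batch) == batch_size:
        upload_chunk(batch)
        batch = []
if batch:
    upload_chunk(batch)  # flush the final partial chunk
```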
Going off @apage43's comment, I feel strongly that we should take advantage of Hugging Face datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also take advantage of batching or chunking for arbitrarily large datasets; using plain Python iterators means this will break on larger datasets.
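As a rough sketch of the Arrow path (assuming the datasets wrapper's underlying pyarrow Table can be reached via ds.data.table, and with the dataset name and the commented-out upload call as placeholders):

```python
import pyarrow as pa
from datasets import load_dataset

ds = load_dataset("really-big-dataset", split="train")

# Unwrap the datasets table wrapper to the underlying pyarrow.Table (assumption),
# then hand the data over in record-batch-sized chunks instead of Python dicts.
arrow_table = ds.data.table
for record_batch in arrow_table.to_batches(max_chunksize=10_000):
    chunk = pa.Table.from_batches([record_batch])
    # atlas_dataset.add_data(data=chunk)  # add_data accepts Arrow tables (see below)
    print(chunk.num_rows)
```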
The current version of this has no create_index calls, so it will only create an AtlasDataset with data in it but no map. Is that intended?
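If a map is intended, something along these lines would need to run after the data is added. This is a sketch only, assuming the connector's hf_atlasdataset entry point and a create_index signature with an indexed_field parameter (both should be checked against the actual code and nomic client version):

```python
from connectors.huggingface_connector import hf_atlasdataset

# "username/dataset-name" and indexed_field="text" are placeholders.
atlas_dataset = hf_atlasdataset("username/dataset-name")
atlas_dataset.create_index(indexed_field="text")  # builds the map on the uploaded data
```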
The HF connector takes in any Hugging Face dataset identifier and returns an AtlasDataset
Updates:
Testing:
Limitations:
Image files
Summary:
Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.
Key points:
- connectors/huggingface_connector.py: new connector module
- connectors/huggingface_connector.get_hfdata to load datasets and handle configuration issues
- connectors/huggingface_connector.hf_atlasdataset to create an AtlasDataset
- connectors/huggingface_connector.convert_to_string and connectors/huggingface_connector.process_table in connectors/huggingface_connector.py
- connectors/__init__.py and examples/HF_example_usage.py
- add_data accepts arrow tables directly

Generated with :heart: by ellipsis.dev