Enable duckdb index on gated datasets

huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.

https://huggingface.co/docs/dataset-viewer

Apache License 2.0

675 stars 73 forks source link

Enable duckdb index on gated datasets #1686

Closed severo closed 1 year ago

severo commented 1 year ago

Currently, duckdb index is not supported for gated/private datasets. I opened a question in duckdb foundations but didn't receive a response yet, I think I will open an issue in the repo https://github.com/duckdb/foundation-discussions/discussions/16

from Slack (internal)

AndreaFrancis commented 1 year ago

I was thinking that maybe we can use the same approach as split-descriptive-statistics, temporarily download the parquets using auth for gated datasets and then load it to the duckdb index.

severo commented 1 year ago

Yes, maybe it's even faster that way. Did you compare?

AndreaFrancis commented 1 year ago

I compared and looks to be similar, with a difference of a couple of seconds:

URL	Size MB	Time (seconds) - download	Time (seconds) - duckdb
https://huggingface.co/datasets/amazon_us_reviews/resolve/refs%2Fconvert%2Fparquet/Baby_v1_00/amazon_us_reviews-train-00000-of-00002.parquet	274	188.66	161.86
https://huggingface.co/datasets/asoria/copy_beans/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet	17.7	8.33	8.13
https://huggingface.co/datasets/IlyaGusev/yandex_q_full/resolve/refs%2Fconvert%2Fparquet/default/partial-train/0008.parquet	260	88.7	96

I think we can add an exception in case it was not possible to use duckdb to load the data and try to download the parquet files.

severo commented 1 year ago

The difference is small. I would say: until the auth is available in duckdb, it's better to just switch to downloading the files. The code will be easier to maintain.

AndreaFrancis commented 1 year ago

julien-c commented 1 year ago

BTW re. DuckDB maybe you can send a quick email too (on the thread we had) to make sure it's on their radar

julien-c commented 1 year ago

@AndreaFrancis very interesting table – can you split the "download" case in download and query?

julien-c commented 1 year ago

oh actually nevermind, i thought this was about query time (ie. at search time), between downloading + querying locally, vs. querying remotely

For the workers i think we don't care that much about perf

julien-c commented 1 year ago

querying remotely

and i think we cannot do this anyway, the index needs to be fully local