Closed by severo 1 year ago
I was thinking that maybe we can use the same approach as split-descriptive-statistics: temporarily download the parquet files (using auth for gated datasets) and then load them into the duckdb index.
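A minimal sketch of that approach, not the actual worker code. The names `auth_headers`, `download_parquet`, `sql_string_list` and `build_index`, the table name `data`, and the use of plain `urllib` are all assumptions made for illustration; `duckdb` is a third-party package and is imported lazily so the helpers can be read and tested without it.

```python
import urllib.request
from pathlib import Path


def auth_headers(token=None):
    """Build request headers; a bearer token is only needed for gated datasets."""
    return {"Authorization": f"Bearer {token}"} if token else {}


def download_parquet(url, dest, token=None):
    """Stream one parquet file to the local path `dest` and return that path."""
    request = urllib.request.Request(url, headers=auth_headers(token))
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        while chunk := response.read(1 << 20):  # copy in 1 MiB chunks
            out.write(chunk)
    return Path(dest)


def sql_string_list(paths):
    """Render paths as a DuckDB list literal, doubling any single quotes."""
    quoted = ", ".join("'" + str(p).replace("'", "''") + "'" for p in paths)
    return f"[{quoted}]"


def build_index(urls, db_path, work_dir, token=None):
    """Download every parquet file, load them into one DuckDB table, clean up."""
    import duckdb  # third-party dependency, assumed to be installed

    # Prefix with the position so files that share a name do not collide.
    paths = [
        download_parquet(url, Path(work_dir) / f"{i:04d}.parquet", token)
        for i, url in enumerate(urls)
    ]
    con = duckdb.connect(str(db_path))
    try:
        con.execute(
            "CREATE OR REPLACE TABLE data AS "
            f"SELECT * FROM read_parquet({sql_string_list(paths)})"
        )
    finally:
        con.close()
        for path in paths:  # the downloads are only temporary
            path.unlink(missing_ok=True)
```

Redirect handling is left to the `urllib` defaults here; real code would more likely rely on `huggingface_hub` for authenticated downloads.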
Yes, maybe it's even faster that way. Did you compare?
I compared, and the two look similar, with a difference of a couple of seconds:
| URL | Size (MB) | Download time (s) | DuckDB time (s) |
|---|---|---|---|
| https://huggingface.co/datasets/amazon_us_reviews/resolve/refs%2Fconvert%2Fparquet/Baby_v1_00/amazon_us_reviews-train-00000-of-00002.parquet | 274 | 188.66 | 161.86 |
| https://huggingface.co/datasets/asoria/copy_beans/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet | 17.7 | 8.33 | 8.13 |
| https://huggingface.co/datasets/IlyaGusev/yandex_q_full/resolve/refs%2Fconvert%2Fparquet/default/partial-train/0008.parquet | 260 | 88.7 | 96 |
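To read the gap off the table more directly, here is a small sketch that computes the absolute and relative difference per file. The numbers are copied from the table; the short labels and the variable names are only for illustration.

```python
# (label, size in MB, download time in s, duckdb time in s), copied from the table
rows = [
    ("amazon_us_reviews", 274.0, 188.66, 161.86),
    ("asoria/copy_beans", 17.7, 8.33, 8.13),
    ("IlyaGusev/yandex_q_full", 260.0, 88.7, 96.0),
]

summary = []
for label, size_mb, download_s, duckdb_s in rows:
    diff_s = round(download_s - duckdb_s, 2)       # positive: duckdb was faster
    diff_pct = round(100 * (download_s - duckdb_s) / download_s, 1)
    summary.append((label, diff_s, diff_pct))

for label, diff_s, diff_pct in summary:
    print(f"{label}: {diff_s:+.2f} s ({diff_pct:+.1f}% of download time)")
```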
I think we can add an exception in case it was not possible to use duckdb to load the data and try to download the parquet files.
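A rough sketch of what that fallback could look like. `load_with_fallback` and its two callables are hypothetical names, not existing functions in the codebase; the callables stand in for "load remotely with duckdb" and "download the parquet files, then load them".

```python
import logging


def load_with_fallback(urls, load_remote, load_from_download):
    """Try loading through duckdb directly; on any failure, download instead.

    Returns the name of the strategy that succeeded.
    """
    try:
        load_remote(urls)
        return "duckdb"
    except Exception as err:  # e.g. auth not supported for gated datasets
        logging.warning("duckdb remote load failed (%s), downloading files", err)
        load_from_download(urls)
        return "download"
```

The downside is that two code paths have to be maintained, compared with always downloading the files.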
The difference is small. I would say: until the auth is available in duckdb, it's better to just switch to downloading the files. The code will be easier to maintain.
BTW re. DuckDB maybe you can send a quick email too (on the thread we had) to make sure it's on their radar
@AndreaFrancis very interesting table, can you split the "download" case into `download` and `query`?
oh actually nevermind, i thought this was about query time (ie. at search time), between downloading + querying locally, vs. querying remotely
For the workers i think we don't care that much about perf
> querying remotely
and i think we cannot do this anyway, the index needs to be fully local
from Slack (internal)