huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0
672 stars 72 forks source link

Should we support /filter on columns that contain SQL commands? #2649

Open severo opened 5 months ago

severo commented 5 months ago

See the schema column on https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k. Clicking on any of the 'classes' leads to an error

Capture d’écran 2024-03-28 à 15 11 50

The erroneous URL is:

https://datasets-server.huggingface.co/filter?dataset=motherduckdb%2Fduckdb-text2sql-25k&config=default&split=train&offset=0&length=100&where=schema%3D%27CREATE+TABLE+%22venue%22+%28%0A++%22venueId%22+INTEGER+NOT+NULL%2C%0A++%22venueName%22+VARCHAR%28100%29%2C%0A++%22venueInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22venueId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22author%22+%28%0A++%22authorId%22+INTEGER+NOT+NULL%2C%0A++%22authorName%22+VARCHAR%2850%29%2C%0A++%22authorPublications%22+INT%5B%5D%2C%0A++PRIMARY+KEY+%28%22authorId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22dataset%22+%28%0A++%22datasetId%22+INTEGER+NOT+NULL%2C%0A++%22datasetName%22+VARCHAR%2850%29%2C%0A++%22datasetInfo%22+STRUCT%28v+VARCHAR%2C+i+INTEGER%29%2C%0A++PRIMARY+KEY+%28%22datasetId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22journal%22+%28%0A++%22journalId%22+INTEGER+NOT+NULL%2C%0A++%22journalName%22+VARCHAR%28100%29%2C%0A++%22journalInfo%22+MAP%28INT%2C+DOUBLE%29%2C%0A++PRIMARY+KEY+%28%22journalId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22keyphrase%22+%28%0A++%22keyphraseId%22+INTEGER+NOT+NULL%2C%0A++%22keyphraseName%22+VARCHAR%2850%29%2C%0A++%22keyphraseInfo%22+VARCHAR%2850%29%5B%5D%2C%0A++PRIMARY+KEY+%28%22keyphraseId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22paper%22+%28%0A++%22paperId%22+INTEGER+NOT+NULL%2C%0A++%22title%22+VARCHAR%28300%29%2C%0A++%22venueId%22+INTEGER%2C%0A++%22year%22+INTEGER%2C%0A++%22numCiting%22+INTEGER%2C%0A++%22numCitedBy%22+INTEGER%2C%0A++%22journalId%22+INTEGER%2C%0A++%22paperInfo%22+UNION%28num+INT%2C+str+VARCHAR%29%2C%0A++PRIMARY+KEY+%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22journalId%22%29+REFERENCES+%22journal%22%28%22journalId%22%29%2C%0A++FOREIGN+KEY%28%22venueId%22%29+REFERENCES+%22venue%22%28%22venueId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22cite%22+%28%0A++%22citingPaperId%22+INTEGER+NOT+NULL%2C%0A++%22citedPaperId%22+INTEGER+NOT+NULL%2C%0A++%22citeInfo%22+INT%5B%5D%2C%0A++PRIMARY+KEY+%28%22citingPaperId%22%2C%22citedPaperId%22%29%2C%0A++FOREIGN+KEY%28%22citedpaperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22citingpaperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22paperDataset%22+%28%0A++%22paperId%22+INTEGER%2C%0A++%22datasetId%22+INTEGER%2C%0A++%22paperDatasetInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22datasetId%22%2C+%22paperId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22paperKeyphrase%22+%28%0A++%22paperId%22+INTEGER%2C%0A++%22keyphraseId%22+INTEGER%2C%0A++%22paperKeyphraseInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22keyphraseId%22%2C%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22paperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22keyphraseId%22%29+REFERENCES+%22keyphrase%22%28%22keyphraseId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22writes%22+%28%0A++%22paperId%22+INTEGER%2C%0A++%22authorId%22+INTEGER%2C%0A++%22writesInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22paperId%22%2C%22authorId%22%29%2C%0A++FOREIGN+KEY%28%22paperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22authorId%22%29+REFERENCES+%22author%22%28%22authorId%22%29%0A%29%3B%27

{"error":"Parameter 'where' contains invalid symbols"}

It's because the content includes some of the forbidden symbols:

https://github.com/huggingface/datasets-server/blob/4dddea2e6a476d52ba5be0c7c64fb8eca9827935/services/search/src/search/routes/filter.py#L53

Do you think it's possible to support the above query? Or should we handle the error on the Hub (not easy to do more than currently)?

severo commented 5 months ago

cc @albertvillanova for viz