huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Create a special column type when it contains PDF bytes or PDF URL #2991

Open severo opened 1 month ago

severo commented 1 month ago

In that case, we would generate an image (thumbnail of the first page), stored as an asset, to populate /first-rows and /rows and display in the dataset viewer.

asked internally on Slack: https://huggingface.slack.com/archives/C064HCHEJ2H/p1721215883166569 cc @Pleias

lhoestq commented 1 month ago

The priority is to have the PDF type detection and thumbnail IMO.

One way to tackle this is to add the PDF type detection in datasets for the bytes case. This way it will be easy to:

Then for the URL case we can extend the image URL detection step in the viewer, but I'm not sure if it's possible to render a thumbnail of a PDF in JS from a URL?
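To make the two detection cases concrete, here is a minimal sketch using only the Python standard library. It is not the `datasets` or dataset-viewer implementation: the function names (`looks_like_pdf_bytes`, `looks_like_pdf_url`, `column_is_pdf`) and the `threshold` parameter are hypothetical. It assumes the bytes case can be recognized from the `%PDF-` header at the start of the file, and that the URL case can be approximated by an http(s) URL whose path ends in `.pdf`.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse

# PDF files begin with a header of the form "%PDF-1.x".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf_bytes(value: object) -> bool:
    """Return True if value is bytes-like and starts with the PDF header."""
    if not isinstance(value, (bytes, bytearray)):
        return False
    return bytes(value[: len(PDF_MAGIC)]) == PDF_MAGIC


def looks_like_pdf_url(value: object) -> bool:
    """Return True if value is an http(s) URL whose path ends in .pdf.

    Assumption: the extension is a good enough hint; URLs that serve a PDF
    without a .pdf suffix are not detected by this heuristic.
    """
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and parsed.path.lower().endswith(".pdf")


def column_is_pdf(
    values: Iterable[object],
    check: Callable[[object], bool],
    threshold: float = 1.0,
) -> bool:
    """Decide whether a column sample is a PDF column.

    Null values are ignored; the column qualifies when the share of
    non-null values passing `check` reaches `threshold` (hypothetical knob).
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    matches = sum(1 for v in non_null if check(v))
    return matches / len(non_null) >= threshold
```

For example, `column_is_pdf(sample, looks_like_pdf_bytes)` would flag a binary column whose sampled non-null values all start with the PDF header, while `column_is_pdf(sample, looks_like_pdf_url)` covers string columns of links. Checking the path rather than the whole URL keeps query strings such as `?download=1` from hiding the extension.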

severo commented 1 month ago

> I'm not sure if it's possible to render a thumbnail of a PDF in JS from a URL

good point, we can't do the same here.

severo commented 1 month ago

I opened https://github.com/huggingface/datasets/issues/7058