pdurbin opened this issue 9 months ago
Hi @pdurbin, thanks for creating this issue! The best way to crawl Hugging Face was to list all datasets and then visit each individual URL using their API.
https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP is perfect. How can I derive this URL for all other datasets on Dataverse? Do you have an API? Or should we crawl a website?
@marcenacp hi! We list all the datasets in a sitemap, so https://dataverse.harvard.edu/sitemap.xml contains this, for example:
<loc>https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP</loc>
Would that be enough? From there you could extract the DOI/Persistent ID and put it into a (future) URL to download metadata in Croissant format. (I used exporter=schema.org in the example above, but we would create a new exporter for Croissant.)
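For what it's worth, the sitemap-plus-exporter flow described here can be sketched in a few lines of Python. This is only an illustration, not crawler code: it parses a Dataverse sitemap and rewrites each <loc> entry into an export-API URL for whatever exporter name is eventually chosen (the "croissant" exporter is still hypothetical at this point; "schema.org" exists today).

```python
# Illustrative sketch: turn Dataverse sitemap entries into export-API URLs.
import urllib.parse
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def export_urls_from_sitemap(xml_text, exporter="schema.org"):
    """Yield an export-API URL for every dataset <loc> in a sitemap document."""
    root = ET.fromstring(xml_text)
    for loc in root.iter(SITEMAP_NS + "loc"):
        query = urllib.parse.urlparse(loc.text).query
        pid = urllib.parse.parse_qs(query).get("persistentId")
        if not pid:
            continue  # not a dataset landing page (e.g. a collection URL)
        base = loc.text.split("/dataset.xhtml")[0]
        params = urllib.parse.urlencode({"exporter": exporter, "persistentId": pid[0]})
        yield base + "/api/datasets/export?" + params
```

Fetching https://dataverse.harvard.edu/sitemap.xml and feeding the response body to this function would yield URLs like the schema.org example above, one per dataset.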
Hi Phil,
It would be great to also embed the Croissant metadata inside the dataset web pages (as an extension of the schema.org metadata you already have) so that it's crawlable by search engines, and to provide a download link / button for users.
@marcenacp For Croissant Health we should ideally favor the Croissant metadata directly present in web pages rather than the one available via an API call. As we realized with Hugging Face, there can be slight discrepancies between the two.
Best, Omar
@benjelloun thanks. Right, the schema.org metadata inside the <head> of Dataverse web pages (what we'd call "dataset landing pages") was added to Dataverse in 2017 to support Google Dataset Search, which was new at the time.
I've been thinking that I'd leave that format alone (which we call schema.org internally and "Schema.org JSON-LD" in our web interface) because I wouldn't want to break anything related to Google Dataset Search.
However, I know the Croissant team is collaborating with the Google Dataset Search team. If you're saying it's safe to modify the older format, extend it as you say, without breaking Google Dataset Search, that's great.
From a product perspective, I'm wondering if it would make sense to rebrand this format within Dataverse. As shown in the screenshot below, we call the older format "Schema.org JSON-LD" but perhaps we should simply change the label to "Croissant" if we proceed with a single, updated, extended format.
Alternatively, we could add Croissant as a new format (with an internal name of croissant, probably), swap it into the <head> (I assume we wouldn't want both formats in the <head>!), and still offer the older format for download via API or a button click.
We'll probably know more once we start hacking and see that Croissant is truly an extension, with no backward-incompatible changes from the older format.
Perhaps Signposting would be a way to expose the URL for the desired format in a standardized way. Dataverse already supports Signposting, and we could presumably add new metadata formats to what is listed for level 2 metadata formats.
@qqmyers hmm, good point. Signposting does provide a machine-readable way (HTTP headers) to retrieve URLs with more information.
@benjelloun @marcenacp are you familiar with Signposting? What do you think?
To use my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP as an example, the "link" header looks like this:
link: <0000-0002-9528-9470>;rel="author", https://doi.org/10.7910/DVN/TJCLKP;rel="cite-as", https://dataverse.harvard.edu/api/access/datafile/6867328;rel="item";type="application/zip",https://dataverse.harvard.edu/api/access/datafile/6867331;rel="item";type="text/tab-separated-values",https://dataverse.harvard.edu/api/access/datafile/6867336;rel="item";type="text/x-python-script", https://doi.org/10.7910/DVN/TJCLKP;rel="describedby";type="application/vnd.citationstyles.csl+json",https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP;rel="describedby";type="application/ld+json", https://schema.org/AboutPage;rel="type",https://schema.org/Dataset;rel="type", http://creativecommons.org/publicdomain/zero/1.0;rel="license", https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/linkset?persistentId=doi:10.7910/DVN/TJCLKP ; rel="linkset";type="application/linkset+json"
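As a rough illustration of how a crawler could consume this header, the snippet below pulls out the describedby JSON-LD targets. It is deliberately naive (not a full RFC 8288 parser): it assumes target URLs contain no commas or semicolons, which holds for the header shown above.

```python
# Naive Link-header split: good enough for the Signposting header shown above,
# not a general-purpose RFC 8288 parser.
def describedby_jsonld(link_header):
    """Return the targets with rel="describedby" and type="application/ld+json"."""
    matches = []
    for entry in link_header.split(","):
        parts = [p.strip() for p in entry.split(";")]
        target = parts[0].strip("<> ")
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        if (params.get("rel", "").strip('"') == "describedby"
                and params.get("type", "").strip('"') == "application/ld+json"):
            matches.append(target)
    return matches
```

Run against the header above, this would surface the schema.org (and, eventually, Croissant) export URL without any HTML parsing at all.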
Signposting looks interesting! At the moment, we do have restrictions on the crawling side that prevent us from reading metadata that is not directly embedded in the Webpage, so we can't rely on this approach in the near future unfortunately.
Hi Phil,
Please see replies inline.
Re: the Croissant team collaborating with the Google Dataset Search team: I am also a member of the Google Dataset Search team :) I would definitely encourage you to modify the older format. That should not break Dataset Search. In case there are any unforeseen issues, we will fix them with high priority.
Re: rebranding the format within Dataverse: I think that would be a fine change. I don't think "Schema.org JSON-LD" as an export format is very useful, because that format is primarily targeted at search engines, while Croissant is meant to be used as a working format for ML datasets.
Re: adding Croissant as a new format and swapping it into the <head>: You definitely want the Croissant description (which extends the schema.org/Dataset one) to be embedded in the dataset page, so that it can be picked up by search engines like Dataset Search.
Re: confirming that Croissant is truly an extension once you start hacking: Sounds good! Please reach out if you have any questions. This makes me think that we should probably write a short migration guide for dataset authors / platforms that already support schema.org/Dataset and would like to upgrade to Croissant.
Best, Omar
Hi @benjelloun, Signposting is the brainchild of Herbert van de Sompel; you can find the full spec here: https://signposting.org/FAIR/ We at DANS contributed the initial version to Dataverse; it was later extended and leveraged by Harvard and GDCC. I can bring Herbert here if you're interested. :)
@benjelloun @goeffthomas et al. thanks for the opportunity today to present to the Croissant Task Force the progress @4tikhonov and I have made toward supporting Croissant in Dataverse. Here are the talking points I was reading from as well as Slava's slides. (We also shared them with the Dataverse Google Group.)
As you suggested, I'll continue creating GitHub issues. And I'll come when I can to task force meetings. Thanks again!
@marcenacp you seem to have written most of the crawler. I got a weird error when I got to the scrapydweb step: err.txt. Any advice? Thanks!
Oh, I did uncomment this early return because I don't want to crawl all of Hugging Face, if that matters:
# Uncomment this early return for debugging purposes:
return [
    "lkarjun/Malayalam-Artiicles",
]
(I think the README might need to be updated, by the way. start_requests was removed from huggingface.py in 5264dcf.)
I opened a dedicated issue about scrapydweb:
@pdurbin, just saw this comment. I answered on https://github.com/mlcommons/croissant/issues/647. Thanks!
Good news! We just enabled Croissant on Harvard Dataverse. I wrote up a little announcement.
From my perspective, this issue (#530) is now unblocked in that somebody could probably fork this repo and start trying to add a dataverse.py script to health/crawler/spiders. But who should do that? Last time I tried, I had trouble (@marcenacp thanks for replying on #647).
The main thing to keep in mind is that there are 120+ installations of Dataverse out there (https://dataverse.org/installations). dataverse.py should probably:
- check the list of Dataverse installations in JSON: https://iqss.github.io/dataverse-installations/data/data.json
- for each installation:
  - check the sitemap (e.g. https://dataverse.harvard.edu/sitemap.xml if it uses a single sitemap, or /sitemap/sitemap_index.xml if it uses multiple sitemaps; see https://guides.dataverse.org/en/6.3/installation/config.html#single-sitemap-file and https://guides.dataverse.org/en/6.3/installation/config.html#multiple-sitemap-files-sitemap-index-file)
  - pick the first dataset and try to download the croissant export format (e.g. https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP) to know if Croissant is enabled or not. Alternatively, you could check the <head> tag for Croissant metadata.
  - for the Dataverse installations where Croissant is enabled, do whatever health/crawler/spiders/huggingface.py or health/crawler/spiders/openml.py does to add those systems to Croissant Online Health. 😄
Great news! Thank you Phil for pushing this through. I'm very excited about the wealth of datasets that will become available in Croissant thanks to Dataverse.
Best, Omar
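The probing step above can be sketched in plain Python. A real spider would live in health/crawler/spiders and use Scrapy like huggingface.py does; the helper names below are invented for illustration, and the "Croissant-flavored JSON-LD" check is only a loose heuristic.

```python
# Illustrative helpers for the per-installation Croissant probe described above.
import json
import urllib.request

# List of known Dataverse installations, as mentioned above.
INSTALLATIONS_JSON = "https://iqss.github.io/dataverse-installations/data/data.json"

def croissant_export_url(hostname, persistent_id):
    """Build the export-API URL used to probe whether Croissant is enabled."""
    return ("https://" + hostname + "/api/datasets/export"
            "?exporter=croissant&persistentId=" + persistent_id)

def croissant_enabled(hostname, persistent_id):
    """Probe one installation: does the croissant exporter answer with JSON-LD?"""
    try:
        url = croissant_export_url(hostname, persistent_id)
        with urllib.request.urlopen(url, timeout=30) as resp:
            doc = json.load(resp)
    except Exception:
        return False  # no exporter, network error, or non-JSON response
    # Loose heuristic: Croissant responses are JSON-LD objects with a context.
    return isinstance(doc, dict) and "@context" in doc
```

The remaining glue (iterate over the installations JSON, read each sitemap, pick the first dataset, then call croissant_enabled) follows the checklist above.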
Nice work @pdurbin! BTW, as a little prework on the Kaggle <> Dataverse integration we've been looking at, I wrote a little notebook to figure out what version every installation is running: https://www.kaggle.com/code/goefft/check-dataverse-installation-versions
I just made it public in case it's of value to anyone who's looking into what you've suggested above. Basically, you can use this to target installations that are running >= 6.3 and they should have the exporter, right?
@benjelloun me too!
@goeffthomas nice notebook! You're right, I forgot to mention versions. Definitely any version lower than 6.0 should be excluded because the external exporter mechanism I'm using is not supported.
Dataverse 6.2 and higher puts the Croissant metadata in <head> (https://github.com/IQSS/dataverse/pull/10382). But 6.0 and 6.1 both support the exporter. That is, URLs like https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP should work even on 6.0 and 6.1 if the croissant exporter is enabled.
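Putting those version rules together, the gating could look like the sketch below (assuming version strings of the usual "6.3" / "6.1.1" shape; anything messier would need extra cleanup first):

```python
# Version gating per the notes above: <6.0 excluded, 6.0/6.1 exporter-only
# (when the croissant exporter is enabled), 6.2+ also embeds it in <head>.
def croissant_support(version):
    major, minor = (int(x) for x in version.split(".")[:2])
    if (major, minor) < (6, 0):
        return "unsupported"
    if (major, minor) < (6, 2):
        return "exporter-only"
    return "exporter-and-head"
```

So a crawler filtering on >= 6.0 (not >= 6.3) would catch every installation that could have the exporter, at the cost of probing some that don't have it enabled.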
Hi! I'm interested in some details about how to add Dataverse to Croissant 🥐 Online Health.
In health/crawler/spiders/huggingface.py I'm seeing an example for Hugging Face like this:
https://datasets-server.huggingface.co/croissant?dataset=mnist
Is this roughly what you need from us? Can the URL be different? A URL like the following would fit better into our existing pattern:
https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP
Thanks!