mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Invalid object type for field "distribution" #725

Open pdurbin opened 3 months ago

pdurbin commented 3 months ago

Not long after we deployed our Croissant implementation to Harvard Dataverse, we received an email from the Google Search Console Team with the subject 'Invalid object type for field "distribution"':

[Image: Screenshot 2024-08-28 at 3 44 15 PM]

I followed the link...

[Image: Screenshot 2024-08-28 at 3 45 14 PM]

... and tried the first example URL at https://validator.schema.org/#url=https%3A%2F%2Fdataverse.harvard.edu%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.7910%2FDVN%2FRU1X7W

It does show errors around the "distribution" field. Specifically: http://mlcommons.org/croissant/FileObject is not a known valid target type for the distribution property. It looks like this:

[Image: Screenshot 2024-08-28 at 3 46 10 PM]

Did I do something wrong with my Croissant implementation?

Does the Google Search Console not (yet) know about Croissant? 🥐

Thanks!

benjelloun commented 3 months ago

Hi Phil,

Indeed, the Search Console doesn't know about Croissant yet. It only validates markup based on the schema.org vocabulary, which expects distribution to be of type sc:DataDownload. I will get in touch with them to figure out how best to address this issue.
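To see which distribution entries a schema.org-only validator is likely to flag, a check along the following lines may help. This is a minimal sketch, not the Search Console's actual logic: the function name `unrecognized_distributions`, the `ACCEPTED` set, and the sample record are all hypothetical, and it assumes the JSON-LD has already been loaded into a Python dict.

```python
# Spellings of schema.org's DataDownload type that this sketch accepts.
# Assumption: prefixed, bare, and full-IRI forms may all appear in practice.
ACCEPTED = {
    "sc:DataDownload",
    "DataDownload",
    "http://schema.org/DataDownload",
    "https://schema.org/DataDownload",
}


def as_list(value):
    """Normalize a JSON-LD value that may be missing, a scalar, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def unrecognized_distributions(dataset):
    """Return (identifier, types) for entries lacking a DataDownload type."""
    problems = []
    for entry in as_list(dataset.get("distribution")):
        if not isinstance(entry, dict):
            continue  # plain references carry no @type to inspect
        types = as_list(entry.get("@type"))
        if not ACCEPTED.intersection(types):
            problems.append((entry.get("@id") or entry.get("name"), types))
    return problems


# Hypothetical record: one Croissant-only entry, one plain schema.org entry.
example = {
    "@type": "sc:Dataset",
    "distribution": [
        {"@type": "cr:FileObject", "@id": "data.csv"},
        {"@type": "sc:DataDownload", "@id": "readme.txt"},
    ],
}

print(unrecognized_distributions(example))
```

Run against the sample record, only the `cr:FileObject` entry is reported, which mirrors the kind of error shown by the schema.org validator above.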

Best, Omar

pbuttigieg commented 1 week ago

Thanks, this validation error was noticed here as well: https://github.com/iodepo/odis-arch/issues/481#issuecomment-2482351591

Here's the proposed fix for distribution.

Type arrays may resolve other such issues. The overarching issue is that Croissant should make sure not to rough up existing types and their required properties with its (very useful) extended properties.
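As an illustration of what a type array could look like for distribution, the sketch below appends sc:DataDownload to any entry typed cr:FileObject. It is not the proposed fix referenced above, which is not reproduced in this thread. The function name `with_type_array` and the sample record are hypothetical, and whether Croissant tooling and the Search Console both accept the resulting markup is an assumption that would need to be verified.

```python
import copy


def with_type_array(dataset, croissant_type="cr:FileObject",
                    schema_type="sc:DataDownload"):
    """Return a copy where matching distribution entries carry both types."""
    result = copy.deepcopy(dataset)  # leave the caller's record untouched
    distribution = result.get("distribution", [])
    if isinstance(distribution, dict):
        distribution = [distribution]  # a single entry may not be wrapped
    for entry in distribution:
        if not isinstance(entry, dict):
            continue
        raw = entry.get("@type")
        types = raw if isinstance(raw, list) else [raw]
        # Only touch entries that have the Croissant type but lack the
        # schema.org type; everything else is left exactly as it was.
        if croissant_type in types and schema_type not in types:
            entry["@type"] = types + [schema_type]
    return result


# Hypothetical record with a single Croissant-typed file.
record = {
    "@type": "sc:Dataset",
    "distribution": [{"@type": "cr:FileObject", "@id": "data.csv"}],
}

fixed = with_type_array(record)
print(fixed["distribution"][0]["@type"])
```

Keeping the Croissant type first and adding the schema.org type alongside it preserves the extended properties while giving schema.org-only consumers a type they recognize.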