mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[NeurIPS] Field has no source #686

Closed. gorovuha closed this issue 5 months ago

gorovuha commented 5 months ago

I am uploading my dataset as a .csv file through the HuggingFace editor, but I keep getting this error: [Metadata(None) > RecordSet(***) > Field(***)] Node "***" is a field and has no source. Please, use http://mlcommons.org/croissant/source to specify the source. The link in the error message is broken. In RecordSets, while editing the field details, I set Data source -> my_file.csv, Extract -> column, Column name -> the name of the field. How can I fix it?
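To narrow down which field triggers this error, a small check over the exported metadata.json can help. The sketch below assumes the Croissant 1.0 JSON-LD layout, where each entry under "recordSet" has a "field" list and a field's source looks like {"fileObject": {"@id": ...}, "extract": {"column": ...}}. The function name fields_missing_source and the example metadata are hypothetical, and nested sub-fields are not inspected.

```python
def fields_missing_source(metadata: dict) -> list:
    """Return 'record_set/field' labels for top-level fields without a 'source'."""
    missing = []
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            # An absent or empty "source" is what the validator complains about.
            if not field.get("source"):
                missing.append(f"{record_set.get('name')}/{field.get('name')}")
    return missing


# Hypothetical metadata: "text" has a source, "label" does not.
example = {
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "default",
            "field": [
                {
                    "@type": "cr:Field",
                    "name": "text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "my_file.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {"@type": "cr:Field", "name": "label", "dataType": "sc:Text"},
            ],
        }
    ]
}

print(fields_missing_source(example))
```

For a real file, the dictionary would come from json.load on the exported metadata.json instead of the inline example.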

hasakiXie123 commented 5 months ago

I've run into the same problem.

msorkhpar commented 5 months ago

I gave up using their UI and started doing it manually.

gorovuha commented 5 months ago

I tried the Croissant builder API: https://huggingface.co/docs/datasets-server/en/croissant It works, although the RecordSet field in the result is empty (which was also a problem in the UI editor). I'm not sure it is the correct way, but it is a step toward a solution. I'm also wondering whether I have to create one Croissant file for my four CSV files, or whether four Croissant files would be OK 🤕

msorkhpar commented 5 months ago

https://gist.github.com/msorkhpar/95e366348287812cb0ff6e2249c8c146

gorovuha commented 5 months ago

@msorkhpar Thank you very much!!!! 😇

varungupta31 commented 4 months ago

@gorovuha @msorkhpar I have my dataset in a nested JSON format.

{ 'data1': {'tag1': something, 'tag2': something} .... }

Can you please help with how I should proceed to create a Croissant file for this?

Also, I have multiple such JSON files. Do I create a Croissant file for each?

I tried the HF editor but am stuck at

[Metadata(*****) > RecordSet(data.json_record_set) > Field(v_--ifbq)] Node "v_--ifbq" is a field and has no source. Please, use http://mlcommons.org/croissant/source to specify the source.

And the link leads nowhere.

Please help me figure this out, thanks :(

gorovuha commented 4 months ago

@varungupta31 I'm not fluent with JSON, but there is an example of creating Fields for JSON files: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/recipes/introduction.ipynb

There is a solution for creating Croissant metadata for several files in one zip archive posted on GitHub: https://gist.github.com/msorkhpar/95e366348287812cb0ff6e2249c8c146 (I had to change some lines, such as the content URL for each file, names, etc.). Then I loaded metadata.json into the online editor and corrected some fields, making sure my metadata.json file allows the dataset to be downloaded. I made only one file for multiple CSV files.

I hope a combination of these examples might be useful.
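For the nested JSON layout asked about above, one possible workaround (not something the Croissant tooling requires) is to flatten the file into tabular rows first, so that each field can be mapped to a plain column. The sketch below uses only the standard library; the "id" column name, the flatten helper, and the sample values are all hypothetical.

```python
import csv
import io


def flatten(nested: dict) -> list:
    """Turn {'data1': {'tag1': ..., 'tag2': ...}, ...} into a list of flat rows."""
    rows = []
    for key, tags in nested.items():
        row = {"id": key}  # keep the outer key as its own column
        row.update(tags)
        rows.append(row)
    return rows


# Hypothetical data shaped like the example in the question.
nested = {
    "data1": {"tag1": "a", "tag2": "b"},
    "data2": {"tag1": "c", "tag2": "d"},
}
rows = flatten(nested)

# Write the rows as CSV text; swap the buffer for a real file to save it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "tag1", "tag2"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting columns ("id", "tag1", "tag2") can then be referenced by column name in each field's source, the same way as for any other CSV file.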

msorkhpar commented 4 months ago

I am not an expert here, but I can tell you what I did. First, for each of my datasets I created a metadata.json file using the samples in their repository and the main documentation page. Then I tried to combine the different variants of my dataset in one file, but ended up creating a separate file for each. Finally, I used a code snippet to generate the JSON files dynamically. Whether to make one JSON file or several is up to you; it depends on how you would like your users to work with your dataset. If someone should be able to work with all the files at once and the files are related to the same task, then you need to add multiple record sets, each with multiple fields, to your metadata. Here is a sample from the README.md:

import mlcroissant as mlc

# Load the Croissant metadata for the GPT-3 example dataset.
ds = mlc.Dataset("https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/1.0/gpt-3/metadata.json")
metadata = ds.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")
# Iterate over the records of the "default" record set.
for x in ds.records(record_set="default"):
    print(x)
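To illustrate the "one file, multiple record sets" option, here is a rough sketch that builds a single metadata dictionary with one FileObject and one RecordSet per CSV file. It assumes the Croissant 1.0 JSON-LD key names ("distribution", "recordSet", "field", "source"); the build_metadata helper, the file names, and the column names are hypothetical. It leaves out "@context", checksums, and other required properties, so the output is a skeleton to complete, not a valid Croissant file as-is.

```python
import json


def build_metadata(name: str, csv_columns: dict) -> dict:
    """Build one metadata skeleton covering several CSV files.

    csv_columns maps each CSV file name to its list of column names.
    """
    distribution = []
    record_sets = []
    for file_name, columns in csv_columns.items():
        distribution.append({
            "@type": "cr:FileObject",
            "@id": file_name,
            "name": file_name,
            "contentUrl": file_name,  # placeholder: replace with the real URL
            "encodingFormat": "text/csv",
        })
        stem = file_name.rsplit(".", 1)[0]
        record_sets.append({
            "@type": "cr:RecordSet",
            "name": stem,
            "field": [
                {
                    "@type": "cr:Field",
                    "name": f"{stem}/{column}",
                    "dataType": "sc:Text",  # assumption: every column is text
                    # Every field points back to its file and column.
                    "source": {
                        "fileObject": {"@id": file_name},
                        "extract": {"column": column},
                    },
                }
                for column in columns
            ],
        })
    return {
        "@type": "sc:Dataset",
        "name": name,
        "distribution": distribution,
        "recordSet": record_sets,
    }


# Hypothetical dataset with two CSV files.
metadata = build_metadata(
    "my_dataset",
    {"train.csv": ["text", "label"], "test.csv": ["text"]},
)
print(json.dumps(metadata, indent=2))
```

Each record set here can be requested by name when loading, so users who only need one file are not forced to read the others.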