mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

"images/filename" should have an attribute "@type": "https://schema.org/Text". Got http://mlcommons.org/croissant/Field instead. #651

Open venkanna37 opened 3 months ago

venkanna37 commented 3 months ago

Hi, I recently started using Croissant to create a dataset for semantic segmentation. The dataset has images and labels, both in *.tif format. Writing and validating the JSON-LD Croissant file programmatically produced no errors, but loading the dataset fails with this error every time. I assumed it was a silly mistake on my side, but I could not fix it. My code is based on the Introduction notebook. I am attaching my code, the JSON-LD file, and the complete error. Can you please help me fix this? Thank you.

Code:

import mlcroissant as mlc
import json

# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="test1.zip",
        name="test1.zip",
        description="A ZIP archive containing the dataset.",
        content_url="https://sid.erda.dk/share_redirect/dggmWwOQn4/test1.zip",
        encoding_format="application/zip",
        sha256="b57b47d174b331ca6fc1376abc7f95985d35978f66eb5183b4ac3ed5efc6e9f5",
    ),
    mlc.FileSet(
        id="source-images",
        name="source-images",
        description="RGB Images of dataset",
        contained_in=["test1.zip"],
        encoding_format="image/tiff",
        includes="train/images/*.tif",
    ),
    mlc.FileSet(
        id="label-images",
        name="label-images",
        description="RGB Images of dataset",
        contained_in=["test1.zip"],
        encoding_format="image/tiff",
        includes="train/labels/*.tif",
    )
]

# Define the record set for images and labels.
record_set = [
    mlc.RecordSet(
        id="images",
        key={
            "@id": "images/filename"
        },  # Using image filename as a unique key.
        name="images",
        fields=[
            mlc.Field(
                id="images/filename",
                name="images/filename",
                data_types=[mlc.DataType.TEXT],
                source=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename")
                )
            ),
            mlc.Field(
                id="images/image_content",
                name="image_content",
                description="Image content.",
                data_types=[mlc.DataType.IMAGE_OBJECT],
                source=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="content")
                )
            ),
            mlc.Field(
                id="images/label",
                name="Image Label",
                description="Semantic segmentation label image.",
                data_types=[mlc.DataType.IMAGE_OBJECT],
                source=mlc.Source(
                    file_set="label-images",
                    extract=mlc.Extract(file_property="content")
                )
            )
        ]
    )
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="semantic-segmentation",
    description="Test dataset for croissant format.",
    distribution=distribution,
    record_sets=record_set,
    version="1.0.0",
    license="CC BY-SA 4.0",
)
print(metadata.issues.report())

with open("croissant.json", "w") as f:
    content = metadata.to_json()
    content = json.dumps(content, indent=2)
    print(content)
    f.write(content)
    f.write("\n")

dataset = mlc.Dataset(jsonld="croissant.json")

records = dataset.records(record_set="images")

for i, record in enumerate(records):
    print(record)
    if i >= 10:
        break

JSON-LD file:

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:regex",
    "repeated": "cr:repeated",
    "replace": "cr:replace",
    "sc": "https://schema.org/",
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform"
  },
  "@type": "sc:Dataset",
  "name": "semantic-segmentation",
  "description": "Test dataset for croissant format.",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "license": "CC BY-SA 4.0",
  "version": "1.0.0",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "test1.zip",
      "name": "test1.zip",
      "description": "A ZIP archive containing the dataset.",
      "contentUrl": "https://sid.erda.dk/share_redirect/dggmWwOQn4/test1.zip",
      "encodingFormat": "application/zip",
      "sha256": "b57b47d174b331ca6fc1376abc7f95985d35978f66eb5183b4ac3ed5efc6e9f5"
    },
    {
      "@type": "cr:FileSet",
      "@id": "source-images",
      "name": "source-images",
      "description": "RGB Images of dataset",
      "containedIn": {
        "@id": "test1.zip"
      },
      "encodingFormat": "image/tiff",
      "includes": "train/images/*.tif"
    },
    {
      "@type": "cr:FileSet",
      "@id": "label-images",
      "name": "label-images",
      "description": "RGB Images of dataset",
      "containedIn": {
        "@id": "test1.zip"
      },
      "encodingFormat": "image/tiff",
      "includes": "train/labels/*.tif"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "images",
      "name": "images",
      "key": {
        "@id": "images/filename"
      },
      "field": [
        {
          "@type": "cr:Field",
          "@id": "images/filename",
          "name": "images/filename",
          "dataType": "sc:Text",
          "source": {
            "fileSet": {
              "@id": "source-images"
            },
            "extract": {
              "fileProperty": "filename"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "images/image_content",
          "name": "image_content",
          "description": "Image content.",
          "dataType": "sc:ImageObject",
          "source": {
            "fileSet": {
              "@id": "source-images"
            },
            "extract": {
              "fileProperty": "content"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "images/label",
          "name": "Image Label",
          "description": "Semantic segmentation label image.",
          "dataType": "sc:ImageObject",
          "source": {
            "fileSet": {
              "@id": "label-images"
            },
            "extract": {
              "fileProperty": "content"
            }
          }
        }
      ]
    }
  ]
}

Error:

Traceback (most recent call last):
  File "/home/venky/Documents/projects/roof_detection/Croissant/nacala_data.py", line 92, in <module>
    dataset = mlc.Dataset(jsonld="croissant.json")
  File "<string>", line 6, in __init__
  File "/home/venky/anaconda3/envs/building_task/lib/python3.10/site-packages/mlcroissant/_src/datasets.py", line 70, in __post_init__
    self.metadata = Metadata.from_file(ctx=ctx, file=self.jsonld)
  File "/home/venky/anaconda3/envs/building_task/lib/python3.10/site-packages/mlcroissant/_src/structure_graph/nodes/metadata.py", line 429, in from_file
    return cls.from_json(ctx=ctx, json_=json_)
  File "/home/venky/anaconda3/envs/building_task/lib/python3.10/site-packages/mlcroissant/_src/structure_graph/nodes/metadata.py", line 440, in from_json
    return cls.from_jsonld(ctx=ctx, jsonld=jsonld)
  File "/home/venky/anaconda3/envs/building_task/lib/python3.10/site-packages/mlcroissant/_src/structure_graph/base_node.py", line 392, in from_jsonld
    return cls(
  File "<string>", line 45, in __init__
  File "/home/venky/anaconda3/envs/building_task/lib/python3.10/site-packages/mlcroissant/_src/structure_graph/nodes/metadata.py", line 331, in __post_init__
    raise ValidationError(node.ctx.issues.report())
mlcroissant._src.core.issues.ValidationError: Found the following 1 error(s) during the validation:
  -  "images/filename" should have an attribute "@type": "https://schema.org/Text". Got http://mlcommons.org/croissant/Field instead.

VladimirAlexiev commented 3 months ago

Do not use schema:Text. Use xsd:string

marcenacp commented 3 months ago

@venkanna37 I think you just found a bug in mlcroissant! Thanks for taking the time to report it to us.

I suspect the key parameter is the problem because we haven't implemented it in mlcroissant yet. I just opened https://github.com/mlcommons/croissant/issues/655 to track this feature.
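
To test that suspicion without touching the generating script, one option is to drop the "key" entry from the written JSON-LD before loading it. The following is a minimal sketch using only the standard library; the helper name drop_record_set_keys and the small example dict are hypothetical stand-ins, shaped like the JSON-LD posted above, and are not part of mlcroissant.

```python
import copy
import json


def drop_record_set_keys(croissant: dict) -> dict:
    """Return a copy of a Croissant JSON-LD dict with every RecordSet "key" removed."""
    cleaned = copy.deepcopy(croissant)
    for record_set in cleaned.get("recordSet", []):
        # "key" is the entry suspected above; remove it if it is present.
        record_set.pop("key", None)
    return cleaned


# Hypothetical stand-in for the real file, shaped like the JSON-LD in this issue.
example = {
    "@type": "sc:Dataset",
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "images",
            "key": {"@id": "images/filename"},
            "field": [{"@type": "cr:Field", "@id": "images/filename"}],
        }
    ],
}

cleaned = drop_record_set_keys(example)
print(json.dumps(cleaned["recordSet"][0], indent=2))
```

For the real file, the same function could be applied to the result of json.load on croissant.json, with the output written to a new file that is then passed to mlc.Dataset.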

When I remove the key parameter in Python, I get a different error, because your record set expects a join that is never declared explicitly. You can use references to declare the join (see this example).
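
Independently of mlcroissant, the join in question amounts to pairing each image with the label that shares its base name. Here is a plain-Python sketch of that idea, not of the mlcroissant API. It assumes that images and labels use the same file name under train/images and train/labels, which this thread does not confirm; the function names and the file paths are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Optional, Tuple

# Raw string, so that "\." reaches the regex engine unchanged.
STEM = re.compile(r"^(.*)\.tif$")


def stem(path: str) -> Optional[str]:
    """Return the file name without its .tif suffix, or None for other files."""
    match = STEM.match(PurePosixPath(path).name)
    return match.group(1) if match else None


def join_by_stem(
    image_paths: Iterable[str], label_paths: Iterable[str]
) -> List[Tuple[str, str]]:
    """Inner join: keep only images that have a label with the same stem."""
    labels: Dict[str, str] = {}
    for path in label_paths:
        key = stem(path)
        if key is not None:
            labels[key] = path
    pairs = []
    for path in image_paths:
        key = stem(path)
        if key is not None and key in labels:
            pairs.append((path, labels[key]))
    return pairs


# Hypothetical archive members.
images = ["train/images/a.tif", "train/images/b.tif", "train/images/c.tif"]
labels = ["train/labels/b.tif", "train/labels/a.tif", "train/labels/readme.txt"]
pairs = join_by_stem(images, labels)
print(pairs)
```

In this sketch an image without a matching label (c.tif here) is dropped rather than paired with a missing value; whether a Croissant join behaves the same way is not something this example establishes.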

venkanna37 commented 3 months ago

Thank you, @marcenacp. I added the key parameter after seeing an example in the documentation. Thank you for sharing an example of a join. I tried joining the input and label images, but I could not get it to work. Can you please have a look at my code?

import mlcroissant as mlc
import json

# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="test1.zip",
        name="test1.zip",
        description="A ZIP archive containing the dataset.",
        content_url="https://sid.erda.dk/share_redirect/dggmWwOQn4/test1.zip",
        encoding_format="application/zip",
        sha256="b57b47d174b331ca6fc1376abc7f95985d35978f66eb5183b4ac3ed5efc6e9f5",
    ),
    mlc.FileSet(
        id="source-images",
        name="source-images",
        description="RGB Images of dataset",
        contained_in=["test1.zip"],
        encoding_format="image/tiff",
        includes="train/images/*.tif",
    ),
    mlc.FileSet(
        id="label-images",
        name="label-images",
        description="RGB Images of dataset",
        contained_in=["test1.zip"],
        encoding_format="image/tiff",
        includes="train/labels/*.tif",
    )
]

# Define the record set for images and labels.
record_set = [
    mlc.RecordSet(
        id="images",
        name="images",
        fields=[
            mlc.Field(
                id="images/filename",
                name="images/filename",
                data_types=[mlc.DataType.TEXT],
                source=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename"),
                    transforms=[mlc.Transform(regex=r"^(.*)\.tif$")]
                )

            ),
            mlc.Field(
                id="images/image_content",
                name="image_content",
                description="Image content.",
                data_types=[mlc.DataType.IMAGE_OBJECT],
                source=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="content")
                ),
                references=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename"),
                    transforms=[mlc.Transform(regex=r"^(.*)\.tif$")])
            ),
            mlc.Field(
                id="images/label",
                name="Image Label",
                description="Semantic segmentation label image.",
                data_types=[mlc.DataType.IMAGE_OBJECT],
                source=mlc.Source(
                    file_set="label-images",
                    extract=mlc.Extract(file_property="content")
                ),
                references=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename"),
                    transforms=[mlc.Transform(regex=r"^(.*)\.tif$")])
            )
        ]
    )
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="semantic-segmentation",
    description="Test dataset for croissant format.",
    distribution=distribution,
    record_sets=record_set,
    version="1.0.0",
    license="CC BY-SA 4.0",
)

print(metadata.issues.report())

with open("croissant2.json", "w") as f:
    content = metadata.to_json()
    content = json.dumps(content, indent=2)
    print(content)
    f.write(content)
    f.write("\n")

dataset = mlc.Dataset(jsonld="croissant2.json", debug=True)

# Iterate over the records and print them
try:
    records = dataset.records(record_set="images")
    for i, record in enumerate(records):
        print(record)
        if i >= 10:
            break
except RuntimeError as e:
    print(f"RuntimeError: {e}")
except ValueError as e:
    print(f"ValueError: {e}")

Here is the error:

Traceback (most recent call last):
  File "/home/venky/Documents/projects/roof_detection/Croissant/nacala_data.py", line 105, in <module>
    for i, record in enumerate(records):
  File "/home/venky/anaconda3/envs/crois/lib/python3.12/site-packages/mlcroissant/_src/datasets.py", line 139, in __iter__
    yield from execute_operations_sequentially(
  File "/home/venky/anaconda3/envs/crois/lib/python3.12/site-packages/mlcroissant/_src/operation_graph/execute.py", line 72, in execute_operations_sequentially
    raise GenerationError(
mlcroissant._src.core.issues.GenerationError: An error occured during the sequential generation of the dataset, more specifically during the operation FilterFiles(label-images)
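
The traceback does not show why FilterFiles(label-images) failed. One thing that can be checked outside mlcroissant is whether each "includes" pattern matches any member of the archive. The sketch below uses only the standard library and builds a small in-memory ZIP with hypothetical contents; that mlcroissant matches patterns the same way as fnmatch is an assumption, not something confirmed in this thread.

```python
import fnmatch
import io
import zipfile
from typing import Dict, List


def members_matching(archive: zipfile.ZipFile, pattern: str) -> List[str]:
    """List archive members whose full path matches a glob-style pattern."""
    return [
        name for name in archive.namelist() if fnmatch.fnmatchcase(name, pattern)
    ]


# Hypothetical archive standing in for test1.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("train/images/tile_0.tif", b"")
    zf.writestr("train/labels/tile_0.tif", b"")

results: Dict[str, List[str]] = {}
with zipfile.ZipFile(buffer) as zf:
    for pattern in ("train/images/*.tif", "train/labels/*.tif", "labels/*.tif"):
        results[pattern] = members_matching(zf, pattern)
        print(pattern, results[pattern])
```

Running the same function against the downloaded archive, with the two patterns from the FileSet definitions, would show whether either pattern matches no files at all.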

EMCarrami commented 2 months ago

> @venkanna37 I think you just found a bug in mlcroissant! Thanks for taking the time to report it to us.
>
> I suspect the key parameter is the problem because we haven't implemented it in mlcroissant yet. I just opened #655 to follow this feature.
>
> When I remove the key parameter in python, I get a different error, because you expect a join but never make the join explicitly. You can use references to make the join (see this example).

The simple_join example above uses manually added data in the referenced RecordSet. The solution only seems to work when the data are added manually; it returns "nan" for the referenced field when the data are read from a FileObject. @marcenacp, do you happen to have a solution for this (#700)?