Open venkanna37 opened 6 months ago
Do not use schema:Text. Use xsd:string
@venkanna37 I think you just found a bug in mlcroissant! Thanks for taking the time to report it to us.
I suspect the key parameter is the problem because we haven't implemented it in mlcroissant yet. I just opened https://github.com/mlcommons/croissant/issues/655 to track this feature.
When I remove the key parameter in Python, I get a different error, because you expect a join but never make the join explicitly. You can use references to make the join (see this example).
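To make concrete what a join between the two file sets is expected to produce, here is a minimal sketch in plain Python. It does not use mlcroissant and is not how the library implements joins. It only applies the filename pattern quoted in this thread to derive a shared key, then pairs images with labels on that key. The filenames and the helper names (stem, join_by_stem) are hypothetical, chosen for illustration.

```python
import re

# Same pattern as the Transform regex in this thread: capture the name without ".tif".
STEM = re.compile(r"^(.*)\.tif$")


def stem(filename):
    """Return the filename without its .tif extension, or None if it does not match."""
    match = STEM.match(filename)
    return match.group(1) if match else None


def join_by_stem(image_files, label_files):
    """Pair each image with the label that shares its stem (an inner join)."""
    # Index the labels by their key; files that do not match the pattern are skipped.
    labels = {stem(name): name for name in label_files if stem(name) is not None}
    pairs = []
    for name in image_files:
        key = stem(name)
        if key in labels:
            pairs.append((key, name, labels[key]))
    return pairs


# Hypothetical filenames, for illustration only.
images = ["tile_001.tif", "tile_002.tif", "tile_003.tif"]
labels = ["tile_002.tif", "tile_001.tif", "notes.txt"]

pairs = join_by_stem(images, labels)
for key, image, label in pairs:
    print(key, image, label)
```

Here tile_003.tif has no matching label and notes.txt does not match the pattern, so neither appears in the result. A join expressed through references would be expected to match records on a shared key in the same spirit, though the details depend on the library.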
Thank you @marcenacp. I added the key parameter after seeing an example in the documentation. Thank you for sharing an example of a join. I tried joining the input and label images, but it did not work. Can you please have a look at my code?
import mlcroissant as mlc
import json
# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="test1.zip",
        name="test1.zip",
        description="A ZIP archive containing the dataset.",
        content_url="https://sid.erda.dk/share_redirect/dggmWwOQn4/test1.zip",
        encoding_format="application/zip",
        sha256="b57b47d174b331ca6fc1376abc7f95985d35978f66eb5183b4ac3ed5efc6e9f5",
    ),
    mlc.FileSet(
        id="source-images",
        name="source-images",
        description="RGB Images of dataset",
        contained_in=["test1.zip"],
        encoding_format="image/tiff",
        includes="train/images/*.tif",
    ),
    mlc.FileSet(
        id="label-images",
        name="label-images",
        description="RGB Images of dataset",
        contained_in=["test1.zip"],
        encoding_format="image/tiff",
        includes="train/labels/*.tif",
    )
]
# Define the record set for images and labels.
record_set = [
    mlc.RecordSet(
        id="images",
        name="images",
        fields=[
            mlc.Field(
                id="images/filename",
                name="images/filename",
                data_types=[mlc.DataType.TEXT],
                source=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename"),
                    transforms=[mlc.Transform(regex=r"^(.*)\.tif$")]
                )
            ),
            mlc.Field(
                id="images/image_content",
                name="image_content",
                description="Image content.",
                data_types=[mlc.DataType.IMAGE_OBJECT],
                source=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="content")
                ),
                references=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename"),
                    transforms=[mlc.Transform(regex=r"^(.*)\.tif$")])
            ),
            mlc.Field(
                id="images/label",
                name="Image Label",
                description="Semantic segmentation label image.",
                data_types=[mlc.DataType.IMAGE_OBJECT],
                source=mlc.Source(
                    file_set="label-images",
                    extract=mlc.Extract(file_property="content")
                ),
                references=mlc.Source(
                    file_set="source-images",
                    extract=mlc.Extract(file_property="filename"),
                    transforms=[mlc.Transform(regex=r"^(.*)\.tif$")])
            )
        ]
    )
]
# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="semantic-segmentation",
    description="Test dataset for croissant format.",
    distribution=distribution,
    record_sets=record_set,
    version="1.0.0",
    license="CC BY-SA 4.0",
)
print(metadata.issues.report())
with open("croissant2.json", "w") as f:
    content = metadata.to_json()
    content = json.dumps(content, indent=2)
    print(content)
    f.write(content)
    f.write("\n")
dataset = mlc.Dataset(jsonld="croissant2.json", debug=True)
# Iterate over the records and print them
try:
    records = dataset.records(record_set="images")
    for i, record in enumerate(records):
        print(record)
        if i >= 10:
            break
except RuntimeError as e:
    print(f"RuntimeError: {e}")
except ValueError as e:
    print(f"ValueError: {e}")
Here is the error:
Traceback (most recent call last):
  File "/home/venky/Documents/projects/roof_detection/Croissant/nacala_data.py", line 105, in <module>
    for i, record in enumerate(records):
  File "/home/venky/anaconda3/envs/crois/lib/python3.12/site-packages/mlcroissant/_src/datasets.py", line 139, in __iter__
    yield from execute_operations_sequentially(
  File "/home/venky/anaconda3/envs/crois/lib/python3.12/site-packages/mlcroissant/_src/operation_graph/execute.py", line 72, in execute_operations_sequentially
    raise GenerationError(
mlcroissant._src.core.issues.GenerationError: An error occured during the sequential generation of the dataset, more specifically during the operation FilterFiles(label-images)
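The traceback above only shows the outer GenerationError, which names the failing operation but not why it failed. If the library chains the original exception onto it (an assumption, not confirmed in this thread), the underlying error can be recovered by following Python's standard __cause__ and __context__ links. The sketch below uses a hypothetical stand-in class and a placeholder failure rather than mlcroissant itself.

```python
class GenerationError(Exception):
    """Hypothetical stand-in for the library's error type."""


def failing_operation():
    # Placeholder for whatever actually fails inside an operation.
    raise ValueError("placeholder underlying failure")


def run():
    try:
        failing_operation()
    except Exception as e:
        # Wrap the original error, keeping it reachable through __cause__.
        raise GenerationError("error during the operation FilterFiles(label-images)") from e


def root_cause(error):
    """Follow __cause__ / __context__ links to the innermost exception."""
    current = error
    while True:
        inner = current.__cause__ or current.__context__
        if inner is None:
            return current
        current = inner


try:
    run()
except Exception as e:
    # Catch broadly so a wrapper type that is neither RuntimeError
    # nor ValueError is still handled.
    cause = root_cause(e)
    print(f"{type(e).__name__}: {e}")
    print(f"caused by {type(cause).__name__}: {cause}")
```

Catching Exception here is a debugging convenience: handlers limited to RuntimeError and ValueError would let a different wrapper type pass through uncaught.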
The simple_join example above uses manually added data in the referenced RecordSet. The solution only seems to work if the data are added manually, and it returns "nan" for the referenced field when the data is read from a FileObject. @marcenacp do you happen to have a solution for this: #700?
Hi, I recently started working with Croissant to create a dataset for semantic segmentation. The dataset has images and labels, both in *.tif format. There were no errors while programmatically writing and verifying the JSON-LD Croissant file for my dataset, but I repeatedly get this error when I load the dataset. I thought it was a silly mistake but could not fix it. I wrote my code using the Introduction notebook as a reference. I am attaching my code, the JSON-LD file, and the complete error. Can you please help me fix this? Thank you
Code:
JSON-LD file
Error