mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0
370 stars 38 forks source link

JSON LD parsing fails when multiple strings given to FileSet.includes #698

Open brendon-boldt opened 1 month ago

brendon-boldt commented 1 month ago

FileSet(..., includes=["xyz"]) works but FileSet(..., includes=["xyz", "wxy"]) does not work, failing with:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "REDACTED/util/croissant.py", line 154, in <module>
    test()
  File "REDACTED/util/croissant.py", line 141, in test
    dataset = mlc.Dataset(jsonld="croissant.json")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 6, in __init__
  File "REDACTED/mlcroissant/_src/datasets.py", line 70, in __post_init__
    self.metadata = Metadata.from_file(ctx=ctx, file=self.jsonld)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/structure_graph/nodes/metadata.py", line 429, in from_file
    return cls.from_json(ctx=ctx, json_=json_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/structure_graph/nodes/metadata.py", line 439, in from_json
    jsonld = expand_jsonld(json_, ctx=ctx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 231, in expand_jsonld
    recursively_populate_jsonld(entry_node, id_to_node)
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 171, in recursively_populate_jsonld
    value = [recursively_populate_jsonld(child, id_to_node) for child in value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 161, in recursively_populate_jsonld
    return recursively_populate_jsonld(entry_node, id_to_node)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 171, in recursively_populate_jsonld
    value = [recursively_populate_jsonld(child, id_to_node) for child in value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 164, in recursively_populate_jsonld
    for key, value in entry_node.copy().items():
                      ^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'copy'

presumably due to https://github.com/mlcommons/croissant/blob/0f95e04763557929e4f4c6711c108c0d9cf7b818/python/mlcroissant/mlcroissant/_src/core/json_ld.py#L169 trying to recurse into the array of strings by running recursively_populate_jsonld on strs when it seems to be intended for dicts only. One simple fix would be to exit early from recusrively_populate_jsonld is entry_node is not a dict, but maybe this is just masking a deeper problem.

If you want me to open a PR or something, let me know. Thanks.

marcenacp commented 1 month ago

Hi @brendon-boldt, nice catch! Can you please open a PR? That'd be super useful