mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[NEURIPS] `.zip` and `.tar.gz` archives are not supported for file uploading #663

Open amorehead opened 3 months ago

amorehead commented 3 months ago

This appears related to https://github.com/mlcommons/croissant/issues/547, so I am closing this issue.

amorehead commented 3 months ago

I am reopening this issue per the NeurIPS organizer's recommendation.

JovinLeong commented 2 months ago

Hi, wanted to check what you mean by this. I have a `FileObject` with a `content_url` pointing to a publicly available `.zip` file, and I have my `encoding_format` set to `application/zip`, similar to coco2014, but I'm getting the following error:

ValueError: Unsupported compression method for file: ...

and

GenerationError: An error occurred during the streaming generation of the dataset, more specifically during the operation Extract(training_data).

Is this the same issue you're facing? I'm able to get it working if I don't upload a compressed .zip file.

Edit: I tried updating my `content_url` to refer to a local copy of the `.zip` instead and it works perfectly. I'm just not able to get it to work with a `content_url` that points to a remote `.zip` file.
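
For anyone who wants to try the same local-path workaround, here is a minimal sketch that downloads a remote archive and checks that what arrived is really a zip before pointing `content_url` at it. It uses only the Python standard library, not any Croissant API; `fetch_archive` is a hypothetical helper name, and whether a local copy avoids the error in your setup is an assumption based only on the report above.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_archive(url: str, dest_dir: str) -> Path:
    """Download the archive at `url` into `dest_dir` and return the local path.

    Hypothetical helper for illustration; not part of mlcroissant.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local copy after the last URL path segment (query string dropped).
    name = url.rsplit("/", 1)[-1].split("?", 1)[0] or "archive.zip"
    local_path = dest / name

    # Stream the response to disk so large archives are not held in memory.
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)

    # Fail early if the server returned something other than a zip,
    # for example an HTML landing page.
    if not zipfile.is_zipfile(local_path):
        raise ValueError(f"Downloaded file is not a valid zip archive: {local_path}")
    return local_path
```

The returned path can then be used as the `content_url` of the `FileObject` in place of the remote URL. The `is_zipfile` check is only a sanity test on the download; it does not explain why the remote case fails.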

amorehead commented 2 months ago

Hi, @JovinLeong. I believe the issue for me is that I'm trying to point to a remote .zip/.tar.gz archive. Good to know that local paths work though!

JovinLeong commented 2 months ago

Okay, so it seems like we're facing the same issue, which is odd since the coco2014 example uses a remote `.zip`. Though tbf it seems like coco2014 isn't working for me anyway.