We saw that some datasets are not converted correctly to .pq.
Moreover, we want to change our Minio usage from one bucket per dataset to a single bucket for all datasets, per Minio recommendations ("We would recommend in the strongest terms not to have 10k+ buckets", https://subnet.min.io/issues/9582).
So, let's:
[x] Re-convert all datasets from .arff to .pq, and upload them to Minio: dataset/0000/[id]/[id].pq for datasets with id from 0 to 9999, dataset/0001/[id]/[id].pq for datasets with id from 10000 to 19999, etc.
[x] Fix errors, or at least make sure the conversion is able to continue after a "broken" conversion
[x] Run a cronjob on production that periodically converts new datasets
[x] Point the PHP server to the new location of the .pq files
[x] Upload croissants to the new buckets: dataset/0000/[id]/croissant_[id].json (Jos)
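
The path scheme in the checklist above can be sketched as a small helper. This is only an illustration, not the code from the minio-data repository: the function names and the `group_size` parameter are hypothetical, and the issue does not say whether `dataset` is the bucket name or a key prefix, so the sketch simply returns the full path as written above.

```python
def object_prefix(dataset_id: int, group_size: int = 10_000) -> str:
    """Return 'dataset/<group>/<id>' where <group> is the id divided by
    group_size, zero-padded to four digits (assumed from the examples)."""
    if dataset_id < 0:
        raise ValueError("dataset id must be non-negative")
    group = dataset_id // group_size
    return f"dataset/{group:04d}/{dataset_id}"


def parquet_path(dataset_id: int) -> str:
    """Path of the converted parquet file for a dataset."""
    return f"{object_prefix(dataset_id)}/{dataset_id}.pq"


def croissant_path(dataset_id: int) -> str:
    """Path of the croissant metadata file for a dataset."""
    return f"{object_prefix(dataset_id)}/croissant_{dataset_id}.json"


if __name__ == "__main__":
    for example_id in (0, 9999, 10000, 45702):
        print(parquet_path(example_id), croissant_path(example_id))
```

Integer division keeps every group at 10000 ids, so ids 0 to 9999 land in `0000` and ids 10000 to 19999 land in `0001`, matching the two examples in the checklist.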
22180 - https://www.openml.org/api/v1/xml/data/[ID] returned code 111: Unknown dataset - None
1116 - This dataset has sparse arff format. Not converting it, to avoid OOM exception.
287 - https://www.openml.org/api/v1/xml/data/features/[ID] returned code 274: No features found.
251 - No access granted - None
33 - Invalid layout of the ARFF file, at line [LINE]
26 - not in index
14 - Bad @ATTRIBUTE type
14 - Checksum of downloaded file is unequal to the expected checksum
11 - Data value [VALUE] not found in nominal declaration, at line [LINE].
9 - 'NoneType' object has no attribute 'casefold'
6 - list index out of range
5 - not enough values to unpack (expected 2, got 1)
5 - Unexpected server error when calling
4 - Unexpected dataset.format: csv
4 - https://www.openml.org/api/v1/xml/data/[ID] returned code 113: Could not find data file record - None
3 - Manually stopped. Conversion hangs for (at least) 12 hours
1 - Invalid numerical value, at line 1972.
1 - Bad @DATA instance format
1 - Bad @ATTRIBUTE name class at line 10003, this name is already in use in line 820.
1 - month must be in 1..12: 000102, at position 0
1 - Manually stopped. OOM: Arff file of 180MB, but eats up all memory while converting
1 - Bad @ATTRIBUTE format
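
A tally like the one above could be produced by replacing dataset-specific parts of each message with placeholders before counting. The sketch below is an assumption about how that might be done, not the script that produced this list: the two regex rules and the function names are hypothetical, and the real list evidently left some line numbers in place.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical normalization rules: (pattern, replacement).
_RULES = [
    # Dataset id at the end of an OpenML API URL path.
    (re.compile(r"(/data/(?:features/)?)\d+"), r"\1[ID]"),
    # Line number reported by the ARFF parser.
    (re.compile(r"at line \d+"), "at line [LINE]"),
]


def normalize(message: str) -> str:
    """Replace dataset-specific ids and line numbers with placeholders."""
    for pattern, replacement in _RULES:
        message = pattern.sub(replacement, message)
    return message


def tally(messages: Iterable[str]) -> List[Tuple[str, int]]:
    """Count normalized messages, most frequent first."""
    return Counter(normalize(m) for m in messages).most_common()


if __name__ == "__main__":
    sample = [
        "Invalid layout of the ARFF file, at line 3",
        "Invalid layout of the ARFF file, at line 9",
        "not in index",
    ]
    for message, count in tally(sample):
        print(f"{count} - {message}")
```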
Status
1) Done (code @ https://github.com/openml-labs/minio-data). Some problems can be found in the logs. All sparse ARFF files are skipped (to avoid OOM); 4546, 43091 and 43092 were aborted after hours.
2) I've created a csv with all errors. See error list at bottom of this issue.
3) Runs every 10 minutes on K8s. See https://github.com/openml/openml-internal-infra-wiki/blob/main/pages/kubernetes/arff-to-pq.md
4) Deployed, see https://github.com/openml/OpenML/pull/1203 Example: https://www.openml.org/api/v1/json/data/45702
5) Croissant: done. See https://github.com/openml/openml-internal-infra-wiki/blob/main/pages/kubernetes/croissant-converter.md
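
For points 1) and 2), a conversion loop that keeps going after a failed dataset and records each failure in a csv might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code in the minio-data repository: `convert_all`, the `convert` callable and the csv column names are all hypothetical.

```python
import csv
from typing import Callable, Iterable


def convert_all(
    dataset_ids: Iterable[int],
    convert: Callable[[int], None],  # hypothetical per-dataset converter
    error_csv: str,
) -> int:
    """Convert every dataset; log failures to a csv instead of stopping.

    Returns the number of datasets that failed.
    """
    failures = 0
    with open(error_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["dataset_id", "error"])
        for dataset_id in dataset_ids:
            try:
                convert(dataset_id)
            except Exception as exc:
                # Broad on purpose: one broken dataset must not end the run.
                failures += 1
                writer.writerow([dataset_id, f"{type(exc).__name__}: {exc}"])
    return failures
```

A try/except of this kind only covers conversions that raise. A conversion that hangs for hours or exhausts memory, as happened for a few datasets here, would not be caught this way and would need something like a separate process with a timeout.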
Errors