openml / openml-data

For tracking issues related to OpenML datasets

Re-upload Parquet files to single "dataset" bucket #66

Open PGijsbers opened 2 months ago

PGijsbers commented 2 months ago

We saw that some datasets are not converted correctly to .pq. Moreover, we want to change our Minio usage from one bucket per dataset to a single bucket for all datasets, per Minio's recommendation ("We would recommend in the strongest terms not to have 10k+ buckets" https://subnet.min.io/issues/9582).

So, let's

  1. [x] Re-convert all datasets from .arff to .pq, and upload them to Minio: dataset/0000/[id]/[id].pq for datasets with id from 0 to 9999, dataset/0001/[id]/[id].pq for datasets with id from 10000 to 19999, etc.
  2. [x] Fix errors, or at least make sure the conversion run can continue after a "broken" conversion
  3. [x] Run a cronjob on production that periodically converts new datasets
  4. [x] Point the php server towards the new location of the .pq
  5. [x] Upload croissants to the new buckets: dataset/0000/[id]/croissant_[id].json (Jos)
  6. [x] Wait a couple of days
  7. [ ] Set old buckets to private
  8. [ ] Wait a couple of weeks
  9. [ ] Delete old buckets
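The path scheme from steps 1 and 5 can be sketched as below. This is only an illustration, not the code from the minio-data repository: the function names `parquet_path` and `croissant_path` are hypothetical, and the sketch assumes that `[id]` is written without zero-padding, which the layout above does not state explicitly.

```python
BUCKET = "dataset"          # the single shared bucket named in the issue title
IDS_PER_PREFIX = 10_000     # ids 0..9999 -> 0000, 10000..19999 -> 0001, ...


def _prefix(dataset_id: int) -> str:
    """Four-digit directory prefix grouping datasets in blocks of 10,000."""
    if dataset_id < 0:
        raise ValueError("dataset id must be non-negative")
    return f"{dataset_id // IDS_PER_PREFIX:04d}"


def parquet_path(dataset_id: int) -> str:
    """Path of the Parquet file: dataset/<prefix>/<id>/<id>.pq."""
    # Assumption: the id itself is not zero-padded.
    return f"{BUCKET}/{_prefix(dataset_id)}/{dataset_id}/{dataset_id}.pq"


def croissant_path(dataset_id: int) -> str:
    """Path of the Croissant file: dataset/<prefix>/<id>/croissant_<id>.json."""
    return f"{BUCKET}/{_prefix(dataset_id)}/{dataset_id}/croissant_{dataset_id}.json"


print(parquet_path(45702))    # dataset/0004/45702/45702.pq
print(croissant_path(45702))  # dataset/0004/45702/croissant_45702.json
```

Integer division by 10,000 keeps every prefix directory at no more than 10,000 dataset folders.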

Status

1) Done (code @ https://github.com/openml-labs/minio-data). Some problems can be found in the logs. All sparse ARFF files are ignored (to avoid OOM), and 4546, 43091 and 43092 were aborted after hours.
2) I've created a csv with all errors. See the error list at the bottom of this issue.
3) Runs every 10 minutes on K8s. See https://github.com/openml/openml-internal-infra-wiki/blob/main/pages/kubernetes/arff-to-pq.md
4) Deployed, see https://github.com/openml/OpenML/pull/1203 Example: https://www.openml.org/api/v1/json/data/45702
5) Croissant: done. See https://github.com/openml/openml-internal-infra-wiki/blob/main/pages/kubernetes/croissant-converter.md
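A conversion loop that keeps going after a "broken" dataset and records each failure in a csv could look roughly like the sketch below. It is an assumption about the general shape, not the actual conversion code: `convert_all` and its `convert` callback are hypothetical names, and the csv columns are made up for illustration.

```python
import csv
import io


def convert_all(dataset_ids, convert, error_file):
    """Convert every dataset, logging failures instead of stopping.

    `convert` is a hypothetical callable that converts one dataset and
    raises on failure. Returns the number of failed conversions.
    """
    writer = csv.writer(error_file)
    writer.writerow(["dataset_id", "error"])  # assumed column layout
    failed = 0
    for dataset_id in dataset_ids:
        try:
            convert(dataset_id)
        except Exception as exc:  # deliberately broad: one bad file must not end the run
            writer.writerow([dataset_id, str(exc)])
            failed += 1
    return failed


# Usage with a stand-in converter that fails for one id.
def fake_convert(dataset_id):
    if dataset_id == 2:
        raise ValueError("Bad @ATTRIBUTE type")


buffer = io.StringIO()
print(convert_all([1, 2, 3], fake_convert, buffer))  # 1
```

Catching exceptions only covers conversions that fail outright. Conversions that hang or exhaust memory, like the manually stopped ones, would need a separate mechanism such as a per-dataset timeout or memory limit.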

Errors

22180 - https://www.openml.org/api/v1/xml/data/[ID] returned code 111: Unknown dataset - None
 1116 - This dataset has sparse arff format. Not converting it, to avoid OOM exception.
  287 - https://www.openml.org/api/v1/xml/data/features/[ID] returned code 274: No features found.
  251 - No access granted - None
   33 - Invalid layout of the ARFF file, at line [LINE]
   26 - not in index
   14 - Bad @ATTRIBUTE type
   14 - Checksum of downloaded file is unequal to the expected checksum
   11 - Data value [VALUE] not found in nominal declaration, at line [LINE].
    9 - 'NoneType' object has no attribute 'casefold'
    6 - list index out of range
    5 - not enough values to unpack (expected 2, got 1)
    5 - Unexpected server error when calling
    4 - Unexpected dataset.format: csv
    4 - https://www.openml.org/api/v1/xml/data/[ID] returned code 113: Could not find data file record - None
    3 - Manually stopped. Conversion hangs for (at least) 12 hours
    1 - Invalid numerical value, at line 1972.
    1 - Bad @DATA instance format
    1 - Bad @ATTRIBUTE name class at line 10003, this name is already in use in line 820.
    1 - month must be in 1..12: 000102, at position 0
    1 - Manually stopped. OOM: Arff file of 180MB, but eats up all memory while converting
    1 - Bad @ATTRIBUTE format
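A tally like the one above can be produced by replacing the varying parts of each message with placeholders before counting. The sketch below shows one way to do that. It is not the script that generated this list: the two patterns and the names `normalize` and `tally` are assumptions, and it would also fold specific line numbers that the list above keeps verbatim.

```python
import re
from collections import Counter

# Assumed patterns: dataset ids in API URLs and line numbers in parser errors.
_PATTERNS = [
    (re.compile(r"(/api/v1/xml/data/(?:features/)?)\d+"), r"\1[ID]"),
    (re.compile(r"at line \d+"), "at line [LINE]"),
]


def normalize(message: str) -> str:
    """Replace dataset ids and line numbers with placeholders."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    return message


def tally(messages):
    """Count normalized messages, most frequent first."""
    return Counter(normalize(m) for m in messages).most_common()


errors = [
    "https://www.openml.org/api/v1/xml/data/7 returned code 111: Unknown dataset - None",
    "Invalid layout of the ARFF file, at line 12",
    "https://www.openml.org/api/v1/xml/data/8 returned code 111: Unknown dataset - None",
]
for message, count in tally(errors):
    print(f"{count:5d} - {message}")
```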