huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Raise specific errors (and error_code) instead of UnexpectedError #1443

Open severo opened 1 year ago

severo commented 1 year ago

The following query on the production database gives the number of datasets with at least one cache entry with error_code "UnexpectedError", grouped by the underlying "cause_exception".

For the most common ones (DatasetGenerationError, HfHubHTTPError, OSError, etc.), we would benefit from raising a specific error with its own error code.

A null cause means the cache entry has no details.cause_exception. These cache entries should be inspected more closely. See https://github.com/huggingface/datasets-server/issues/1123 in particular, which is one of the cases where no cause exception is reported.

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 1964 }
{ _id: null, count: 1388 }
{ _id: 'HfHubHTTPError', count: 1154 }
{ _id: 'OSError', count: 433 }
{ _id: 'FileNotFoundError', count: 242 }
{ _id: 'FileExistsError', count: 198 }
{ _id: 'ValueError', count: 186 }
{ _id: 'TypeError', count: 160 }
{ _id: 'ConnectionError', count: 146 }
{ _id: 'RuntimeError', count: 86 }
{ _id: 'NonMatchingSplitsSizesError', count: 83 }
{ _id: 'FileSystemError', count: 62 }
{ _id: 'ClientResponseError', count: 52 }
{ _id: 'ArrowInvalid', count: 45 }
{ _id: 'ParquetResponseEmptyError', count: 43 }
{ _id: 'RepositoryNotFoundError', count: 41 }
{ _id: 'ManualDownloadError', count: 39 }
{ _id: 'IndexError', count: 28 }
{ _id: 'AttributeError', count: 16 }
{ _id: 'KeyError', count: 15 }
{ _id: 'GatedRepoError', count: 13 }
{ _id: 'NotImplementedError', count: 11 }
{ _id: 'ExpectedMoreSplits', count: 9 }
{ _id: 'PermissionError', count: 8 }
{ _id: 'BadRequestError', count: 7 }
{ _id: 'NonMatchingChecksumError', count: 6 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'UnboundLocalError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'ZeroDivisionError', count: 3 }
{ _id: 'InvalidDocument', count: 3 }
{ _id: 'DoesNotExist', count: 3 }
{ _id: 'EOFError', count: 3 }
{ _id: 'ImportError', count: 3 }
{ _id: 'NotADirectoryError', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'ReadTimeout', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'ExpectedMoreDownloadedFiles', count: 2 }
{ _id: 'InvalidConfigName', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'MissingBeamOptions', count: 2 }
{ _id: 'HTTPError', count: 1 }
{ _id: 'BadZipFile', count: 1 }
{ _id: 'OverflowError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'IsADirectoryError', count: 1 }
{ _id: 'OperationalError', count: 1 }
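
The idea of raising a specific error per cause exception can be sketched as follows. This is only an illustration under stated assumptions: the error codes `ExternalConnectionError`, `MissingFileError` and `StorageError`, the `retryable` flag and the `classify` helper are hypothetical names, not the project's actual API, and only Python built-in exceptions are used so that the block runs on its own.

```python
from typing import NamedTuple


class ErrorInfo(NamedTuple):
    code: str        # error_code to store in the cache entry (hypothetical names)
    retryable: bool  # whether a later retry could plausibly succeed


# Ordered from most to least specific: ConnectionError and FileNotFoundError
# are both subclasses of OSError, so OSError must come last.
SPECIFIC_ERRORS: list[tuple[type[BaseException], ErrorInfo]] = [
    (ConnectionError, ErrorInfo("ExternalConnectionError", retryable=True)),
    (FileNotFoundError, ErrorInfo("MissingFileError", retryable=False)),
    (OSError, ErrorInfo("StorageError", retryable=False)),
]

# Anything not listed above keeps the generic code.
FALLBACK = ErrorInfo("UnexpectedError", retryable=False)


def classify(cause: BaseException) -> ErrorInfo:
    """Return the first matching specific error, or the UnexpectedError fallback."""
    for exc_type, info in SPECIFIC_ERRORS:
        if isinstance(cause, exc_type):
            return info
    return FALLBACK
```

Checking the most specific types first matters here because several of the causes in the list above share a base class.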

severo commented 1 year ago

We need to do it to provide better feedback to the user, and to retry when appropriate.
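
A minimal sketch of "retry when appropriate", assuming that transient failures can be recognized by their exception type. The `run_with_retry` helper and the choice of `ConnectionError` and `TimeoutError` as retryable types are assumptions made for illustration, not the project's behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these built-in exception types stand in for transient failures.
RETRYABLE = (ConnectionError, TimeoutError)


def run_with_retry(job: Callable[[], T], max_attempts: int = 3, delay_s: float = 0.0) -> T:
    """Run job, retrying only when the failure is one of the RETRYABLE types."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except RETRYABLE:
            if attempt == max_attempts:
                raise  # out of attempts: surface the transient error
            time.sleep(delay_s)
    # Only reached when the loop body never ran.
    raise ValueError("max_attempts must be at least 1")
```

Any other exception propagates on the first attempt, which is the case where a specific error code gives the user more useful feedback than a retry.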

severo commented 1 year ago

Copying from #1462

Updated query (Without errors from parent):

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", kind:"split-duckdb-index", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])

Of the 128617 records currently in the cache collection, these are the most frequent causes of UnexpectedError:


[
{ _id: { cause: 'HfHubHTTPError' }, count: 4429 },
{ _id: { cause: 'HTTPException' }, count: 2570 },
{ _id: { cause: 'Error' }, count: 54 },
{ _id: { cause: 'BinderException' }, count: 41 },
{ _id: { cause: 'CatalogException' }, count: 38 },
{ _id: { cause: 'ParserException' }, count: 29 },
{ _id: { cause: 'InvalidInputException' }, count: 22 },
{ _id: { cause: 'RuntimeError' }, count: 8 },
{ _id: { cause: 'IOException' }, count: 5 },
{ _id: { cause: 'BadRequestError' }, count: 2 },
{ _id: { cause: 'NotPrimaryError' }, count: 2 },
{ _id: { cause: 'EntryNotFoundError' }, count: 2 }
]


> Since this is a new job runner, most of these should be evaluated in case there is a bug in the code.
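
Note that the updated query uses a single `$group`, so it counts cache entries, whereas the original two-stage `$group` counted distinct datasets per cause. A small in-memory sketch of the difference, using made-up entries (the dataset names and counts are illustrative, not production data):

```python
from collections import Counter

# Made-up cache entries; only the fields used by the queries are kept.
entries = [
    {"dataset": "user/a", "cause": "HfHubHTTPError"},
    {"dataset": "user/a", "cause": "HfHubHTTPError"},
    {"dataset": "user/a", "cause": "HfHubHTTPError"},
    {"dataset": "user/b", "cause": "HfHubHTTPError"},
    {"dataset": "user/b", "cause": "IOException"},
]

# Single $group on cause: one count per cache entry.
entries_per_cause = Counter(e["cause"] for e in entries)

# $group on (cause, dataset), then $group on cause: one count per dataset.
distinct_pairs = {(e["cause"], e["dataset"]) for e in entries}
datasets_per_cause = Counter(cause for cause, _ in distinct_pairs)

print(entries_per_cause.most_common())   # [('HfHubHTTPError', 4), ('IOException', 1)]
print(datasets_per_cause.most_common())  # [('HfHubHTTPError', 2), ('IOException', 1)]
```

A dataset with many configs and splits can therefore weigh heavily in the per-entry counts while counting only once in the per-dataset counts.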
AndreaFrancis commented 1 year ago

Updating list:

datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { cause: 'AttributeError' }, count: 9876 },
  { _id: { cause: 'ClientResponseError' }, count: 6034 },
  { _id: { cause: 'DatasetGenerationError' }, count: 5674 },
  { _id: { cause: 'ParserException' }, count: 3058 },
  { _id: { cause: 'TypeError' }, count: 2689 },
  { _id: { cause: 'IOException' }, count: 1961 },
  { _id: { cause: 'InvalidInputException' }, count: 1814 },
  { _id: { cause: 'ZeroDivisionError' }, count: 1693 },
  { _id: { cause: 'FileNotFoundError' }, count: 1687 },
  { _id: { cause: 'HfHubHTTPError' }, count: 1316 },
  { _id: { cause: 'HTTPException' }, count: 1216 },
  { _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1141 },
  { _id: { cause: 'EntryNotFoundError' }, count: 895 },
  { _id: { cause: 'ValueError' }, count: 827 },
  { _id: { cause: 'BinderException' }, count: 789 },
  { _id: { cause: 'KeyError' }, count: 608 },
  { _id: { cause: 'ParquetResponseEmptyError' }, count: 598 },
  { _id: { cause: 'NotImplementedError' }, count: 509 },
  { _id: { cause: 'CachedArtifactNotFoundError' }, count: 457 },
  { _id: { cause: null }, count: 370 },
  { _id: { cause: 'ReadTimeout' }, count: 329 },
  { _id: { cause: 'ConnectionError' }, count: 264 },
  { _id: { cause: 'LocationParseError' }, count: 191 },
  { _id: { cause: 'OSError' }, count: 186 },
  { _id: { cause: 'IndexError' }, count: 155 },
  { _id: { cause: 'AssertionError' }, count: 84 },
  { _id: { cause: 'BadZipFile' }, count: 63 },
  { _id: { cause: 'ArrowInvalid' }, count: 57 },
  { _id: { cause: 'OutOfRangeException' }, count: 53 },
  { _id: { cause: 'CatalogException' }, count: 44 },
  { _id: { cause: 'ModuleNotFoundError' }, count: 41 },
  { _id: { cause: 'RuntimeError' }, count: 39 },
  { _id: { cause: 'LocalEntryNotFoundError' }, count: 26 },
  { _id: { cause: 'UnboundLocalError' }, count: 26 },
  { _id: { cause: 'FileExistsError' }, count: 24 },
  { _id: { cause: 'Error' }, count: 24 },
  { _id: { cause: 'RepositoryNotFoundError' }, count: 21 },
  { _id: { cause: 'InvalidOperation' }, count: 16 },
  { _id: { cause: 'ExpectedMoreSplits' }, count: 15 },
  { _id: { cause: 'ImportError' }, count: 12 },
  { _id: { cause: 'ServerDisconnectedError' }, count: 11 },
  { _id: { cause: 'NameError' }, count: 9 },
  { _id: { cause: 'SyntaxError' }, count: 8 },
  { _id: { cause: 'PermissionError' }, count: 6 },
  { _id: { cause: 'InternalException' }, count: 5 },
  { _id: { cause: 'ChunkedEncodingError' }, count: 5 },
  { _id: { cause: 'InvalidDocument' }, count: 4 },
  { _id: { cause: 'ParserError' }, count: 3 },
  { _id: { cause: 'DoesNotExist' }, count: 3 },
  { _id: { cause: 'ConversionException' }, count: 3 },
  { _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
  { _id: { cause: 'SSLError' }, count: 3 },
  { _id: { cause: 'Exception' }, count: 3 },
  { _id: { cause: 'GatedRepoError' }, count: 3 },
  { _id: { cause: 'JSONDecodeError' }, count: 2 },
  { _id: { cause: 'InvalidConfigName' }, count: 2 },
  { _id: { cause: 'FileSystemError' }, count: 1 },
  { _id: { cause: 'AutoReconnect' }, count: 1 },
  { _id: { cause: 'TypeMismatchException' }, count: 1 },
  { _id: { cause: 'HFValidationError' }, count: 1 },
  { _id: { cause: 'EOFError' }, count: 1 },
  { _id: { cause: 'OperationalError' }, count: 1 },
  { _id: { cause: 'TransactionException' }, count: 1 },
  { _id: { cause: 'NotPrimaryError' }, count: 1 },
  { _id: { cause: 'UnicodeDecodeError' }, count: 1 },
  { _id: { cause: 'OutOfMemoryException' }, count: 1 }
]

AndreaFrancis commented 1 year ago

After some manual cache maintenance (removing obsolete records whose config or split no longer exists), this is the updated list. Mostly AttributeError and ClientResponseError were reduced:

[
  { _id: { cause: 'DatasetGenerationError' }, count: 3791 },
  { _id: { cause: 'TypeError' }, count: 2222 },
  { _id: { cause: 'ParserException' }, count: 2095 },
  { _id: { cause: 'InvalidInputException' }, count: 1750 },
  { _id: { cause: 'FileNotFoundError' }, count: 1393 },
  { _id: { cause: 'ZeroDivisionError' }, count: 1224 },
  { _id: { cause: 'HfHubHTTPError' }, count: 1128 },
  { _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1116 },
  { _id: { cause: 'IOException' }, count: 1035 },
  { _id: { cause: 'CachedArtifactNotFoundError' }, count: 745 },
  { _id: { cause: 'HTTPException' }, count: 526 },
  { _id: { cause: 'NotImplementedError' }, count: 493 },
  { _id: { cause: 'BinderException' }, count: 462 },
  { _id: { cause: 'KeyError' }, count: 454 },
  { _id: { cause: 'ReadTimeout' }, count: 311 },
  { _id: { cause: 'ParquetResponseEmptyError' }, count: 292 },
  { _id: { cause: 'ConnectionError' }, count: 201 },
  { _id: { cause: 'ValueError' }, count: 187 },
  { _id: { cause: 'AttributeError' }, count: 127 },
  { _id: { cause: 'IndexError' }, count: 107 },
  { _id: { cause: 'OSError' }, count: 102 },
  { _id: { cause: 'ClientResponseError' }, count: 94 },
  { _id: { cause: 'EntryNotFoundError' }, count: 92 },
  { _id: { cause: 'AssertionError' }, count: 84 },
  { _id: { cause: 'BadZipFile' }, count: 61 },
  { _id: { cause: 'OutOfRangeException' }, count: 46 },
  { _id: { cause: 'ModuleNotFoundError' }, count: 43 },
  { _id: { cause: 'LocationParseError' }, count: 29 },
  { _id: { cause: 'ArrowInvalid' }, count: 28 },
  { _id: { cause: 'CatalogException' }, count: 26 },
  { _id: { cause: 'LocalEntryNotFoundError' }, count: 19 },
  { _id: { cause: 'Error' }, count: 16 },
  { _id: { cause: 'ServerDisconnectedError' }, count: 9 },
  { _id: { cause: 'SyntaxError' }, count: 8 },
  { _id: { cause: 'InvalidOperation' }, count: 8 },
  { _id: { cause: 'RuntimeError' }, count: 7 },
  { _id: { cause: 'PermissionError' }, count: 6 },
  { _id: { cause: 'UnboundLocalError' }, count: 6 },
  { _id: { cause: 'NameError' }, count: 5 },
  { _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
  { _id: { cause: 'Exception' }, count: 3 },
  { _id: { cause: 'ChunkedEncodingError' }, count: 3 },
  { _id: { cause: 'SSLError' }, count: 3 },
  { _id: { cause: 'ExpectedMoreSplits' }, count: 2 },
  { _id: { cause: 'ConversionException' }, count: 2 },
  { _id: { cause: null }, count: 2 },
  { _id: { cause: 'ParserError' }, count: 2 },
  { _id: { cause: 'RepositoryNotFoundError' }, count: 2 },
  { _id: { cause: 'OperationalError' }, count: 1 },
  { _id: { cause: 'UnicodeDecodeError' }, count: 1 },
  { _id: { cause: 'TransactionException' }, count: 1 },
  { _id: { cause: 'OutOfMemoryException' }, count: 1 },
  { _id: { cause: 'DoesNotExist' }, count: 1 },
  { _id: { cause: 'ImportError' }, count: 1 },
  { _id: { cause: 'HFValidationError' }, count: 1 },
  { _id: { cause: 'JSONDecodeError' }, count: 1 },
  { _id: { cause: 'EOFError' }, count: 1 },
  { _id: { cause: 'TypeMismatchException' }, count: 1 },
  { _id: { cause: 'InternalException' }, count: 1 }
]
AndreaFrancis commented 10 months ago

Update of UnexpectedErrors count by kind:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kindkind: 'config-parquet-and-info' }, count: 9117 },
  { _id: { kindkind: 'split-descriptive-statistics' }, count: 6685 },
  { _id: { kindkind: 'split-duckdb-index' }, count: 591 },
  { _id: { kindkind: 'split-first-rows-from-parquet' }, count: 11 }
]

The split-first-rows-from-parquet errors will be fixed by https://github.com/huggingface/datasets-server/pull/2126

severo commented 10 months ago

Interesting that only 4 steps produce all the unexpected errors.

severo commented 10 months ago

For KeyError, see https://github.com/huggingface/huggingface_hub/issues/1853

severo commented 10 months ago

Current state:

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 2767 }
{ _id: 'HfHubHTTPError', count: 795 }
{ _id: 'TypeError', count: 633 }
{ _id: 'ZeroDivisionError', count: 621 }
{ _id: 'IOException', count: 514 }
{ _id: 'ReadTimeout', count: 245 }
{ _id: 'OSError', count: 151 }
{ _id: 'BinderException', count: 127 }
{ _id: 'ConnectionError', count: 119 }
{ _id: 'ValueError', count: 103 }
{ _id: 'ParserException', count: 91 }
{ _id: 'EntryNotFoundError', count: 66 }
{ _id: 'NotImplementedError', count: 66 }
{ _id: 'FileNotFoundError', count: 60 }
{ _id: 'NonMatchingSplitsSizesError', count: 43 }
{ _id: 'BrokenPipeError', count: 39 }
{ _id: 'InvalidInputException', count: 36 }
{ _id: 'IndexError', count: 30 }
{ _id: 'OutOfRangeException', count: 30 }
{ _id: 'HTTPException', count: 21 }
{ _id: 'LocationParseError', count: 17 }
{ _id: 'RuntimeError', count: 15 }
{ _id: 'KeyError', count: 13 }
{ _id: 'BadZipFile', count: 9 }
{ _id: 'Error', count: 7 }
{ _id: 'ExpectedMoreSplits', count: 5 }
{ _id: 'ArrowInvalid', count: 5 }
{ _id: 'ConversionException', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'ModuleNotFoundError', count: 3 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'NotPrimaryError', count: 3 }
{ _id: 'ParserError', count: 3 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'LocalEntryNotFoundError', count: 2 }
{ _id: 'RepositoryNotFoundError', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'JSONDecodeError', count: 1 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'RarCannotExec', count: 1 }
{ _id: 'OutOfMemoryException', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'SyntaxError', count: 1 }
{ _id: 'UnicodeDecodeError', count: 1 }
{ _id: 'EOFError', count: 1 }
AndreaFrancis commented 9 months ago

Updated list of UnexpectedErrors by kind:

[
  { _id: { kindkind: 'config-parquet-and-info' }, count: 8500 },
  { _id: { kindkind: 'split-descriptive-statistics' }, count: 2628 },
  { _id: { kindkind: 'split-duckdb-index' }, count: 794 }
]
severo commented 8 months ago

Current state:

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 3963 }
{ _id: 'TypeError', count: 958 }
{ _id: 'HfHubHTTPError', count: 778 }
{ _id: 'DatasetGenerationCastError', count: 287 }
{ _id: 'OSError', count: 219 }
{ _id: 'ValueError', count: 182 }
{ _id: 'ReadTimeout', count: 172 }
{ _id: 'ParserException', count: 127 }
{ _id: 'BinderException', count: 108 }
{ _id: 'ConnectionError', count: 103 }
{ _id: 'EntryNotFoundError', count: 77 }
{ _id: 'InvalidInputException', count: 76 }
{ _id: 'IOException', count: 72 }
{ _id: 'NotImplementedError', count: 69 }
{ _id: 'FileNotFoundError', count: 59 }
{ _id: 'ComputeError', count: 57 }
{ _id: 'NonMatchingSplitsSizesError', count: 50 }
{ _id: 'ColumnNotFoundError', count: 46 }
{ _id: 'RuntimeError', count: 34 }
{ _id: 'IndexError', count: 25 }
{ _id: 'ConversionException', count: 23 }
{ _id: 'HTTPException', count: 20 }
{ _id: 'ZeroDivisionError', count: 19 }
{ _id: 'LocationParseError', count: 15 }
{ _id: 'KeyError', count: 12 }
{ _id: 'BadZipFile', count: 11 }
{ _id: 'ArrowInvalid', count: 10 }
{ _id: 'ExpectedMoreSplits', count: 8 }
{ _id: 'ParserError', count: 8 }
{ _id: 'Error', count: 8 }
{ _id: 'InvalidOperationError', count: 7 }
{ _id: 'SchemaError', count: 5 }
{ _id: 'ReadError', count: 5 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'ArrowCapacityError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'DuplicateError', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TransactionException', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'UnicodeDecodeError', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'OutOfRangeException', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'NotPrimaryError', count: 1 }
{ _id: 'RepositoryNotFoundError', count: 1 }
{ _id: 'LocalEntryNotFoundError', count: 1 }
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kindkind: 'config-parquet-and-info' }, count: 9338 }
{ _id: { kindkind: 'split-descriptive-statistics' }, count: 2868 }
{ _id: { kindkind: 'split-duckdb-index' }, count: 847 }
{ _id: { kindkind: 'split-first-rows-from-parquet' }, count: 2 }
severo commented 8 months ago

I would bet that most errors occur for datasets with a script. I propose recreating all of these datasets. In most cases, this will produce a DatasetWithScriptNotSupportedError instead of some weird-looking error.

Number of unique datasets:

db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError" } },
  { $group: { _id: null, uniqueValues: { $addToSet: "$dataset" } } },
  { $project: { _id: 0, uniqueValues: 1 } },
  { $unwind: "$uniqueValues" },
  { $group: { _id: null, count: { $sum: 1 } } },
  { $project: { _id: 0, count: 1 } }
]);
{ count: 7484 }

I'm recreating the datasets one by one, with:

DATASETS=(...)
for dataset in "${DATASETS[@]}"; do
  curl -H "Authorization: Bearer $HF_TOKEN" -X POST "https://datasets-server.huggingface.co/admin/recreate-dataset?dataset=$dataset&priority=low"
done

I scaled the admin service from 2 to 4; let's see if it improves anything.

Requests are being processed at a rate of about 1 per second, so the 7484 datasets should hopefully be done in about two hours.
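
The same loop could be written in Python. The sketch below only builds the requests (it does not send them) and estimates the total duration from the rate quoted above. `build_recreate_request` and `estimate_hours` are hypothetical helper names; the endpoint and its parameters are copied from the curl command.

```python
from urllib.parse import urlencode
from urllib.request import Request

ADMIN_URL = "https://datasets-server.huggingface.co/admin/recreate-dataset"


def build_recreate_request(dataset: str, token: str) -> Request:
    """Build (without sending) the POST request issued by the curl loop."""
    # urlencode escapes the "/" in names such as "user/dataset".
    query = urlencode({"dataset": dataset, "priority": "low"})
    return Request(
        f"{ADMIN_URL}?{query}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def estimate_hours(num_datasets: int, requests_per_second: float) -> float:
    """Time to process every dataset at a constant request rate."""
    return num_datasets / requests_per_second / 3600


print(f"{estimate_hours(7484, 1.0):.1f} hours")  # 2.1 hours
```

Sending each request would be a matter of passing it to `urllib.request.urlopen`, ideally with a pause between calls to stay near the observed rate.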

severo commented 8 months ago

Today:

Number of datasets, by step and cause exception:
db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { "_id.kind": 1, count: -1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } 
]);
{ kind: 'config-parquet-and-info', num_datasets: 2486, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 1226, cause: 'DatasetGenerationCastError' }
{ kind: 'config-parquet-and-info', num_datasets: 575, cause: 'OSError' }
{ kind: 'config-parquet-and-info', num_datasets: 64, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 32, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 30, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 14, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 8, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'ReadError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'ArrowCapacityError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'IndexError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'ExpectedMoreSplits' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'RarCannotExec' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'JSONDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'AttributeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ModuleNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'split-descriptive-statistics', num_datasets: 935, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 56, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 12, cause: 'ComputeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 5, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-duckdb-index', num_datasets: 123, cause: 'InvalidInputException' }
{ kind: 'split-duckdb-index', num_datasets: 109, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'IOException' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'ConversionException' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'Error' }
{ kind: 'split-duckdb-index', num_datasets: 2, cause: 'TypeMismatchException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
AndreaFrancis commented 7 months ago

Today:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])

[
  { _id: { kind: 'config-parquet-and-info' }, count: 6215 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 2173 },
  { _id: { kind: 'split-duckdb-index' }, count: 2034 },
  { _id: { kind: 'split-duckdb-index-010' }, count: 777 },
  { _id: { kind: 'split-first-rows' }, count: 1 }
]
AndreaFrancis commented 7 months ago

Today:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}]) 
[
  { _id: { kind: 'config-parquet-and-info' }, count: 7373 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 3808 },
  { _id: { kind: 'split-duckdb-index' }, count: 3285 },
  { _id: { kind: 'split-first-rows' }, count: 206 }
]
AndreaFrancis commented 6 months ago

Today:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'config-parquet-and-info' }, count: 6668 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 3667 },
  { _id: { kind: 'split-duckdb-index' }, count: 2941 },
  { _id: { kind: 'dataset-loading-tags' }, count: 1539 },
  { _id: { kind: 'split-first-rows' }, count: 30 }
]
severo commented 4 months ago

The last PR (#2796) has a big impact!

72K -> 20K entries

[Screenshots: 2024-05-14, 08:47:29 and 08:47:35]

Replaced with 36K DatasetGenerationError and 12K DatasetGenerationCastError

[Screenshots: 2024-05-14, 08:49:38 and 08:49:44]
severo commented 4 months ago

Today:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kind: 'split-duckdb-index' }, count: 2871 }
{ _id: { kind: 'dataset-compatible-libraries' }, count: 2546 }
{ _id: { kind: 'split-descriptive-statistics' }, count: 1683 }
{ _id: { kind: 'config-parquet-and-info' }, count: 1407 }
{ _id: { kind: 'split-first-rows' }, count: 68 }
{ _id: { kind: 'split-image-url-columns' }, count: 2 }
AndreaFrancis commented 4 months ago

After refreshing some records:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'split-duckdb-index' }, count: 1380 },
  { _id: { kind: 'config-parquet-and-info' }, count: 1171 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 676 },
  { _id: { kind: 'dataset-compatible-libraries' }, count: 619 },
  { _id: { kind: 'split-first-rows' }, count: 68 },
  { _id: { kind: 'split-image-url-columns' }, count: 2 }
]
AndreaFrancis commented 4 months ago

Today (Almost half of yesterday's):

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'split-duckdb-index' }, count: 1236 },
  { _id: { kind: 'config-parquet-and-info' }, count: 588 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 301 },
  { _id: { kind: 'dataset-compatible-libraries' }, count: 209 },
  { _id: { kind: 'split-first-rows' }, count: 68 },
  { _id: { kind: 'split-image-url-columns' }, count: 2 }
]

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.countDocuments({error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}})
2405
severo commented 2 months ago

Today:

db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { count: -1, "_id.kind": 1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } 
]);

{ kind: 'dataset-compatible-libraries', num_datasets: 1507, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 288, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 262, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 203, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 181, cause: 'UnidentifiedImageError' }
{ kind: 'dataset-filetypes', num_datasets: 160, cause: 'BadZipFile' }
{ kind: 'split-descriptive-statistics', num_datasets: 157, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 148, cause: 'PermissionError' }
{ kind: 'split-duckdb-index', num_datasets: 144, cause: 'BinderException' }
{ kind: 'dataset-filetypes', num_datasets: 140, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 134, cause: 'ReadTimeout' }
{ kind: 'split-descriptive-statistics', num_datasets: 121, cause: 'ValueError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 96, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 93, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 77, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 73, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 69, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 65, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 52, cause: 'ReadError' }
{ kind: 'split-first-rows', num_datasets: 52, cause: 'ServerDisconnectedError' }
{ kind: 'split-duckdb-index', num_datasets: 50, cause: 'SchemaError' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'ComputeError' }
{ kind: 'split-duckdb-index', num_datasets: 48, cause: 'InvalidInputException' }
{ kind: 'config-parquet-and-info', num_datasets: 44, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 42, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 40, cause: 'ColumnNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 40, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ConnectionError' }
{ kind: 'split-duckdb-index', num_datasets: 32, cause: 'EntryNotFoundError' }
{ kind: 'dataset-filetypes', num_datasets: 31, cause: 'TypeError' }
{ kind: 'split-first-rows', num_datasets: 28, cause: 'ClientResponseError' }
{ kind: 'config-parquet-and-info', num_datasets: 25, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 24, cause: 'ArrowTypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 24, cause: 'EntryNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 24, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 21, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 19, cause: 'KeyError' }
{ kind: 'dataset-filetypes', num_datasets: 19, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 19, cause: 'DecompressionBombError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'IndexError' }
{ kind: 'split-descriptive-statistics', num_datasets: 14, cause: 'ComputeError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'ArrowCapacityError' }
{ kind: 'dataset-filetypes', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 10, cause: 'IOException' }
{ kind: 'split-first-rows', num_datasets: 10, cause: 'AttributeError' }
{ kind: 'split-first-rows', num_datasets: 9, cause: 'OSError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'KeyError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'split-first-rows', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'OSError' }
{ kind: 'split-first-rows', num_datasets: 7, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'JSONDecodeError' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'InternalException' }
{ kind: 'split-image-url-columns', num_datasets: 6, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'HTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'ConversionException' }
{ kind: 'config-parquet-and-info', num_datasets: 4, cause: 'DatasetGenerationCastError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 4, cause: 'TypeMismatchException' }
{ kind: 'split-first-rows', num_datasets: 4, cause: 'FSTimeoutError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'UnpicklingError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'ExpectedMoreSplits' }
{ kind: 'split-duckdb-index', num_datasets: 3, cause: 'Error' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 2, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'KeyError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ChunkedEncodingError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'IsADirectoryError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EmptyDataError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EOFError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 1, cause: 'EmptyDatasetError' }
{ kind: 'dataset-filetypes', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'TypeError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'error' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'OutOfMemoryException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientConnectorError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientPayloadError' }
severo commented 2 months ago

Note that we currently have 14K UnexpectedError entries, which is about 0.1% of the total cache entries, so this is not that crucial. I'll reduce the priority.

It may be more important to replace ConfigNamesError with the underlying error (100K entries), and to make DatasetGenerationError (50K entries) more explicit to help users debug their data files.
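
As an illustration of what "the underlying error" means for a wrapper such as ConfigNamesError, here is a sketch based on Python's exception chaining. It assumes the wrapper is raised with `raise ... from err`; the `ConfigNamesError` class below is a stand-in, and `underlying_cause_name` and `load_config_names` are hypothetical helpers, since the thread does not show how the project records `cause_exception`.

```python
from typing import Optional


class ConfigNamesError(Exception):
    """Stand-in for the project's wrapper error (assumption: it chains its cause)."""


def underlying_cause_name(error: BaseException) -> Optional[str]:
    """Return the class name of the innermost chained cause, or None if there is none."""
    cause = error.__cause__
    if cause is None:
        return None
    while cause.__cause__ is not None:
        cause = cause.__cause__
    return type(cause).__name__


def load_config_names() -> None:
    """Hypothetical job step that wraps a low-level failure."""
    try:
        raise FileNotFoundError("README.md")
    except FileNotFoundError as err:
        raise ConfigNamesError("Cannot get the config names") from err


try:
    load_config_names()
except ConfigNamesError as error:
    print(underlying_cause_name(error))  # FileNotFoundError
```

An error raised without a chained cause yields `None` here, which is analogous to the null cause entries in the earlier query results.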

severo commented 2 months ago

I created https://github.com/huggingface/dataset-viewer/issues/3010 and https://github.com/huggingface/dataset-viewer/issues/3011