Closed jeremyarancio closed 1 week ago
This PR should also contain the fix for images, but it is not done for 2 reasons:
1. The data contained in the field images is actually not that relevant for users.
2. The field JSON structure is not consistent across rows, leading to issues during the conversion to Parquet. Not sure if the problem can be solved...
We should then reformat the images field so that it is Parquet-compatible. One possibility would be to make images a list of dicts such as:
{
  "sizes": {
    "100": {
      "h": 100,
      "w": 56
    },
    "400": {
      "h": 400,
      "w": 225
    },
    "full": {
      "h": 3555,
      "w": 2000
    }
  },
  "uploaded_t": "1490702616",
  "uploader": "twan51",
  "imgid": "5",
  "key": "front_fr"
}
The difference here is that the dictionary key would be part of the item directly.
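For contrast, here is roughly the current shape (abbreviated, inferred from the example above): the images field is a dict addressed by keys such as "front_fr", with everything else nested under them.

"front_fr": {
  "sizes": { ... },
  "imgid": "5",
  "uploaded_t": "1490702616",
  "uploader": "twan51"
}

Since the keys differ per product, and values switch between strings and ints (see below), no single struct schema fits every row.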
The data contained in the field images is actually not that relevant for users.
It is important, it's one of the fields I (and other reusers) use the most often to fetch the images ;)
Ok, if it is actually useful.
I tried different methods, including what you proposed, but I cannot manage to get the field schema recognized by Parquet... Here's the "list of dicts" solution in DuckDB:
CREATE OR REPLACE TABLE processed_images AS (
    WITH processed_images AS (
        SELECT
            code,
            json_merge_patch(
                to_json({'key': unnest(json_keys(images))}),
                unnest(images->'$.*')
            ) AS merged
        FROM jsonl_sample
    )
    SELECT
        code,
        array_agg(merged) AS images_array
    FROM processed_images
    GROUP BY code
);
SELECT
    jsonl.code,
    <...>,
    processed_images.images_array
FROM jsonl_sample AS jsonl
LEFT JOIN processed_images
    ON jsonl.code = processed_images.code;
A possibility would be to iterate over the JSONL from Python code, post-process the data, and create the Parquet file directly by submitting the data and the schema, no?
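A minimal sketch of that idea with pyarrow, assuming hypothetical file names, a reduced schema, and a reshape helper (not the project's actual code):

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Explicit Arrow schema for the reshaped images field: a list of structs,
# with the former dictionary key stored inside each item.
size_type = pa.struct([("h", pa.int32()), ("w", pa.int32())])
image_type = pa.struct([
    ("key", pa.string()),
    ("imgid", pa.string()),
    ("sizes", pa.struct([("100", size_type), ("400", size_type), ("full", size_type)])),
    ("uploaded_t", pa.string()),
    ("uploader", pa.string()),
])
schema = pa.schema([("code", pa.string()), ("images", pa.list_(image_type))])

def reshape_images(images: dict) -> list[dict]:
    # Move each dictionary key inside its item and force types,
    # so rows with str/int mixtures still fit the fixed schema.
    items = []
    for key, value in images.items():
        sizes = value.get("sizes") or {}
        items.append({
            "key": str(key),
            "imgid": str(value.get("imgid", "")),
            "sizes": {
                s: {"h": int((sizes.get(s) or {}).get("h") or 0),
                    "w": int((sizes.get(s) or {}).get("w") or 0)}
                for s in ("100", "400", "full")
            },
            "uploaded_t": str(value.get("uploaded_t", "")),
            "uploader": str(value.get("uploader", "")),
        })
    return items

records = []
with open("sample.jsonl") as f:  # hypothetical input file
    for line in f:
        row = json.loads(line)
        records.append({"code": row.get("code"),
                        "images": reshape_images(row.get("images") or {})})

pq.write_table(pa.Table.from_pylist(records, schema=schema), "sample.parquet")

Building one in-memory table like this will not scale to the full dump, though; a real version would write row groups incrementally, e.g. with pq.ParquetWriter.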
The type of problem I'm dealing with, in addition to how large the dataset is and other bugs... -> Data conversion is not done right because the data is not consistent across the JSONL
Data conversion is not done right because the data is not consistent across the JSONL
What's not consistent for example?
Check above, the data is either a string or an int for the same item
Ok, I think you should use a data validation library such as pydantic (https://pydantic.dev/) to validate the input data and force the correct type! Pydantic should be fast enough for this purpose.
Pydantic could be useful later on to impose a data contract on the JSONL, to catch any modification in the data schema before the conversion process. Who knows how the JSONL will change in the coming months. But it doesn't "force" the correct type: the correct type needs to be manually converted, like in this example below.
I'll handle future incorrect JSONL field types using try & except for each column; on except, the column is left unconverted and a warning message is emitted (see the sketch after the example below).
# image is one product's raw "images" dict from the JSONL
[
    {
        "key": str(key),
        "imgid": str(value.get("imgid", "unknown")),
        "sizes": {
            "100": {
                # (or 0) because "h" or "w" can be None, which would make int() fail
                "h": int(value.get("sizes", {}).get("100", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("100", {}).get("w", 0) or 0),
            },
            "200": {
                "h": int(value.get("sizes", {}).get("200", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("200", {}).get("w", 0) or 0),
            },
            "400": {
                "h": int(value.get("sizes", {}).get("400", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("400", {}).get("w", 0) or 0),
            },
            "full": {
                "h": int(value.get("sizes", {}).get("full", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("full", {}).get("w", 0) or 0),
            },
        },
        "uploaded_t": str(value.get("uploaded_t", "unknown")),
        "uploader": str(value.get("uploader", "unknown")),
    }
    for key, value in image.items()
]
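A minimal sketch of that per-column try & except fallback, assuming a flat list of row dicts and a hypothetical convert_column helper:

import logging

logger = logging.getLogger(__name__)

def convert_column(rows: list[dict], column: str, cast) -> None:
    # Try to cast one column across all rows; if any value resists,
    # leave the whole column unconverted and warn instead of failing.
    try:
        converted = {i: cast(row[column]) for i, row in enumerate(rows) if column in row}
    except (TypeError, ValueError) as exc:
        logger.warning("Column %r left unconverted: %s", column, exc)
        return
    for i, value in converted.items():
        rows[i][column] = value

# e.g. convert_column(rows, "uploaded_t", int) converts the column when every
# value is castable, and only logs a warning otherwise.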
But it doesn't "force" the correct type.
It can definitely do that (note the inconsistent types for h and w in "sizes"):
from pydantic import BaseModel
from typing import Literal

class ImageSize(BaseModel):
    h: int
    w: int

class Image(BaseModel):
    key: str
    sizes: dict[Literal["100", "200", "400", "full"], ImageSize]
    uploaded_t: int
    imgid: int | None = None
    uploader: str | None = None

raw_data = {
    "key": "nutrition_fr",
    "imgid": "3",
    "sizes": {
        "100": {"h": 100, "w": 100},
        "200": {"h": "200", "w": 200},
        "full": {"h": 600, "w": "600"},
    },
    "uploaded_t": 1620000000,
}

image = Image(**raw_data)
print(image.model_dump())
# Output:
# {'key': 'nutrition_fr', 'sizes': {'100': {'h': 100, 'w': 100}, '200': {'h': 200, 'w': 200}, 'full': {'h': 600, 'w': 600}}, 'uploaded_t': 1620000000, 'imgid': 3, 'uploader': None}
Pydantic is a validation library, so it can convert automatically from a string to an int, as long as the conversion is possible and makes sense.
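And when the conversion is not possible, validation fails loudly instead of letting the bad value through. A quick check reusing the ImageSize model above (the exact error wording is approximate):

from pydantic import ValidationError

try:
    ImageSize(h="not-a-number", w=100)
except ValidationError as exc:
    print(exc)
    # 1 validation error for ImageSize
    # h
    #   Input should be a valid integer, unable to parse string as an integer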
That's actually cleaner indeed!
openfoodfacts/openfoodfacts-exports#6 Full postprocessing works 🔥 Still would like to add unit tests and exception handling (draft)