fix: :art: Parquet data type fixed

jeremyarancio commented 3 weeks ago

What

Timestamps in the JSONL/Parquet were in the wrong format: Unix timestamp
Using DuckDb, this problem is now fixed in the Parquet file => HF

Screenshot

Part of

https://github.com/openfoodfacts/openfoodfacts-server/issues/7660

jeremyarancio commented 3 weeks ago

This PR should also contain the fix for images but it is not done for 2 reasons:

The field json structure is not consistent across rows, leading to the issue during the conversion to Parquet. Not sure if the problem can be solved...
The data contained in the field images is actually not that relevant for users.

raphael0202 commented 3 weeks ago

The field json structure is not consistent across rows, leading to the issue during the conversion to Parquet. Not sure if the problem can be solved...

We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts such as:

{
    "sizes": {
        "100": {
            "h": 100,
            "w": 56
        },
        "400": {
            "h": 400,
            "w": 225
        },
        "full": {
            "h": 3555,
            "w": 2000
        }
    },
    "uploaded_t": "1490702616",
    "uploader": "twan51"
    "imgid": "5",
    "key": "front_fr"
}

The difference here is that the dictionary key would be part of the item directly.

The data contained in the field images is actually not that relevant for users.

It is important, it's one of the field I (and other reusers) use the most often to fetch the images ;)

jeremyarancio commented 3 weeks ago

The field json structure is not consistent across rows, leading to the issue during the conversion to Parquet. Not sure if the problem can be solved...

We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts such as:
{
    "sizes": {
        "100": {
            "h": 100,
            "w": 56
        },
        "400": {
            "h": 400,
            "w": 225
        },
        "full": {
            "h": 3555,
            "w": 2000
        }
    },
    "uploaded_t": "1490702616",
    "uploader": "twan51"
    "imgid": "5",
    "key": "front_fr"
}
The difference here is that the dictionary key would be part of the item directly.

The data contained in the field images is actually not that relevant for users.

It is important, it's one of the field I (and other reusers) use the most often to fetch the images ;)

Ok, if it is actually useful

jeremyarancio commented 3 weeks ago

The field json structure is not consistent across rows, leading to the issue during the conversion to Parquet. Not sure if the problem can be solved...

We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts such as:
{
    "sizes": {
        "100": {
            "h": 100,
            "w": 56
        },
        "400": {
            "h": 400,
            "w": 225
        },
        "full": {
            "h": 3555,
            "w": 2000
        }
    },
    "uploaded_t": "1490702616",
    "uploader": "twan51"
    "imgid": "5",
    "key": "front_fr"
}
The difference here is that the dictionary key would be part of the item directly.

The data contained in the field images is actually not that relevant for users.

It is important, it's one of the field I (and other reusers) use the most often to fetch the images ;)
Ok, if it is actually useful

I tried different methods, also what you proposed, but I cannot manage to get the field schema recognized by Parquet... Here's the "list of dict" solution in DuckDB:

CREATE OR REPLACE TABLE processed_images AS (
        WITH processed_images AS (
            SELECT code,
                json_merge_patch(
                    to_json({'key': unnest(json_keys(images)) }),
                    unnest(images->'$.*')
                ) AS merged
            FROM jsonl_sample
        )
        SELECT 
            code,
            array_agg(merged) AS images_array
        FROM processed_images
        GROUP BY code
    )
;
SELECT 
        jsonl.code,
        <...>, 
        processed_images.images_array
FROM jsonl_sample AS jsonl
LEFT JOIN processed_images
ON jsonl.code = processed_images.code;

raphael0202 commented 2 weeks ago

A possibility would be to iterate over the JSONL from Python code, post-process the data and create the parquet file directly by submitting the data and the schema, no?

jeremyarancio commented 1 week ago

The type of problem I'm dealing with, in addition to how large the dataset is and other bugs... -> Data conversion is not done right because the data is not consistent across the JSONL

raphael0202 commented 1 week ago

Data conversion is not done right because the data is not consistent across the JSONL

What's not consistent for example?

jeremyarancio commented 1 week ago

Data conversion is not done right because the data is not consistent across the JSONL

What's not consistent for example?

Check above, the data is either a string or an int for the same item

raphael0202 commented 1 week ago

Ok, if think you should use a data validation library such as pydantic (https://pydantic.dev/) to validate the input data and force the correct type! Pydantic should be fast enough for this purpose.

jeremyarancio commented 1 week ago

Pydantic could be useful later on to impose a data contract to the JSONL, to prevent any modification in the data schema before the conversion process. Who knows how the JSONL will change in months But it doesn't "force" the correct type. The correct type needs to be manually converted, like this example below.

I'll handle future incorrect JSONL field type using try & except for each column, except leading the column not being converted with a Warning message.

[
                    {
                        "key": str(key),
                        "imgid": str(value.get("imgid", "unknown")),
                        "sizes": {
                            "100": {
                                "h": int(value.get("sizes", {}).get("100", {}).get("h", 0) or 0), # (or 0) because "h" or "w" can be none, leading to an error with int
                                "w": int(value.get("sizes", {}).get("100", {}).get("w", 0) or 0),
                            },
                            "200": {
                                "h": int(value.get("sizes", {}).get("200", {}).get("h", 0) or 0),
                                "w": int(value.get("sizes", {}).get("200", {}).get("w", 0) or 0),
                            },
                            "400": {
                                "h": int(value.get("sizes", {}).get("400", {}).get("h", 0) or 0),
                                "w": int(value.get("sizes", {}).get("400", {}).get("w", 0) or 0),
                            },
                            "full": {
                                "h": int(value.get("sizes", {}).get("full", {}).get("h", 0) or 0),
                                "w": int(value.get("sizes", {}).get("full", {}).get("w", 0) or 0),
                            },
                        },
                        "uploaded_t": str(value.get("uploaded_t", "unknown")),
                        "uploader": str(value.get("uploader", "unknown")),
                    }
                    for key, value in image.items()
                ]

raphael0202 commented 1 week ago

But it doesn't "force" the correct type.

It can definitely do that (note the inconsistent types for h and w in "sizes"):

from pydantic import BaseModel
from typing import Literal

class ImageSize(BaseModel):
    h: int
    w: int

class Image(BaseModel):
    key: str
    sizes: dict[Literal["100", "200", "400", "full"], ImageSize]
    uploaded_t: int
    imgid: int | None = None
    uploader: str | None = None

raw_data = {
    "key": "nutrition_fr",
    "imgid": "3",
    "sizes": {
        "100": {"h": 100, "w": 100},
        "200": {"h": "200", "w": 200},
        "full": {"h": 600, "w": "600"}
    },
    "uploaded_t": 1620000000,
}

image = Image(**raw_data)

print(image.model_dump())

# Output:
# {'key': 'nutrition_fr', 'sizes': {'100': {'h': 100, 'w': 100}, '200': {'h': 200, 'w': 200}, 'full': {'h': 600, 'w': 600}}, 'uploaded_t': 1620000000, 'imgid': 3, 'uploader': None}

Pydantic is a validation library, so it can converts automatically from a string to an int, as long as the conversion is possible and makes sense.

jeremyarancio commented 1 week ago

That's actually cleaner indeed!

jeremyarancio commented 1 week ago

openfoodfacts/openfoodfacts-exports#6 Full postprocessing works 🔥 Still would like to add unittests and exception handling (draft)

openfoodfacts / robotoff