lidatong / dataclasses-json

Easily serialize Data Classes to and from JSON
MIT License
1.34k stars 150 forks source link

[BUG] warning message dumps large amount of data to stdout #537

Open s4ke opened 2 weeks ago

s4ke commented 2 weeks ago

Description

We just ran into the case where the warnings logged out huge chunks of actual data into stdout. While I understand where this comes from, this is a bad idea because this can leak PII data into stdout/stderr by accident. Plus: The old behaviour was "just working as intended" for our usecase.

In our case, this was caused by this class: https://github.com/neuroforgede/nfcompose/blob/1ad30313e1bdbdb7c3d8e35fd74f905924e2003e/client/compose_client/library/models/definition/datapoint.py#L32

Note that we are using dataclass_json in serialization, but have extra code preventing the non primitive data reaching the actual serialization into json.

Example log (from python 3.8):

/home/<....>/integration/function_layout/compose/venv/lib/python3.8/site-packages/dataclasses_json/core.py:342: UserWarning: Failed to decode {'data': {'season': '', 'branchCombination': '',

Code snippet that reproduces the issue

# This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. 
# If a copy of the MPL was not distributed with this file, 
# You can obtain one at https://mozilla.org/MPL/2.0/.
# This file is part of NF Compose
# [2019] - [2024] © NeuroForge GmbH & Co. KG

from dataclasses import dataclass, field
from typing import Any, Dict, Union, BinaryIO

from dataclasses_json import dataclass_json, Undefined

from compose_client.library.models.definition.data_series_definition import DataSeriesDefinition
from compose_client.library.models.identifiable import Identifiable
from compose_client.library.models.raw.datapoint import RawDataPoint

# explicitly no dataclass so we dont forget to implement a proper serializer
class FileTypeContent:
    url: str

    def __init__(self, url: str):
        self.url = url

Primitive = Union[str, float, int, bool]

@dataclass_json(undefined=Undefined.EXCLUDE)
@dataclass
class DataPoint(Identifiable):
    external_id: str
    payload: Dict[str, Union[Primitive, FileTypeContent, BinaryIO]]
    identify_dimensions_by_external_id: bool = field(default=True)

    @staticmethod
    def from_raw(raw: RawDataPoint, definition: DataSeriesDefinition) -> 'DataPoint':
        payload = raw.payload.copy()
        for file_like_fact in [*definition.structure.file_facts, *definition.structure.image_facts]:
            if file_like_fact.external_id in payload:
                payload[file_like_fact.external_id] = FileTypeContent(url=payload[file_like_fact.external_id])
        return DataPoint(
            external_id=raw.external_id,
            payload=payload
        )

    def to_dict(self) -> Any: ...

    @staticmethod
    def from_dict(dict: Dict[str, Any]) -> 'DataPoint': ...

Describe the results you expected

The warning should not log out the data unprompted. This is a data security issue.

Python version you are using

Python 3.10.12

Environment description

certifi==2024.6.2 charset-normalizer==3.3.2 click==8.1.7 compose_client @ https://github.com/neuroforgede/nfcompose/releases/download/2.2.1/compose_client-2.2.1.tar.gz#sha256=08b5d99570e34734b1c5938c26cd57b456282443961d19a69754a096b8f8b14d dataclasses-json==0.6.7 idna==3.7 marshmallow==3.21.3 mypy-extensions==1.0.0 packaging==24.1 requests==2.32.3 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.2

s4ke commented 2 weeks ago

I think we should improve the code on our end. I am just pointing out that maybe data should not be logged out to stdout by default.

george-zubrienko commented 2 weeks ago

@DavidCain do you think it will be possible to turn warnings off with a config flag?

DavidCain commented 2 weeks ago

@george-zubrienko - oh absolutely, one could definitely make a config flag if desired. However, filterwarnings can also be used right now to silence warnings on dataclasses-json==0.6.7 (perhaps that's helpful to you, @s4ke ?)

I think @s4ke makes a solid point though about having sensible logging as the default -- it's probably not wise to log the entirety of the payload when the warning message simply needs to say that the Union wasn't properly handled (the traceback can be used for more information).

Just to be clear, though, I believe this is a different warning than the one I recently added. The warning message in this bug report comes from: https://github.com/lidatong/dataclasses-json/blob/8973c336d230387f948ece3c90287fb60c1942b9/dataclasses_json/core.py#L337-L338

The warning message I wrote should only ever mention the type annotation, and I would hope that PII or other sensitive information isn't in a type annotation.

DavidCain commented 2 weeks ago

tl;dr: I imagine you could solve this concern with a smaller warning message:


                    warnings.warn(
-                        f"Failed to decode {value} Union dataclasses."
+                        "Failed to decode Union dataclasses. "
-                        f"Expected Union to include a matching dataclass and it didn't."
+                        "Expected Union to include a matching dataclass and it didn't."
                    )
DavidCain commented 2 weeks ago

@s4ke you'd originally commented on my PR:

We just ran into the case where according to this PR buggy behaviour logged out huge chunks of actual data into stdout. While I understand where this comes from, this is a bad idea because this can leak PII data into stdout by accident. Plus: The old behaviour was "just working as intended" for our usecase.

I assume you deleted the comment because the warning is actually coming from a different part of the codebase? (Let me know if you'd prefer me not repeat your comments here, I can edit to delete).

s4ke commented 1 week ago

Its fine. The comment didnt make sense after all. After some digging I found that the issue was not introduced by new code but rather already existed for a while. Turns out the Code path that caused this issue was not really known yet.

This was an error on my part. Sorry about that. I deleted my message because it was nonsense and completely in the wrong place. And to be fair, the tone was out of place as well.