voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0
8.85k stars 557 forks source link

Server needs interface method to serialize samples efficiently #186

Open brimoor opened 4 years ago

brimoor commented 4 years ago

This PR https://github.com/voxel51/fiftyone/pull/169 updated the behavior of fiftyone.core.sample.Sample.to_dict(), which caused the server to have to serialize samples in a non-ideal bson -> str -> json way: https://github.com/voxel51/fiftyone/pull/185/commits/85399782a5aff92f4fafe1799642d9250008926e.

The interface needs to be tweaked to ensure that all relevant use cases of sample serialization are supported cleanly.

tylerganter commented 4 years ago

Patch fix here: #185

tylerganter commented 4 years ago

@benjaminpkane am I correct in saying the only issue with Sample.to_dict() is that the id is missing?

import fiftyone as fo

from pprint import pprint

import json
from bson import json_util

dataset = fo.Dataset("cifar10-test")

dataset.add_sample(
    fo.Sample(
        filepath='/Users/tylerganter/fiftyone/cifar10/test/data/00041.jpg',
        tags=['test'],
        ground_truth=fo.Classification(label="deer")
    )
)

s = dataset.view().first()
pprint(s.to_dict())
pprint(json.loads(json_util.dumps(s.to_mongo())))

OUT:

{'filepath': '/Users/tylerganter/fiftyone/cifar10/test/data/00041.jpg',
 'ground_truth': {'_cls': 'Classification', 'label': 'deer'},
 'metadata': None,
 'tags': ['test']}
{'_id': {'$oid': '5ee23bdaa2cfbd9cc9ae5d82'},
 'filepath': '/Users/tylerganter/fiftyone/cifar10/test/data/00041.jpg',
 'ground_truth': {'_cls': 'Classification', 'label': 'deer'},
 'metadata': None,
 'tags': ['test']}
benjaminpkane commented 4 years ago

Yes, that is the only thing I saw

tylerganter commented 4 years ago

Okay. This will be an easy fix. I do have to say though, that I'm not a big fan in general of the paradigm we have for needing to support both user functionality and server functionality. Its almost as if we need a special type of private method that supports server functionality.

I did not think about this hard and deep at all, but I'll just through it out there: What if we have a child class for the server version of everything:

class ServerSample(Sample):
    def to_server_dict():
        """serializes for the server"""
        ...

Such that all of this functionality is hidden from the user.

benjaminpkane commented 4 years ago

Why do we need a class? I'm fine with a private method.

Up to you, but the reason I need id's right now is for cache busting images. I could probably change that. Are we hiding the entire notion of an id to the user?

tylerganter commented 4 years ago

Why do we need a class? I'm fine with a private method.

Up to you, but the reason I need id's right now is for cache busting images. I could probably change that. Are we hiding the entire notion of an id to the user?

The logic I went with, was that Sample.to_from/dict() are the python dictionary equivalents of to/from_json(). And:

fo.Sample.from_dict(sample_in_dataset.to_dict())
# new sample, NOT in dataset

fo.Sample.from_json(sample_in_dataset.from_json())
# new sample, NOT in dataset

If we started writing IDs to file we'd start getting a mess where we would have to check for an ID and ignore it if present which means you're not really loading this thing that you have serialized...

Thus I write support for Sample.to_mongo_dict which does NOT have a corresponding Sample.from_mongo_dict because the inverse operation would really be:

d = sample.to_mongo_dict()
sample2 = dataset[d["_id"]]
sample2 is sample
# True

While we have been using a mixture of private and public methods on the server, private methods shouldn't be used this way. There is no way for us to distinguish:

fo.Sample._method_that_should_only_be_called_internally_by_Sample()
fo.Sample._method_that_should_only_be_called_by_server()
brimoor commented 4 years ago

kibitz: I agree that, in an ideal world, we would avoid _method_that_should_only_be_called_by_server(). But, that pattern already exists a few dozen times in the codebase, so adding another here and moving onto more pressing things seems right