voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0

[BUG] Duplicate samples when merging grouped samples with key_fcn #3763

Closed by dahsing 1 year ago

dahsing commented 1 year ago

System information

Describe the problem

When I merge grouped samples into a dataset with merge_samples() and a key_fcn, the merged dataset contains duplicate samples.
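To make the expectation concrete, here is a minimal sketch of key-based merging in plain Python. It is not FiftyOne's implementation: merge_by_key and the dict records are hypothetical stand-ins, and the sketch assumes that a merge should update the record whose key already exists and insert a new record only when the key is unseen. Under that assumption, merging the same two records back into the collection should leave the count at 2.

```python
def merge_by_key(existing, incoming, key_fcn):
    """Merge `incoming` records into `existing`, matching on key_fcn(record).

    Hypothetical helper for illustration only; not part of FiftyOne.
    """
    # Index the existing records by their key
    by_key = {key_fcn(record): record for record in existing}

    for record in incoming:
        key = key_fcn(record)
        if key in by_key:
            # Key already present: update the existing record
            by_key[key].update(record)
        else:
            # Unseen key: insert a copy of the incoming record
            by_key[key] = dict(record)

    return list(by_key.values())


# Stand-ins for the two grouped samples (one per slice)
records = [
    {"filepath": "/workspace/s0.jpeg", "sid": "s0", "slice": "jpeg"},
    {"filepath": "/workspace/s0.pcd", "sid": "s0", "slice": "pcd"},
]

# Merging the same records again should not add anything
merged = merge_by_key(records, records, key_fcn=lambda r: r["filepath"])
print(len(merged))
```

The reported behavior differs from this sketch in that one of the two keys is treated as unseen, so an extra record is inserted.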

Code to reproduce issue

import fiftyone as fo

# Create the dataset
dataset_name = "test"
dataset = fo.Dataset(name=dataset_name, persistent=True)

dataset.add_group_field('group')
dataset.add_sample_field('sid', fo.StringField)
dataset.add_sample_field('key', fo.StringField)

# Generate one group with a jpeg slice and a pcd slice
samples = []
for x in range(1):
    group = fo.Group()
    sample01 = fo.Sample(
        filepath=f's{x}.jpeg',
        sid=f's{x}',
        key=f's{x}.jpeg',
        group=group.element('jpeg')
    )
    sample02 = fo.Sample(
        filepath=f's{x}.pcd',
        sid=f's{x}',
        key=f's{x}.pcd',
        group=group.element('pcd')
    )
    samples.extend([sample01,sample02])

# Add the samples to the dataset and print each group slice
dataset.add_samples(samples)
for element in dataset.group_slices:
    dataset.group_slice = element
    for sample in dataset:
        print(sample)

# Key function used by merge_samples() after the dataset is reloaded
def _key_fcn(sample):
    # A composite key, f"{sample['filepath']}-{sample['sid']}", gives the
    # same result as returning the filepath alone
    return sample.filepath

dataset = fo.load_dataset(dataset_name)

# Merging with key_fcn triggers the bug
dataset.merge_samples(samples, key_fcn=_key_fcn)

# Merging with key_field does not trigger the bug
# dataset.merge_samples(samples, key_field="key")

# Print the merged dataset, one group slice at a time
for element in dataset.group_slices:
    dataset.group_slice = element
    for sample in dataset:
        print(sample)

Output of the code above

As the output below shows, the dataset has 2 samples after add_samples() but 3 samples after merge_samples().

# python issue.py

# samples after add.
<Sample: {
    'id': '65435430a9da4d6966c2a9d2',
    'media_type': 'image',
    'filepath': '/workspace/s0.jpeg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '65435430a9da4d6966c2a9d1', 'name': 'jpeg'}>,
    'sid': 's0',
    'key': 's0.jpeg',
}>

<Sample: {
    'id': '65435430a9da4d6966c2a9d3',
    'media_type': 'point-cloud',
    'filepath': '/workspace/s0.pcd',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '65435430a9da4d6966c2a9d1', 'name': 'pcd'}>,
    'sid': 's0',
    'key': 's0.pcd',
}>

Indexing dataset...
Merging samples...

# samples after merge: the jpeg sample is duplicated under a new id
<Sample: {
    'id': '65435430a9da4d6966c2a9d2',
    'media_type': 'image',
    'filepath': '/workspace/s0.jpeg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '65435430a9da4d6966c2a9d1', 'name': 'jpeg'}>,
    'sid': 's0',
    'key': 's0.jpeg',
}>
<Sample: {
    'id': '65435430a9da4d6966c2a9d4',
    'media_type': 'image',
    'filepath': '/workspace/s0.jpeg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '65435430a9da4d6966c2a9d1', 'name': 'jpeg'}>,
    'sid': 's0',
    'key': 's0.jpeg',
}>
<Sample: {
    'id': '65435430a9da4d6966c2a9d3',
    'media_type': 'point-cloud',
    'filepath': '/workspace/s0.pcd',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '65435430a9da4d6966c2a9d1', 'name': 'pcd'}>,
    'sid': 's0',
    'key': 's0.pcd',
}>
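
A quick way to confirm the duplication is to count how many records share each key. The sketch below does this in plain Python on dicts that copy the ids and filepaths from the output above; find_duplicate_keys is a hypothetical helper, not a FiftyOne function.

```python
from collections import Counter


def find_duplicate_keys(records, key_fcn):
    """Return {key: count} for every key that occurs more than once.

    Hypothetical helper for illustration only; not part of FiftyOne.
    """
    counts = Counter(key_fcn(record) for record in records)
    return {key: n for key, n in counts.items() if n > 1}


# Ids and filepaths copied from the post-merge output above
after_merge = [
    {"id": "65435430a9da4d6966c2a9d2", "filepath": "/workspace/s0.jpeg"},
    {"id": "65435430a9da4d6966c2a9d4", "filepath": "/workspace/s0.jpeg"},
    {"id": "65435430a9da4d6966c2a9d3", "filepath": "/workspace/s0.pcd"},
]

duplicates = find_duplicate_keys(after_merge, key_fcn=lambda r: r["filepath"])
print(duplicates)  # {'/workspace/s0.jpeg': 2}
```

Only the jpeg filepath appears twice, with ids ending in d2 and d4; the pcd sample is unaffected.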

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the FiftyOne codebase?

brimoor commented 1 year ago

@dahsing thanks for catching this! Fixed by https://github.com/voxel51/fiftyone/pull/3816