voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0

[FR] Proposal: use estimated document count for Dataset.__len__() and Dataset.summary() #3075

Open swheaton opened 1 year ago

swheaton commented 1 year ago

Proposal Summary

Proposing we use collStats instead of a count aggregation for Dataset.__len__() and Dataset.summary() in some or all cases. This approach can only be used for the length of the full dataset, because it is not possible with any view / filter applied.
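For context, here is a minimal pymongo sketch of the difference between the two counting approaches (the connection and collection names are assumptions for illustration, not FiftyOne's actual internals):

from pymongo import MongoClient

client = MongoClient()  # assumption: default local deployment
coll = client["fiftyone"]["samples.my_dataset"]  # assumption: illustrative collection name

# Exact count: iterates/aggregates over the documents; slow for very large collections
exact = coll.count_documents({})

# Estimated count: reads the collection metadata (collStats); fast but approximate
estimate = coll.estimated_document_count()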

Motivation

What areas of FiftyOne does this feature affect?

Details

Dataset.stats() uses collStats, which is equivalent to estimated_document_count. According to the MongoDB documentation, this is considered an estimated count but is usually quite accurate. Emphasis mine:

The count is based on the collection's metadata, which provides a fast but sometimes inaccurate count for sharded clusters.

After an unclean shutdown, the count may be incorrect.

After an unclean shutdown of a mongod using the Wired Tiger storage engine, count statistics reported by db.collection.estimatedDocumentCount() may be inaccurate.

The amount of drift depends on the number of insert, update, or delete operations performed between the last checkpoint and the unclean shutdown. Checkpoints usually occur every 60 seconds. However, mongod instances running with non-default --syncdelay settings may have more or less frequent checkpoints.

So with typical MongoDB settings and ignoring unclean shutdown situations, getting the length of a dataset in this way should be accurate to within 60s. Given the high read-to-write ratio of most FiftyOne users, this is probably accurate enough in almost all cases.

One potential mitigation to the unlikely cases referenced above is to first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count(), because it is unlikely to have a big performance impact.
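A rough sketch of that mitigation (the threshold value is purely illustrative):

SMALL_DATASET_THRESHOLD = 10_000  # the "N" above; actual value TBD

def fast_len(dataset):
    # fast, estimated count backed by collStats
    estimate = dataset.stats()["samples_count"]
    if estimate < SMALL_DATASET_THRESHOLD:
        # small collection: an exact count is cheap, so prefer it for safety
        return dataset.count()
    return estimate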

Potential issue

One issue with this is group datasets. Currently, length is determined as the number of groups, not the number of samples. collStats, by its nature, can only determine the number of samples, since it has no concept of FiftyOne groups, only documents.

Options:

  1. Accept that printing or getting the length of large group datasets will be slow
  2. Change the behavior of length for grouped datasets to be the number of samples, not the number of groups
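To illustrate why the group count is the expensive part: counting distinct groups requires an aggregation pass over the documents, whereas collStats never touches them. A rough pymongo sketch, assuming (for illustration only) that group membership lives in a "group._id" field:

from pymongo import MongoClient

coll = MongoClient()["fiftyone"]["samples.my_dataset"]  # same illustrative handle as above

# counting groups requires scanning documents and bucketing by group id
pipeline = [
    {"$group": {"_id": "$group._id"}},  # one bucket per group
    {"$count": "num_groups"},
]
num_groups = next(coll.aggregate(pipeline), {}).get("num_groups", 0)

# the metadata-based estimate, by contrast, can only ever report documents (samples)
num_samples_estimate = coll.estimated_document_count()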

Rebuttal

But the user can just do dataset.stats()["samples_count"] on their own if they want it to be fast; we don't need to change len(dataset)?

This is much less natural and requires knowledge of the internal MongoDB methods being used for each. Plus, there is no option to do this for print(dataset) / dataset.summary().

Willingness to contribute

The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?

brimoor commented 1 year ago

Using estimated count for summary() and hence print(dataset) definitely makes sense. It's just informational.

A concern with using an inexact count for dataset.count() and hence len(dataset) is that users may be using this count for operations that require exactness.

For example, if len(dataset) is not exact, users may run into unhappiness when using certain batch update patterns:

dataset.set_values("foo", ["bar"] * len(dataset))

Now, in a multiuser environment like Teams, one should not really be doing operations like that anyway. You should structure your operations based on immutable things like IDs so that the code is robust to coincident unrelated updates:

dataset.set_values("foo", {"id1": "bar", "id2": "spam", "id3": "eggs", ...}, key_field="id")
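For concreteness, one way to build such an ID-keyed mapping (compute_foo() is a hypothetical helper standing in for your own logic):

def compute_foo(sample_id):
    # hypothetical per-sample computation
    return "bar"

ids = dataset.values("id")
dataset.set_values("foo", {_id: compute_foo(_id) for _id in ids}, key_field="id")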

But, in OSS, since it's a local dataset and you're generally the only user (is database sharing the new Netflix password sharing? 🤣), it would be reasonable to expect that things like this should work:

import fiftyone as fo

# lots of samples
samples = ...

dataset = fo.Dataset()
dataset.add_samples(samples)

# oops I forgot a field. Ah I know, I'll add it efficiently via batching!
values = [... for _ in range(len(dataset))]
dataset.set_values("foo", values)
swheaton commented 1 year ago

Yeah, that makes sense. It's more exact than the name "estimate" makes it seem, but to rely on the length for exact operations we will unfortunately have to keep using the slow way.

This was how I originally addressed this point:

One potential mitigation to the unlikely cases referenced above is to first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count(), because it is unlikely to have a big performance impact.

But users could still write a script that uses len() on a dataset of size N+1, and it may or may not work as expected.

I will submit a PR that uses stats() for summary() then, but NOT for count() or __len__().
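Roughly, the split would look like this (illustrative only, not the actual implementation):

# summary() / print(dataset): informational, so the fast estimate is acceptable
num_samples_for_summary = dataset.stats()["samples_count"]

# count() / len(dataset): must stay exact, so keep the full count
num_samples_exact = dataset.count()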

shortcipher3 commented 1 year ago

I liked the suggestion to: "first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count()"

Following up on the documentation @swheaton shared (emphasis mine):

"The count is based on the collection's metadata, which provides a fast but sometimes inaccurate count for sharded clusters."

For a local dataset, I wouldn't expect sharded clusters to be in use, so I expect this to be a non-issue.

Unless I'm being naive, here is another possible workaround: use db.printShardingStatus or sh.status to find the sharding status of the MongoDB deployment, maybe at FiftyOne load time, and then use that to determine whether the more accurate dataset.count() should be used. Maybe other operations will also be impacted by sharding.
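For example, one driver-level way to detect this (a sketch assuming a recent MongoDB server; older servers use the isMaster command instead of hello) is to check whether the client is connected to a mongos router:

from pymongo import MongoClient

client = MongoClient()  # assumption: whatever connection FiftyOne already holds

# When connected to a mongos router (i.e., a sharded cluster), the "hello"
# command reports msg == "isdbgrid"
hello = client.admin.command("hello")
is_sharded_deployment = hello.get("msg") == "isdbgrid"

# hypothetical policy: only force exact counts when sharding could skew the estimate
use_exact_count = is_sharded_deployment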

Curious if you have lots of customers using sharding?

shortcipher3 commented 1 year ago

I guess the unclean shutdown is still a possible issue though.