voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0

[FR] Proposal: use estimated document count for Dataset.__len__() and Dataset.summary() #3075

Open swheaton opened 1 year ago

swheaton commented 1 year ago

Proposal Summary

Proposing we use collStats instead of a count aggregation for Dataset.__len__() and Dataset.summary() in some or all cases. This approach can only be used for the length of the full dataset, because it is not possible with any view / filter applied.
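For context, here is a minimal pymongo sketch of the difference between the two counting approaches (the connection and collection names are assumptions for illustration, not FiftyOne's actual internals):

from pymongo import MongoClient

client = MongoClient()  # assumption: default local deployment
coll = client["fiftyone"]["samples.my_dataset"]  # assumption: illustrative collection name

# Exact count: iterates/aggregates over the documents; slow for very large collections
exact = coll.count_documents({})

# Estimated count: reads the collection metadata (collStats); fast but approximate
estimate = coll.estimated_document_count()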

Motivation

What areas of FiftyOne does this feature affect?

Details

Dataset.stats() uses collStats, which is equivalent to estimated_document_count. According to the MongoDB documentation, this is considered an estimated count but is usually quite accurate. Emphasis mine:

The count is based on the collection's metadata, which provides a fast but sometimes inaccurate count for sharded clusters.

After an unclean shutdown, the count may be incorrect.

After an unclean shutdown of a mongod using the Wired Tiger storage engine, count statistics reported by db.collection.estimatedDocumentCount() may be inaccurate.

The amount of drift depends on the number of insert, update, or delete operations performed between the last checkpoint and the unclean shutdown. Checkpoints usually occur every 60 seconds. However, mongod instances running with non-default --syncdelay settings may have more or less frequent checkpoints.

So with typical MongoDB settings and ignoring unclean shutdown situations, getting the length of a dataset in this way should be accurate to within 60s. Given the high read-to-write ratio of most FiftyOne users, this is probably accurate enough in almost all cases.

One potential mitigation to the unlikely cases referenced above is to first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count(), because it is unlikely to have a big performance impact.
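A rough sketch of that mitigation (the threshold value is purely illustrative):

SMALL_DATASET_THRESHOLD = 10_000  # the "N" above; actual value TBD

def fast_len(dataset):
    # fast, estimated count backed by collStats
    estimate = dataset.stats()["samples_count"]
    if estimate < SMALL_DATASET_THRESHOLD:
        # small collection: an exact count is cheap, so prefer it for safety
        return dataset.count()
    return estimate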

Potential issue

One issue with this is group datasets. Currently, length is determined as the number of groups, not the number of samples. collStats, by its nature, can only determine the number of samples, since it has no concept of FiftyOne groups, only documents.

Options:

  1. Accept that printing or getting the length of large group datasets will be slow
  2. Change the behavior of length for grouped datasets to be the number of samples, not the number of groups
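To illustrate why the group count is the expensive part: counting distinct groups requires an aggregation pass over the documents, whereas collStats never touches them. A rough pymongo sketch, assuming (for illustration only) that group membership lives in a "group._id" field:

from pymongo import MongoClient

coll = MongoClient()["fiftyone"]["samples.my_dataset"]  # same illustrative handle as above

# counting groups requires scanning documents and bucketing by group id
pipeline = [
    {"$group": {"_id": "$group._id"}},  # one bucket per group
    {"$count": "num_groups"},
]
num_groups = next(coll.aggregate(pipeline), {}).get("num_groups", 0)

# the metadata-based estimate, by contrast, can only ever report documents (samples)
num_samples_estimate = coll.estimated_document_count()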

Rebuttal

But the user can just do dataset.stats()["samples_count"] on their own if they want it to be fast; we don't need to change len(dataset)?

This is much less natural and requires knowledge of the internal MongoDB methods being used for each. Plus, there is no option to do this for print(dataset) / dataset.summary().

Willingness to contribute

The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?

brimoor commented 1 year ago

Using estimated count for summary() and hence print(dataset) definitely makes sense. It's just informational.

A concern with using an inexact count for dataset.count() and hence len(dataset) is that users may be using this count for operations that require exactness.

For example, if len(dataset) is not exact, users may run into unhappiness when using certain batch update patterns:

dataset.set_values("foo", ["bar"] * len(dataset))

Now, in a multiuser environment like Teams, one should not really be doing operations like that anyway. You should structure your operations based on immutable things like IDs so that the code is robust to coincident unrelated updates:

dataset.set_values("foo", {"id1": "bar", "id2": "spam", "id3": "eggs", ...}, key_field="id")
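For concreteness, one way to build such an ID-keyed mapping (compute_foo() is a hypothetical helper standing in for your own logic):

def compute_foo(sample_id):
    # hypothetical per-sample computation
    return "bar"

ids = dataset.values("id")
dataset.set_values("foo", {_id: compute_foo(_id) for _id in ids}, key_field="id")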

But, in OSS, since it's a local dataset and you're generally the only user (is database sharing the new Netflix password sharing? 🤣), it would be reasonable to expect that things like this should work:

import fiftyone as fo

# lots of samples
samples = ...

dataset = fo.Dataset()
dataset.add_samples(samples)

# oops I forgot a field. Ah I know, I'll add it efficiently via batching!
values = [... for _ in range(len(dataset))]
dataset.set_values("foo", values)
swheaton commented 1 year ago

Yeah, that makes sense. It's more exact than the name "estimate" makes it seem, but to rely on the length for exact operations we will unfortunately have to keep using the slow way.

This was how I originally addressed this point:

One potential mitigation to the unlikely cases referenced above is to first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count(), because it is unlikely to have a big performance impact.

But users could still write a script that uses len() on a dataset of size N+1, and it may or may not work as expected.

I will submit a PR that uses stats() for summary() then, but NOT for count() or __len__().
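Roughly, the split would look like this (illustrative only, not the actual implementation):

# summary() / print(dataset): informational, so the fast estimate is acceptable
num_samples_for_summary = dataset.stats()["samples_count"]

# count() / len(dataset): must stay exact, so keep the full count
num_samples_exact = dataset.count()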

shortcipher3 commented 1 year ago

I liked the suggestion to: "first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count()"

Following up on the documentation @swheaton shared (emphasis mine):

"The count is based on the collection's metadata, which provides a fast but sometimes inaccurate count for sharded clusters."

For a local dataset, I wouldn't expect sharded clusters to be in use, so I expect this to be a non-issue.

Unless I'm being naive, here is another possible workaround: use db.printShardingStatus or sh.status to find the sharding status of the MongoDB deployment, maybe at FiftyOne load time, and then use that to determine whether the more accurate dataset.count() should be used. Maybe other operations will also be impacted by sharding.
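For example, one driver-level way to detect this (a sketch assuming a recent MongoDB server; older servers use the isMaster command instead of hello) is to check whether the client is connected to a mongos router:

from pymongo import MongoClient

client = MongoClient()  # assumption: whatever connection FiftyOne already holds

# When connected to a mongos router (i.e., a sharded cluster), the "hello"
# command reports msg == "isdbgrid"
hello = client.admin.command("hello")
is_sharded_deployment = hello.get("msg") == "isdbgrid"

# hypothetical policy: only force exact counts when sharding could skew the estimate
use_exact_count = is_sharded_deployment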

Curious if you have lots of customers using sharding?

shortcipher3 commented 1 year ago

I guess the unclean shutdown is still a possible issue though.