Open swheaton opened 1 year ago
Using estimated count for summary()
and hence print(dataset)
definitely makes sense. It's just informational.
A concern with using an inexact count for dataset.count()
and hence len(dataset)
is that users may be using this count for operations that require exactness.
For example, if len(dataset)
is not exact, users may run into unhappiness when using certain batch update patterns:
dataset.set_values("foo", ["bar"] * len(dataset))
Now, in a multiuser environment like Teams, one should not really be doing operations like that anyway. You should structure your operations based on immutable things like IDs so that the code is robust to coincident unrelated updates:
dataset.set_values("foo", {"id1": "bar", "id2": "spam", "id3": "eggs", ...}, key_field="id")
But, in OSS, since its a local dataset and you're generally the only user (is database sharing the new Netflix password sharing? 🤣) it would be reasonable to expect that things like this should work:
# lots of samples
samples = ...
dataset = fo.Dataset()
dataset.add_samples(samples)
# oops I forgot a field. Ah I know, I'll add it efficiently via batching!
values = [... for _ in range(len(dataset))]
dataset.set_values("foo", values)
Yeah that makes sense. It's more exact than the name "estimate" makes it seem, but yeah to rely on length, we will have to keep it using the slow way unfortunately.
This was my address to this point originally
One potential mitigation to the above referenced unlikely cases is to first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count() because it is unlikely to have a big performance impact.
But users could still write a script that uses len on a dataset size N+1 and it may or may not work as expected.
I will submit a PR that uses stats()
for summary()
then but NOT count()
or __len__
I liked the suggestion to to: "first do a stats() call and if the num samples is less than some N, do a more complete Dataset.count()
"
Following up on the documentation @swheaton shared (emphasis mine):
"The count is based on the collection's metadata, which provides a fast but sometimes inaccurate count for sharded clusters."
For a local dataset - I wouldn't expect it to be using sharded clusters - so I expect this to be a non-issue.
If I'm being naive, here is another possible workaround - db.printShardingStatus or sh.status to find the sharding status of the mongodb
maybe at fiftyone
load time. Then use that to determine if the more accurate dataset.count()
should be used. Maybe other operations will also be impacted by sharding.
Curious if you have lots of customers using sharding?
I guess the unclean shutdown is still a possible issue though.
Proposal Summary
Proposing we use
collStats
instead ofcount
aggregation forDataset.__len__()
andDataset.summary()
in some or all cases. This approach can only be used for dataset length because it's not possible with any view / filter applied.Motivation
Dataset.count()
is extremely slow for datasets with large number of samples, thereforeprint(dataset)
is also slowWhat areas of FiftyOne does this feature affect?
fiftyone
Python libraryDetails
Dataset.stats()
usescollStats
, equivalent toestimated_document_count
. According to MongoDB documentation, this is considered to be an estimated count but is usually quite accurate. Emphasis mineSo with typical MongoDB settings and ignoring unclean shutdown situations, getting the length of a dataset in this way should be accurate to within 60s. Given the high read-to-write ratio of most FiftyOne users, this is probably accurate enough in almost all cases.
One potential mitigation to the above referenced unlikely cases is to first do a stats() call and if the num samples is less than some N, do a more complete
Dataset.count()
because it is unlikely to have a big performance impact.Potential issue
One issue with this is for group datasets. Currently length is determined as number of groups not number of samples.
collStats
by its nature can only determine number of samples as it has no concept of FiftyOne groups, only documents.Options:
Rebuttal
This is much less natural and requires knowledge of the internal mongodb methods being used for each. Plus there is no option to do this for
print(dataset)
/dataset.summary()
.Willingness to contribute
The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?