Closed minhtuev closed 2 months ago
The recent updates to the FiftyOne library enhance the `Dataset` and `Collection` classes by introducing new parameters in their respective methods, allowing users to include index-related statistics. Minor formatting adjustments were also made to the `DatasetNotFoundError`. These changes enrich the functionality of the library, providing users with more detailed insights into datasets and indexes.
| Files | Change Summary |
|---|---|
| `fiftyone/core/dataset.py` | Modified `DatasetNotFoundError` for minor formatting; updated `stats` method to include `include_indexes` parameter, enhancing statistical output options. |
| `fiftyone/core/collections.py` | Updated `stats` and `get_index_information` methods to include `include_indexes` and `include_size` parameters, respectively, for richer index-related insights. |
| `tests/unittests/dataset_tests.py` | Added `test_index_sizes` method to validate indexing functionality, ensuring proper index creation and size retrieval. |
In the garden where datasets bloom,
A new feature brings joy, dispelling the gloom.
With indexes counted, insights to find,
Stats now more rich, oh, how they unwind!
Hop along, friends, to the data delight,
For every new change is a reason to write! 🐇✨
Examples:
@benjaminpkane : probably I should do some conversions for the index sizes to MB
In [1]: import fiftyone as fo
...: import fiftyone.zoo as foz
In [2]: dataset = fo.load_dataset("oi-v7full-10mm")
In [3]: dataset.stats(include_indexes=True)
Out[3]:
{'samples_count': 10000000,
'samples_bytes': 156539929160,
'samples_size': '145.8GB',
'nindexes': 3,
'totalIndexSize': 474648576,
'indexSizes': {'_id_': 130613248,
'filepath_1': 199389184,
'detections.detections.str1_1': 144646144},
'total_bytes': 156539929160,
'total_size': '145.8GB'}
In [1]: import fiftyone as fo
...: import fiftyone.zoo as foz
In [2]: dataset = fo.load_dataset("oi-v7full-10mm")
In [3]: dataset.get_index_information(include_size=True)
Out[3]:
{'id': {'v': 2, 'key': [('_id', 1)], 'size': 130613248},
'filepath': {'v': 2, 'key': [('filepath', 1)], 'size': 199389184},
'detections.detections.str1': {'v': 2,
'key': [('detections.detections.str1', 1)],
'size': 144646144}}
Converted from raw bytes to human-readable byte strings
In [3]: dataset.stats(include_indexes=True)
Out[3]:
{'samples_count': 10000000,
'samples_bytes': 156539929160,
'samples_size': '145.8GB',
'nindexes': 3,
'totalIndexSize': 474648576,
'indexSizes': {'_id_': '124.6MB',
'filepath_1': '190.2MB',
'detections.detections.str1_1': '137.9MB'},
'total_bytes': 156539929160,
'total_size': '145.8GB'}
In [4]: dataset.get_index_information(include_size=True)
Out[4]:
{'id': {'v': 2, 'key': [('_id', 1)], 'size': '124.6MB'},
'filepath': {'v': 2, 'key': [('filepath', 1)], 'size': '190.2MB'},
'detections.detections.str1': {'v': 2,
'key': [('detections.detections.str1', 1)],
'size': '137.9MB'}}
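The byte-to-string conversion visible in the outputs above can be sketched as follows. This is a minimal standalone illustration, not the code from this PR: the helper name `to_human_bytes` and its `decimals` parameter are assumptions, and it assumes binary units (1 MB = 1024 * 1024 bytes), which is consistent with the numbers shown (130613248 bytes becoming '124.6MB').

```python
def to_human_bytes(num_bytes, decimals=1):
    """Format a raw byte count as a short human-readable string.

    Hypothetical helper for illustration only; assumes binary units
    (each unit is 1024x the previous one).
    """
    units = ("B", "KB", "MB", "GB", "TB", "PB")
    value = float(num_bytes)
    idx = 0
    # Divide by 1024 until the value fits the current unit
    while value >= 1024 and idx < len(units) - 1:
        value /= 1024
        idx += 1
    return f"{value:.{decimals}f}{units[idx]}"


# Raw index sizes taken from the first example output
index_bytes = {
    "_id_": 130613248,
    "filepath_1": 199389184,
    "detections.detections.str1_1": 144646144,
}

index_sizes = {name: to_human_bytes(n) for name, n in index_bytes.items()}
print(index_sizes)
# {'_id_': '124.6MB', 'filepath_1': '190.2MB',
#  'detections.detections.str1_1': '137.9MB'}
```

Keeping the raw byte counts alongside the formatted strings is useful because the strings are rounded and cannot be summed or compared reliably.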
@brimoor : makes sense, done ✅
In [1]: import fiftyone as fo
...: import fiftyone.zoo as foz
In [2]: dataset = fo.load_dataset("oi-v7full-10mm")
In [3]: dataset.stats(include_indexes=True)
Out[3]:
{'samples_count': 10000000,
'samples_bytes': 156539929160,
'samples_size': '145.8GB',
'num_indexes': 3,
'indexes_bytes': 474648576,
'indexes_sizes': '452.7MB',
'index_bytes': {'_id_': 130613248,
'filepath_1': 199389184,
'detections.detections.str1_1': 144646144},
'index_sizes': {'_id_': '124.6MB',
'filepath_1': '190.2MB',
'detections.detections.str1_1': '137.9MB'},
'total_bytes': 156539929160,
'total_size': '145.8GB'}
In [4]: dataset.get_index_information(include_size=True)
Out[4]:
{'id': {'v': 2, 'key': [('_id', 1)], 'size': '124.6MB', 'bytes': 130613248},
'filepath': {'v': 2,
'key': [('filepath', 1)],
'size': '190.2MB',
'bytes': 199389184},
'detections.detections.str1': {'v': 2,
'key': [('detections.detections.str1', 1)],
'size': '137.9MB',
'bytes': 144646144}}
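As a rough sketch of how the final output layout might be consumed, the snippet below copies the index-related fields from the `stats(include_indexes=True)` output above into a plain dict and computes each index's share of the total index size. The helper `index_shares` is hypothetical and not part of FiftyOne; it only assumes the `indexes_bytes` and `index_bytes` keys shown in the example.

```python
# Index-related fields copied from the example output above
index_stats = {
    "num_indexes": 3,
    "indexes_bytes": 474648576,
    "index_bytes": {
        "_id_": 130613248,
        "filepath_1": 199389184,
        "detections.detections.str1_1": 144646144,
    },
}


def index_shares(stats):
    """Return each index's fraction of the total index size.

    Hypothetical helper for illustration; expects the 'indexes_bytes'
    and 'index_bytes' keys from the example output.
    """
    total = stats["indexes_bytes"]
    return {name: nbytes / total for name, nbytes in stats["index_bytes"].items()}


# In this example the per-index byte counts add up to the reported total
assert sum(index_stats["index_bytes"].values()) == index_stats["indexes_bytes"]

shares = index_shares(index_stats)
# List indexes from largest to smallest share
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

Working from the raw `index_bytes` values rather than the formatted `index_sizes` strings avoids rounding error in the arithmetic.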
What changes are proposed in this pull request?
Added an argument `include_indexes` so that we can return index statistics.
How is this patch tested? If it is not, please explain why.
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
Users can now view index statistics using `dataset.stats(include_indexes=True)`.
What areas of FiftyOne does this PR affect?
fiftyone Python library changes
Summary by CodeRabbit
The `stats` method now includes an optional parameter to include index statistics, enhancing insights into the dataset structure.
The `get_index_information` method has been updated with a new parameter to optionally include index sizes, providing more comprehensive index information.