Closed: hitenvidhani closed this 1 year ago
I tested this out and it seems reasonably fast (~0.1s per dataset) and accurate! While this will slow down our script by ~3 seconds for 25 datasets, I think that's ok.
+1 to Chenyang's point about API handling (e.g. we should have default behavior if there are exceptions, such as logging the error and returning NaN as the size).
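For example, something like the rough sketch below (not the PR's actual code; `dataset_size_mb_or_nan` is a hypothetical wrapper name, and both the `query` helper's signature and which byte count to report are assumptions on my part):

```python
import logging
import math

from prompt2model.utils.dataset_utils import query  # helper referenced in this thread


def dataset_size_mb_or_nan(dataset_name: str) -> float:
    """Return the dataset's size in MB, or NaN if the size lookup fails."""
    try:
        response = query(dataset_name)
        # Which byte count to report (memory vs. parquet vs. original files) is an assumption.
        return response["size"]["dataset"]["num_bytes_memory"] / 2**20
    except Exception as error:
        logging.error("Could not fetch size for %s: %s", dataset_name, error)
        return math.nan
```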
Can you also add a test case for this? I can add the test in a separate PR if you're not sure how to go about doing this in our repo. I'd suggest you just test get_dataset_size by mocking the execution of prompt2model.utils.dataset_utils.query using unittest.mock.patch.
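Something along these lines (a minimal sketch; the mocked response shape is copied from the output further down in this thread, and the exact signature and return value of get_dataset_size are assumptions):

```python
from unittest.mock import patch

from prompt2model.utils import dataset_utils


@patch("prompt2model.utils.dataset_utils.query")
def test_get_dataset_size(mock_query):
    # Fake the datasets-server response instead of hitting the network.
    mock_query.return_value = {
        "size": {"dataset": {"num_bytes_memory": 80443172}},
        "pending": [],
        "failed": [],
        "partial": False,
    }
    size = dataset_utils.get_dataset_size("yulongmannlp/dev_para")
    mock_query.assert_called_once()
    assert size is not None
```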
@hitenvidhani Actually, it seems that something is not working correctly when I built and tested this locally. After starting dataset retrieval, I see:
Here are the datasets I've retrieved for you:
# Name Size[MB] Description
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_para', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 556068, 'num_bytes_memory': 1097064, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
1): yulongmannlp/dev_para 76.72 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_orig', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 553991, 'num_bytes_memory': 1095012, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
2): yulongmannlp/dev_orig 76.71 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Thank you @viswavi, I have pushed the requested changes. Sorry about that print statement; it was added for debugging.
Hey @zhaochenyang20 , thanks for the suggestion!
One counter-suggestion: because @hitenvidhani is a first-time contributor and we've already gone back and forth on this PR several times, maybe we can merge the PR for now and do a follow-up PR for unit tests? Unit tests are important, but it's also important that we welcome new contributors (and we do!), so we can make the process a little simpler this time.
In that case, I'll merge it and we can add the unit tests ourselves. 🤔
Description
Show the dataset size on the CLI by fetching it at runtime.
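The size dumps in the discussion above match the Hugging Face datasets-server size endpoint, so the runtime fetch presumably looks roughly like the sketch below (fetch_dataset_size_mb is a hypothetical name, and which byte count the PR actually reports is inferred from the numbers in the thread, not confirmed):

```python
import requests


def fetch_dataset_size_mb(dataset_name: str) -> float:
    """Query the HF datasets-server size API and convert the byte count to MB."""
    url = f"https://datasets-server.huggingface.co/size?dataset={dataset_name}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    num_bytes = response.json()["size"]["dataset"]["num_bytes_memory"]
    return num_bytes / 2**20


print(round(fetch_dataset_size_mb("yulongmannlp/dev_para"), 2))  # ~76.72 per the thread output
```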
References
https://github.com/neulab/prompt2model/issues/253
Blocked by
NA