Closed: hitenvidhani closed this 1 year ago
I tested this out and it seems reasonably fast (~0.1s per dataset) and accurate! While this will slow down our script by ~3 seconds for 25 datasets, I think that's ok.
+1 to Chenyang's point about API handling (e.g. we should have default behavior if there are exceptions, such as logging the error and returning NaN as the size).
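For example, something like the rough sketch below (not the PR's actual code; `dataset_size_mb_or_nan` is a hypothetical wrapper name, and both the `query` helper's signature and which byte count to report are assumptions on my part):

```python
import logging
import math

from prompt2model.utils.dataset_utils import query  # helper referenced in this thread


def dataset_size_mb_or_nan(dataset_name: str) -> float:
    """Return the dataset's size in MB, or NaN if the size lookup fails."""
    try:
        response = query(dataset_name)
        # Which byte count to report (memory vs. parquet vs. original files) is an assumption.
        return response["size"]["dataset"]["num_bytes_memory"] / 2**20
    except Exception as error:
        logging.error("Could not fetch size for %s: %s", dataset_name, error)
        return math.nan
```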
Can you also add a test case for this? I can add the test in a separate PR if you're not sure how to go about doing this in our repo. I'd suggest you just test get_dataset_size by mocking the execution of prompt2model.utils.dataset_utils.query using unittest.mock.patch.
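Something along these lines (a minimal sketch; the mocked response shape is copied from the output further down in this thread, and the exact signature and return value of get_dataset_size are assumptions):

```python
from unittest.mock import patch

from prompt2model.utils import dataset_utils


@patch("prompt2model.utils.dataset_utils.query")
def test_get_dataset_size(mock_query):
    # Fake the datasets-server response instead of hitting the network.
    mock_query.return_value = {
        "size": {"dataset": {"num_bytes_memory": 80443172}},
        "pending": [],
        "failed": [],
        "partial": False,
    }
    size = dataset_utils.get_dataset_size("yulongmannlp/dev_para")
    mock_query.assert_called_once()
    assert size is not None
```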
@hitenvidhani Actually, it seems that something is not working correctly when I built and tested this locally. After starting dataset retrieval, I see:
Here are the datasets I've retrieved for you:
# Name Size[MB] Description
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_para', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 556068, 'num_bytes_memory': 1097064, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
1): yulongmannlp/dev_para 76.72 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_orig', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 553991, 'num_bytes_memory': 1095012, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
2): yulongmannlp/dev_orig 76.71 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Thank you @viswavi, I have pushed the requested changes. Sorry about that print statement; it was added for debugging.
Hey @zhaochenyang20 , thanks for the suggestion!
One counter-suggestion: because @hitenvidhani is a first-time contributor and we've already gone back and forth on this PR several times, maybe we can merge the PR for now and do a follow-up PR for unit tests? Unit tests are important, but it's also important that we welcome new contributors (and we do!), so we can make the process a little simpler this time.
In that case, I'll merge it and we can add the unit tests ourselves. 🤔
Description
Show the dataset size on the CLI by fetching it at runtime.
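The size dumps in the discussion above match the Hugging Face datasets-server size endpoint, so the runtime fetch presumably looks roughly like the sketch below (fetch_dataset_size_mb is a hypothetical name, and which byte count the PR actually reports is inferred from the numbers in the thread, not confirmed):

```python
import requests


def fetch_dataset_size_mb(dataset_name: str) -> float:
    """Query the HF datasets-server size API and convert the byte count to MB."""
    url = f"https://datasets-server.huggingface.co/size?dataset={dataset_name}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    num_bytes = response.json()["size"]["dataset"]["num_bytes_memory"]
    return num_bytes / 2**20


print(round(fetch_dataset_size_mb("yulongmannlp/dev_para"), 2))  # ~76.72 per the thread output
```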
References
https://github.com/neulab/prompt2model/issues/253
Blocked by
NA