Closed deepyaman closed 1 year ago
I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?
I feel like "dataset" has been adopted by a lot of major players. A few top hits off a search for data set:
Definitions often list both forms:
Here's some data (biased towards books, FWIW): https://books.google.com/ngrams/graph?content=data+set%2Cdataset&year_start=1800&year_end=2019&corpus=26&smoothing=3&case_insensitive=true
Some more, driven by the all-knowing Twitterverse: https://twitter.com/randal_olson/status/824702008007557121?lang=en
However, a consideration against switching to "dataset" is that it messes with all of the classes. You could deprecate the two-word form now and remove it in 0.17, but I'm sure it'll cause some pain. Granted, that pain would be incurred if you ever wanted to make that switch. For me, I'm happy as long as it's (mostly) consistent (i.e. OK with the catalog.datasets and BlahDataSet mismatch, since I can justify it in my head above).
Yep @lorenabalan @deepyaman It is indeed a dataset :) if you follow our docs style guide.
https://github.com/quantumblacklabs/private-kedro/blob/master/docs/README.md#kedro-lexicon
Use dataset (not data set, or data-set) for a generic dataset.
Use capitalised DataSet when talking about a specific Kedro dataset class e.g. CSVDataSet.
Use data catalog for a generic data catalog.
Use Data Catalog to talk about the Kedro Data Catalog
I think we are pretty much consistent in the docs in using the single word, no hyphenation. I've no opinion of the choice used for classes and in code TBH since I don't believe the code and docs have to follow each other exactly. Consistency is key though.
I'm just following up, is it okay to close this issue?
:100: from me
@yetudada I don't think it's resolved in the code. At a minimum, can we change:
lambda_data_set.py -> lambda_dataset.py
memory_data_set.py -> memory_dataset.py
partitioned_data_set.py -> partitioned_dataset.py
test_lambda_data_set.py -> test_lambda_dataset.py
test_memory_data_set.py -> test_memory_dataset.py
This way, at least module naming is consistent/people don't have to question which form to use. I'm happy to do this, given a green light.
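For illustration only, the module renames proposed above could be planned with a small helper like the one below. This is a minimal sketch, not something from the Kedro codebase: the function names `one_word_module_name` and `plan_renames` are hypothetical, and the helper only computes the old/new name pairs without touching any files.

```python
from pathlib import PurePosixPath


def one_word_module_name(filename: str) -> str:
    """Collapse "data_set" to "dataset" in the file name only (hypothetical helper)."""
    path = PurePosixPath(filename)
    # Only the final path component is rewritten; parent directories are left alone.
    return str(path.with_name(path.name.replace("data_set", "dataset")))


def plan_renames(filenames: list[str]) -> list[tuple[str, str]]:
    """Return (old, new) pairs for every file whose name would change."""
    plan = []
    for old in filenames:
        new = one_word_module_name(old)
        if new != old:
            plan.append((old, new))
    return plan


if __name__ == "__main__":
    modules = [
        "lambda_data_set.py",
        "memory_data_set.py",
        "partitioned_data_set.py",
        "cached_dataset.py",  # already one word, so it is skipped
    ]
    for old, new in plan_renames(modules):
        print(f"{old} -> {new}")
```

Each pair could then be fed to `git mv`, and any test that imports a module by its old name would need the same substitution.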
I assume we're not good to change SparkDataSet -> SparkDataset, etc., through the codebase lol? I think it would eventually align with the lexicon that way, but it would need to be deprecated in this major release and removed in the next one.
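A rough sketch of what "deprecate now, remove later" could look like for a class rename follows. It assumes a simple subclass-based alias that warns on instantiation; the `SparkDataset` below is a stand-in defined for the example, not the real Kedro class, and the warning text is made up.

```python
import warnings


class SparkDataset:
    """Stand-in for a dataset class under the new one-word name (not the real Kedro class)."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath


class SparkDataSet(SparkDataset):
    """Deprecated two-word alias, kept so existing imports keep working for a release cycle."""

    def __init__(self, *args, **kwargs) -> None:
        # stacklevel=2 points the warning at the caller's line rather than this one.
        warnings.warn(
            "SparkDataSet has been renamed to SparkDataset; "
            "the old name will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Because the alias subclasses the new class, `isinstance(obj, SparkDataset)` still holds for objects created through the old name, so downstream checks would not need to change during the deprecation window.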
Hi @deepyaman, yes that sounds good. We're very happy to accept a PR from you for this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Spotted in kedro-org/kedro-starters#137:
We've decided to use "dataset" in prose, and FooDataset in class names in this issue.
This was partially achieved in https://github.com/quantumblacklabs/private-kedro/pull/1211 (private), and then a series of other PRs recently, including gh-2500, gh-2673, gh-2724, gh-2735, mostly by @deepyaman.
@noklam collected some context in gh-2740, let's continue the conversation there.
Description
I'm always frustrated when I want to create a new data set... or is it dataset?
Context
In both cases, it bothers/confuses me to see inconsistencies. Maybe it's just me.
List of Inconsistencies
SomeDataSet (data set as two words), but the module name is almost always some_dataset.py (dataset as one word).
some_data_set.py.
cached_dataset.py...
partitioned_data_set.py and test_partitioned_dataset.py).
catalog.datasets is a single word, as it could be argued that it's done for convenient reasons. Similarly, extras.datasets makes sense as a single word, since it's more aligned with Python package naming convention.
Much of this is trivial to address, although you do need to update some tests that rely on module name (e.g. test_partitioned_dataset.py).