Library should use either "data set" or "dataset" consistently

deepyaman commented 4 years ago

Description

I'm always frustrated when I want to create a new data set... or is it dataset?

Context

As a contributor, I do my best to align to the prevailing style/convention.
As a user, I try to mimic Kedro convention when I create my own datasets.

In both cases, it bothers/confuses me to see inconsistencies. Maybe it's just me. 🤷‍♂️

List of Inconsistencies

All datasets are named SomeDataSet (data set as two words), but the module name is almost always some_dataset.py (dataset as one word).
- Notable exceptions include the kedro.io module, where everything is some_data_set.py.
- Except cached_dataset.py...
- Tests for this module are another thing altogether: half use the split form and the other half the combined form, not even aligned with the module being tested (e.g. partitioned_data_set.py and test_partitioned_dataset.py).
I'm (personally) less concerned by the fact that catalog.datasets is a single word, as it could be argued that it's done for convenience. Similarly, extras.datasets makes sense as a single word, since it's more aligned with Python package naming conventions.
I'm sure there are inconsistencies in the docs here, too, but I'm less concerned than for the actual codebase.

Much of this is trivial to address, although you do need to update some tests that rely on module name (e.g. test_partitioned_dataset.py).
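
For anyone who wants to see how widespread the module-name mismatch is, here is a minimal sketch that lists every module file using the split form and the name it would get in the combined form. The function name `propose_renames` is hypothetical, and it only proposes renames; it does not touch imports or tests that reference the old module names.

```python
from pathlib import Path


def propose_renames(root):
    """Return (old_path, new_path) pairs for .py files named with 'data_set'.

    Only the file name is rewritten; directories are left untouched.
    """
    renames = []
    for path in sorted(Path(root).rglob("*.py")):
        if "data_set" in path.name:
            new_name = path.name.replace("data_set", "dataset")
            renames.append((path, path.with_name(new_name)))
    return renames
```

Calling `propose_renames("kedro/io")` from a checkout would return the candidate pairs, which can be printed and reviewed before renaming anything.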

lorenabalan commented 4 years ago

I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?

deepyaman commented 4 years ago

> I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?

I feel like "dataset" has been adopted by a lot of major players. A few top hits off a search for data set:

Datasets - https://catalog.data.gov/
Engage with Dataset Tasks - https://www.kaggle.com/datasets
Dataset Search - https://datasetsearch.research.google.com/

Definitions often list both forms:

Here's some data (biased towards books, FWIW): https://books.google.com/ngrams/graph?content=data+set%2Cdataset&year_start=1800&year_end=2019&corpus=26&smoothing=3&case_insensitive=true

Some more, driven by the all-knowing Twitterverse: https://twitter.com/randal_olson/status/824702008007557121?lang=en

However, a consideration against switching to "dataset" is that it messes with all of the classes. You could deprecate the two-word form now and remove it in 0.17, but I'm sure it'll cause some pain. Granted, that pain would be incurred if you ever wanted to make that switch. For me, I'm happy as long as it's (mostly) consistent (i.e. OK with the catalog.datasets and BlahDataSet mismatch, since I can justify it in my head above).

stichbury commented 4 years ago

Yep @lorenabalan @deepyaman It is indeed a dataset :) if you follow our docs style guide.

https://github.com/quantumblacklabs/private-kedro/blob/master/docs/README.md#kedro-lexicon

Use dataset (not data set, or data-set) for a generic dataset.
Use capitalised DataSet when talking about a specific Kedro dataset class e.g. CSVDataSet.
Use data catalog for a generic data catalog.
Use Data Catalog to talk about the Kedro Data Catalog.

I think we are pretty much consistent in the docs in using the single word, no hyphenation. I've no opinion of the choice used for classes and in code TBH since I don't believe the code and docs have to follow each other exactly. Consistency is key though.
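
If it helps to check the prose rule automatically, a rough sketch of a checker might look like the following. It flags "data set" and "data-set" (in any letter case) but leaves "dataset" and class names such as CSVDataSet alone. The names `SPLIT_FORM` and `find_split_form` are made up for illustration and are not part of any Kedro tooling.

```python
import re

# Matches "data set" / "data-set" (singular or plural, any case), not "dataset".
SPLIT_FORM = re.compile(r"\bdata[ -]sets?\b", re.IGNORECASE)


def find_split_form(text):
    """Return (line_number, matched_text) pairs for each split-form occurrence."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SPLIT_FORM.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```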

yetudada commented 3 years ago

I'm just following up, is it okay to close this issue?

stichbury commented 3 years ago

:100: from me

deepyaman commented 3 years ago

> I'm just following up, is it okay to close this issue?

@yetudada I don't think it's resolved in the code. At a minimum, can we change:

lambda_data_set.py -> lambda_dataset.py
memory_data_set.py -> memory_dataset.py
partitioned_data_set.py -> partitioned_dataset.py
test_lambda_data_set.py -> test_lambda_dataset.py
test_memory_data_set.py -> test_memory_dataset.py

This way, at least module naming is consistent/people don't have to question which form to use. I'm happy to do this, given a green light.

I assume we're not good to change SparkDataSet -> SparkDataset, etc., through the codebase lol? I think it would eventually align with the lexicon that way, but it would need to be deprecated in this major release and removed in the next one.

merelcht commented 3 years ago

Hi @deepyaman, yes that sounds good. We're very happy to accept a PR from you for this 😄

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

astrojuanlu commented 1 year ago

Spotted in kedro-org/kedro-starters#137:

https://github.com/kedro-org/kedro/blob/fd8162d5bf384ef666c01ef2c529d01fd9fa8354/kedro/io/core.py#L430

astrojuanlu commented 1 year ago

We've decided to use "dataset" in prose, and FooDataset in class names in this issue.

This was partially achieved in https://github.com/quantumblacklabs/private-kedro/pull/1211 (private), and then a series of other PRs recently, including gh-2500, gh-2673, gh-2724, gh-2735, mostly by @deepyaman.

@noklam collected some context in gh-2740, let's continue the conversation there.
