kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Library should use either "data set" or "dataset" consistently #533

Closed deepyaman closed 1 year ago

deepyaman commented 4 years ago

Description

I'm always frustrated when I want to create a new data set... or is it dataset?

Context

In both cases, it bothers/confuses me to see inconsistencies. Maybe it's just me. 🤷‍♂️

List of Inconsistencies

Much of this is trivial to address, although you do need to update some tests that rely on module name (e.g. test_partitioned_dataset.py).

lorenabalan commented 4 years ago

I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?

deepyaman commented 4 years ago

> I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?

I feel like "dataset" has been adopted by a lot of major players. A few top hits off a search for data set:

Definitions often list both forms:

Here's some data (biased towards books, FWIW): https://books.google.com/ngrams/graph?content=data+set%2Cdataset&year_start=1800&year_end=2019&corpus=26&smoothing=3&case_insensitive=true

Some more, driven by the all-knowing Twitterverse: https://twitter.com/randal_olson/status/824702008007557121?lang=en

However, one consideration against switching to "dataset" is that it affects the name of every dataset class. You could deprecate the two-word form now and remove it in 0.17, but I'm sure it'll cause some pain. Granted, that pain would be incurred whenever you made that switch. For me, I'm happy as long as it's (mostly) consistent (i.e. I'm OK with the catalog.datasets and BlahDataSet mismatch, since I can justify it in my head above).
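To make the "deprecate now, remove later" idea concrete, here is a minimal sketch of one way a two-word class name could be kept as a warning alias for a one-word class. It is not Kedro's actual implementation: `CSVDataset` is a stand-in class defined only for this example, `deprecated_alias` is a hypothetical helper, and the "0.17" removal version is simply the one floated in this comment.

```python
import warnings


class CSVDataset:
    """Stand-in for a dataset class under its new one-word name (illustrative only)."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath


def deprecated_alias(new_cls: type, old_name: str, removed_in: str) -> type:
    """Hypothetical helper: build a subclass of new_cls, named old_name,
    that emits a DeprecationWarning each time it is instantiated."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated and will be removed in {removed_in}; "
            f"use {new_cls.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        new_cls.__init__(self, *args, **kwargs)

    return type(old_name, (new_cls,), {"__init__": __init__})


# Old spelling keeps working for now, but warns on use.
CSVDataSet = deprecated_alias(CSVDataset, "CSVDataSet", removed_in="0.17")
```

With this approach, existing code that instantiates `CSVDataSet` keeps running and still passes `isinstance` checks against `CSVDataset`, while new code that uses the one-word name sees no warning.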

stichbury commented 4 years ago

Yep @lorenabalan @deepyaman, it is indeed "dataset" :) if you follow our docs style guide.

https://github.com/quantumblacklabs/private-kedro/blob/master/docs/README.md#kedro-lexicon

Use dataset (not data set, or data-set) for a generic dataset.
Use capitalised DataSet when talking about a specific Kedro dataset class e.g. CSVDataSet.
Use data catalog for a generic data catalog.
Use Data Catalog to talk about the Kedro Data Catalog.

I think we are pretty much consistent in the docs in using the single word, no hyphenation. I've no opinion of the choice used for classes and in code TBH since I don't believe the code and docs have to follow each other exactly. Consistency is key though.
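The first lexicon rule above (one word, no space or hyphen) could be checked mechanically. The following is a minimal sketch under stated assumptions, not an existing Kedro tool: `DISCOURAGED` and `find_discouraged` are hypothetical names, and the pattern only targets the generic prose forms, so CamelCase class names such as CSVDataSet are left alone.

```python
import re

# Matches the two-word and hyphenated generic forms, case-insensitively.
# CamelCase class names (e.g. CSVDataSet) contain no space or hyphen,
# so they are not flagged.
DISCOURAGED = re.compile(r"\bdata[ -]sets?\b", re.IGNORECASE)


def find_discouraged(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_text) for each discouraged spelling."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in DISCOURAGED.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Running something like this over the docs sources would give a list of places to review by hand; it does not try to decide whether a given "DataSet" refers to a specific class.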

yetudada commented 3 years ago

I'm just following up, is it okay to close this issue?

stichbury commented 3 years ago

:100: from me

deepyaman commented 3 years ago

> I'm just following up, is it okay to close this issue?

@yetudada I don't think it's resolved in the code. At a minimum, can we change:

This way, at least module naming is consistent/people don't have to question which form to use. I'm happy to do this, given a green light.

I assume we're not good to change SparkDataSet -> SparkDataset, etc., throughout the codebase lol? I think it would eventually align with the lexicon that way, but it would need to be deprecated in this major release and removed in the next one.

merelcht commented 3 years ago

Hi @deepyaman, yes that sounds good. We're very happy to accept a PR from you for this 😄

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

astrojuanlu commented 1 year ago

Spotted in kedro-org/kedro-starters#137:

https://github.com/kedro-org/kedro/blob/fd8162d5bf384ef666c01ef2c529d01fd9fa8354/kedro/io/core.py#L430

astrojuanlu commented 1 year ago

In this issue, we've decided to use "dataset" in prose and FooDataset in class names.

This was partially achieved in https://github.com/quantumblacklabs/private-kedro/pull/1211 (private), and then in a series of more recent PRs, including gh-2500, gh-2673, gh-2724, gh-2735, mostly by @deepyaman.

@noklam collected some context in gh-2740; let's continue the conversation there.