memphis-iis / datawhys-content-notebooks-python

Content for DataWhys in the form of JupyterLab notebooks (.ipynb files)
Apache License 2.0

Terminology #24

Closed by aolney 2 years ago

aolney commented 4 years ago

Please use this issue to discuss pros/cons of terminology :)

I've created a OneDrive sheet where we can store the results of our discussions (i.e. terminology decisions).

aolney commented 4 years ago

Quoting @vasilerus

For instance, should we all call the test labels just that, test labels, or test gold labels to indicate that they are generated by experts, i.e., they are ground truth, as opposed to predicted labels. Not sure if Dale uses that terminology or not. Test_labels vs. test_gold_labels vs. test_groundtruth_labels vs. something else.

aolney commented 4 years ago

Quoting @nsahr

I know the terminology you used is familiar for data and computer science fields. But, in my experience, it is not used in statistics/biostatistics. Terminology would be actual [outcome] and predicted [outcome]. I’m fine naming these consistently and I am glad you brought it up now. I would just need clear definitions as the names of technical terminology are different between fields.

aolney commented 4 years ago

I prefer training labels or labeled training data, and similar for test. I see this as different from the stats terminology in the sense that predicted and outcome more generally apply to notions of accuracy, whereas in ML the idea of data being labeled has to do with the process that generated those labels. Since not all data is labeled, the term labeled is very useful/descriptive.

nsahr commented 4 years ago

That is true. But when we are talking about the statistical theory associated with these methods, the statistical terms would come up, or it wouldn’t be consistent with any available statistics literature. “Y” is generally called the response (or something similar) in most statistics theory. I think the names for the programming are okay, but I don’t think I would be able to avoid the statistics terms in my theory... it would just depend on how much theory you are looking for here (to me). Labeled and unlabeled would only be used in my terminology for classification problems. IMHO



aolney commented 4 years ago

@nsahr Maybe the solution here is to put forward both ML and statistics terminology and explicitly teach them how to map between the two? That said, we'd probably want to be consistent within each to simplify the terminology.

Going back to what @vasilerus said, I think the process of generation is key to the idea of "label" in ML. For example, if I have a bunch of experts create ordinal, interval, or ratio scale data, I'd still call that "labeled" in an ML context, even though it isn't a classification problem. Conversely, if the "true" data was measured somehow (e.g. EKG data), I wouldn't use labeled, but just train/test.
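To make the mapping concrete, here is a minimal sketch of how both vocabularies could be taught side by side on one supervised workflow. This is illustrative only (the variable names and the use of scikit-learn's iris dataset are assumptions, not anything decided in this thread); each comment pairs the ML term with the statistics term for the same object.

```python
# Hypothetical sketch: one workflow annotated with both ML and statistics terms.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ML: features and labels; statistics: predictors and response ("Y")
X, y = load_iris(return_X_y=True)

# ML: training data / test data; statistics: data used to fit vs. validate the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ML: predicted labels; statistics: predicted outcome (y-hat)
y_pred = model.predict(X_test)

# y_test plays the role of the "gold" / "ground truth" labels (ML)
# or the "actual outcome" (statistics)
accuracy = (y_pred == y_test).mean()
```

The point of the sketch is that the objects are identical across fields; only the names differ, so a table or comments like these could let learners translate between the two.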

nsahr commented 4 years ago

Any terminology is fine with me as long as definitions are clear (so I can use them in the correct framework). Sounds like the consensus would be to use the term “labeled” for anything that has a process to generate the variable?



ddbowman commented 4 years ago

I think it is important to use all the terminology, well defined as Tasha said, because in practice they will encounter many different terms for the same concept.

aolney commented 4 years ago

@ddbowman I strongly agree with you because otherwise they won't understand terms when they encounter them in the real world. However, it's not clear to me how fast we should try to broaden their terminology. For example, should we start with 1 or 2 terms and then slowly broaden?

aolney commented 2 years ago

Closing this as effectively done. If we decide to do some analysis of broadening terminology, I suggest we start a new issue for that.