[FEA] Label (or target) type expected by classification algos is inconsistent, especially in SVM's

rapidsai / cuml

cuML - RAPIDS Machine Learning Library

https://docs.rapids.ai/api/cuml/stable/

Apache License 2.0

4.17k stars 527 forks source link

[FEA] Label (or target) type expected by classification algos is inconsistent, especially in SVM's #2603

Closed teju85 closed 3 years ago

teju85 commented 4 years ago

This was pointed out to me by @cdeotte. SVC expects the targets to be float's, whereas RFClassifier expects it to be integers. We need to have SVC support both float/int based target arrays as inputs. Tagging @tfeher for his inputs in this matter.

tfeher commented 4 years ago

In the C++ layer SVC does expect float types for both the features and the target labels. It would be trivial to have a separate label type (in fact originally I had one but dropped it to simplify the cython wrappers).

From the Python side SVC accepts any numeric type as targets, it is converted to float under the hood. It is a documentation bug that claims that y should be float or double. I will fix the documentation bug in the next SVC PR.

teju85 commented 4 years ago

Thanks Tamas.

@cdeotte can you confirm if you faced this issue while running a code or you just noticed this in our docs?

Salonijain27 commented 4 years ago

@tfeher if the user passes y of dtype int will it increase the memory/time consumption of SVC? If yes, i think it would be nice to mention that in the docs as well.

cdeotte commented 4 years ago

I observed this behavior with code not reading docs.

Some Regressors can take both int and float targets. Some Regressors can only take float targets. Some Classifiers take int targets, some classifiers take float targets, some classifiers take both int and float targets. It would be better if they all can take float and int. Then someone can write a for loop which uses all of them without having to get the dtype correct each time.

Here is what i observed via code in cuML 0.14 regarding acceptable target values

kNN Classifier - float or int Random Forest Classifier - int only SVC - float or int Logistic Regression (classifier) - float only

Lasso Reg - float only Linear Reg - float only Ridge Reg - float only SVR - float or int kNN Reg - float or int SGD - float only

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

teju85 commented 3 years ago

@tfeher like we discussed today, the documentation on SVC is still not clear about the underlying type conversion (and slightly more memory usage + runtime). Please file a PR for this and then we can close this issue.

I'll file a separate issue to check the input label consistency across other algos that @cdeotte mentioned above.

tfeher commented 3 years ago

The convert_dtype arg description already explains that "When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method."

The extra O(n_rows) runtime that the dtype copy adds is negligible compared to the O(n_rows^2) - O(n_rows^3) runtime of the training, therefore I do not think it is worth mentioning.