Fix the usage of len for sparse matrix features in SVC

AMR-KELEG commented 6 years ago

Fixes #34

AMR-KELEG commented 6 years ago

@nok I checked the output of the following code, and it doesn't seem to be correct for sparse matrices.

from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn_porter import Porter

# CountVectorizer returns a scipy sparse matrix, which triggers the issue
cv = CountVectorizer()
l = ['Pattern 1', 'Pattern 2', 'Pattern 3']
X = cv.fit_transform(l)
y = [1, 2, 3]

clf = svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1/X.shape[1], kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
clf.fit(X, y)

# Export the fitted classifier to Java and save the generated source
porter = Porter(clf, language='java')
output = porter.export(embed_data=False, details=False)
with open('SVC.java', 'w') as f:
    f.write(output)

Output:

            double[][] vectors = {{  (0, 0) 1.0
  (0, 2)    1.0
  (0, 3)    1.0}, {  (0, 1) 1.0
  (0, 3)    1.0}, {  (0, 0) 1.0
  (0, 3)    1.0}};
            double[][] coefficients = {{  (0, 0)    0.6666666666666666
  (0, 1)    -0.6666666666666666
  (0, 2)    -1.0}, {  (0, 0)    1.0
  (0, 1)    1.0
  (0, 2)    -1.0}};

A trivial workaround is to convert the sparse matrix into a dense one with the todense function (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csc_matrix.todense.html) before exporting the classifier. However, this will affect the performance badly.

nok commented 6 years ago

Thanks for your contribution, @AMR-KELEG. What did you mean by 'performance badly'? The quality/integrity of the results, or the execution time? Your example is a good starting point for further tests.

I thought about decoding the compressed sparse row (CSR) format in the target programming language, because we shouldn't ignore the advantages of the compact data format. Furthermore, the basic CSR algorithm is quite simple.
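
For illustration only, here is a minimal sketch of what decoding CSR involves, written in plain Python rather than in a target language. The helper name csr_to_dense is hypothetical and not part of sklearn-porter. The three arrays follow the usual CSR layout (data, indices, indptr), their values mirror the support vectors printed in the output above, and the column count of 4 is an assumption.

```python
def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a list of dense rows (hypothetical helper)."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0.0] * n_cols
        # The non-zero entries of row r are data[indptr[r]:indptr[r + 1]],
        # and indices[k] gives the column of data[k].
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows


# Values taken from the (row, col) pairs in the output above;
# n_cols=4 is assumed for this example.
data = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
indices = [0, 2, 3, 1, 3, 0, 3]
indptr = [0, 3, 5, 7]

dense = csr_to_dense(data, indices, indptr, n_cols=4)
```

A ported classifier would not necessarily need the dense rows at all: the same two loops could accumulate a dot product directly, which keeps the memory advantage of the compact format.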

AMR-KELEG commented 6 years ago

I meant that the CSR format should be used, as you said. I also think that transpiling feature extraction functions such as tf (term frequency) should be considered.

nok / sklearn-porter

Fix the usage of len for sparse matrix features in SVC #36