Closed: AMR-KELEG closed this 2 years ago
@nok I have checked the output of the following code, and it doesn't seem to be correct for sparse matrices.
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn_porter import Porter

# CountVectorizer.fit_transform returns a sparse (CSR) matrix
cv = CountVectorizer()
l = ['Pattern 1', 'Pattern 2', 'Pattern 3']
X = cv.fit_transform(l)
y = [1, 2, 3]

clf = svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=3, gamma=1/X.shape[1],
              kernel='linear', max_iter=-1, probability=False,
              random_state=None, shrinking=True, tol=0.001, verbose=False)
clf.fit(X, y)

# Transpile the fitted classifier to Java
porter = Porter(clf, language='java')
output = porter.export(embed_data=False, details=False)
with open('SVC.java', 'w') as f:
    f.write(output)
Output:
double[][] vectors = {{ (0, 0) 1.0
(0, 2) 1.0
(0, 3) 1.0}, { (0, 1) 1.0
(0, 3) 1.0}, { (0, 0) 1.0
(0, 3) 1.0}};
double[][] coefficients = {{ (0, 0) 0.6666666666666666
(0, 1) -0.6666666666666666
(0, 2) -1.0}, { (0, 0) 1.0
(0, 1) 1.0
(0, 2) -1.0}};
A trivial solution is to convert the sparse matrix into a dense one using the todense function (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csc_matrix.todense.html) before exporting the classifier; however, this will badly affect performance.
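For reference, a rough sketch of that workaround (just an illustration, not a proposed fix: the only change is densifying the CountVectorizer output before fitting, so the fitted estimator holds plain arrays when it is exported; memory grows with the vocabulary size):

# Workaround sketch: densify the sparse features before fitting,
# so sklearn_porter only ever sees plain NumPy arrays.
X_dense = X.toarray()  # equivalent to np.asarray(X.todense())
clf.fit(X_dense, y)
porter = Porter(clf, language='java')
output = porter.export(embed_data=False, details=False)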
Thanks for your contribution, @AMR-KELEG. What did you mean by 'affect the performance badly': the quality/integrity of the results, or the execution time? Your example here is a good starting point for further tests.
I thought about decoding the compressed sparse row (CSR) format in the target programming language, because we shouldn't ignore the advantages of the compact data format. Furthermore, the basic CSR algorithm is quite simple.
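To illustrate how small that decoding step is, here is a minimal Python sketch of the data/indices/indptr walk (the same loop could be emitted in the generated Java; csr_to_dense is a made-up helper name, not part of sklearn-porter):

import numpy as np
from scipy.sparse import csr_matrix

def csr_to_dense(data, indices, indptr, n_cols):
    # Reconstruct dense rows from the three CSR arrays.
    n_rows = len(indptr) - 1
    dense = np.zeros((n_rows, n_cols))
    for row in range(n_rows):
        # indptr[row]:indptr[row + 1] delimits this row's non-zeros
        for k in range(indptr[row], indptr[row + 1]):
            dense[row, indices[k]] = data[k]
    return dense

# Round-trip check on a small matrix
m = csr_matrix(np.array([[1., 0., 2.], [0., 0., 3.]]))
print(csr_to_dense(m.data, m.indices, m.indptr, m.shape[1]))
# [[1. 0. 2.]
#  [0. 0. 3.]]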
I meant that the CSR format should be used, as you said. I also think that transpiling feature-extraction functions such as tf (term frequency) should be considered.
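To make that second point concrete, here is a small sketch (continuing from the reproduction snippet above, so cv and clf are already fitted) of the transform step a target-language port would have to replicate before the exported SVC can score a new string:

# At prediction time a raw string must be turned into the same
# term-count vector the classifier was trained on.
sample = 'Pattern 1'
counts = cv.transform([sample])       # sparse term-count vector
print(cv.vocabulary_)                 # token -> column index mapping to embed
print(counts.toarray())               # dense view of the feature vector
print(clf.predict(counts.toarray()))  # only then can the SVC score it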
Fixes #34