tbjohns / BlitzML

BSD 3-Clause "New" or "Revised" License
4 stars 5 forks source link

r-package #1

Open dselivanov opened 6 years ago

dselivanov commented 6 years ago

Hi Tyler. I've started to work on R wrapper here. There several challenges so far:

  1. I'm getting too many arguments provided to function-like macro here https://github.com/dselivanov/BlitzML/commit/master#diff-137530b9340672b840093302a73c2bf3L29. So I had to comment out assert statements.
  2. indices and pointers in sparse matrices in R are integers. In order to work with sparse matrices "in place" I had to re-define size_t and index_t to be uint32_t - https://github.com/dselivanov/BlitzML/commit/master#diff-137530b9340672b840093302a73c2bf3R21.

Let me know what do you think whether such adjustments can be merged to master.

Also I found a bit confusing how Parameters are defined and passed from python code. Mb it worth to have a structure for that? construction of such small object shouldn't hurt performance.

tbjohns commented 6 years ago

Hi Dmitriy, thanks for these contributions.

We have to make sure the types are also compatible with Python in place. I think Scipy sparse arrays use int32 indexing by default, so it’d be best to support both. We can template the indexing type for SparseColumn and SparseDataset. Let me know what your thoughts are.

The assert issue seems like a namespacing issue? Maybe we can rename it to "assert_with_error_message." If not, I don’t understand the issue.

I agree the parameters class is not straightforward. Feel free to modify the C class for your R package, and then you or I can modify the Python package as well.

Please continue giving me feedback if you see other things that should be improved. You have much more open source experience than I do. I’m very busy this week, but I’ll look at your R wrapper more closely next week. Thanks!

dselivanov commented 6 years ago

@tbjohns I've renamed assert and made definition of size_t inside #ifdef block - see here. When I build R package I define that indices should be 4 bytes and use uint32_t, so we can use R sparse matrices in place. For python it just builds with standard std::size_t.

I've tested both python and R versions and everything works fine. You can take a look to example in README for R package - now fully functional.

tbjohns commented 6 years ago

Sounds great! I won't have time to look at this for a few days, but I will soon.

tbjohns commented 6 years ago

@dselivanov, this R package is looking nice. I installed the package and ran your example without any issues. What are your intentions as far as when I should merge it?

I just added an important fix (1 line) for logistic regression. Your example will complete much faster after incorporating this.

For the cross validation setup, the lambda values should probably be sorted in decreasing order, so that the problem with the most regularization is solved first. This should make the warm-starting more effective -- at least that's the standard in academia for this particular problem.

Another thing is you might want to consider removing the use_working_sets argument to avoid possible confusion for the user. I mainly had this in my Python package so that I could compare to the non-working set algorithm for my research. AFAIK, there isn't ever any significant downside to using working sets.

dselivanov commented 6 years ago

Thanks for taking a look. I definitely want to send PR to merge it to the main repo. However as you noticed I've finished wrapper only for logistic regression on sparse matrices (for the problem at I had to solve at that point of time). Not sure that during next couple of weeks I will have time to finish porting the rest of functionality.

Thanks for the cross-validation tip - I also found that decreasing sequence of lambda works better (even fixed in local version, but haven't committed/pushed to github). Also I will remove working_sets option as you suggested.

So it is up to you - I can send PR now and update it later. Or I can send it when I will wrap more functionality.

tbjohns commented 6 years ago

Okay great. Let's wait on the PR until you feel it is more finalized. It looks good, though. I should add a similar multithreaded cross-val option for the Python package.

dselivanov commented 6 years ago

@tbjohns do I understand correctly that at the moment only logistic and squared loss functions are available for use in Python/R interface?

tbjohns commented 6 years ago

@dselivanov, the Python package should support each of the loss functions listed in the readme. I separated them into different problem classes. See https://tbjohns.github.io/BlitzML/api.html#module-blitzml (so for example blitzml.SparseHuberProblem). Are you seeing an issue somewhere?

dselivanov commented 6 years ago

I've implemented all problems in R pkg similarly as you've done in python pkg. However they doesn't seem to work. Mb I'm missing something. Will try python pkg and report back. Also I'm curious what is the reason you've implemented specialized C++ classes for squared loss (lasso) and logistic loss, but not for other - https://github.com/tbjohns/BlitzML/tree/master/include/blitzml/sparse_linear ?

tbjohns commented 6 years ago

Ok let me know if you find an issue.

I optimized logreg and Lasso quite a bit because they're the most commonly used. For logreg, a major bottleneck is frequent exp calls, so I tried to minimize those. For Lasso, the quadratic loss allows us to cut out some computation. Basically, the other problems get reduced to a sequence of Lasso problems, but since the Lasso loss already is quadratic, it's a little simpler.

I see the issue you created, but I'll have to look at it later.