rcmccartney / parallel_regression

Parallel regression for machine learning on 2-class classification problems
4 stars 0 forks source link

parallel_regression

PJ2 library

This program uses the PJ2 library for parallel programming, located at http://www.cs.rit.edu/~ark/pj2.shtml.

The paradigm runs on a multicore or multinode cluster for running Tasks and Jobs. This program is used for logistic regression on massive datasets. It can be used in both sequential and parallel verisons, and the parallel version can be set to batched training, based off of the following paper:

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. 2014. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14). ACM, New York, NY, USA, 661-670. URL:http://dl.acm.org/citation.cfm?id=2623330.2623612&coll=DL&dl=ACM&CFID=415399891&CFTOKEN=69514427

The URLDataStrategy class is a class specifically implemented for the large URL Reputation dataset, which is the data used for the results and archived at https://archive.ics.uci.edu/ml/datasets/URL+Reputation. This dataset is extremely sparse and stored in SVM light format, so the URLDataStrategy implements some unique ways of performing the dot product. See the following link to the paper for more information:

Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. 2009. Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, 681-688. URL:http://cseweb.ucsd.edu/~savage/papers/ICML09.pdf

The RegularDataStrategy is the class that should be used for most logistic regression datasets, where all the data is in a single file, there are only two classes, the class decision is any of the rows (set this on the command line), and the classes are ordered sequentially (either 0/1, 1/2, 2/3, etc).

Using batch training on a 10 node cluster with all 3231961 features and 2396130 instances, batch training achieved 94% accuracy on a 20% test set after only approximately 6 minutes of training. Stochaistic gradient descent on a single node achieved 97.4% accuracy in approximately 1 hour of training, using a parallelized for loop over updates to the weight arrray.

Cluster description: Frontend supervisory node — tardis.cs.rit.edu Dual-core AMD Opteron 1214 processor 1.0 GHz clock 2 GB main memory Ubuntu 12.04 LTS 64-bit Linux Ten backend computational nodes — dr00 through dr09, each with: Two AMD Opteron 2218 processors Two CPU cores per processor Four threads 2.6 GHz clock 8 GB main memory Ubuntu 12.04 LTS 64-bit Linux 1-Gbps switched Ethernet backend interconnection network Aggregate 40 threads, 80 GB main memory

To run the programs (note that Batched versions of both Sequential and Parallel regression can be run with a command-line argument):

****Sequential version****

Usage: java pj2 LogRegSeq

is a real-valued data file ##################### The following are additional options available: [seed=long] value to seed the prng, default is 1L [testFile=file] specify test set, default is [alpha=double] specify learning rate, default is 0.05 [sep=char] specify separator, default is whitespace [class=int] specify class location, default is last column in row [eps=double] specify convergence threshold, defaults to 0.000001 [steps=long] specify number of iterations instead of eps convergence [thr=double] specify threhold of class decision, default is 0.5 [lambda=double] specify regularization parameter, default is 1.0 [snap=int] snapshot interval, default is 0 [records=int] number of records to process in the data, default is all data [batched=int] number of batches to use, default is none [gamma=double] gamma used for batch cost, default is all 100 [biter=int] iterations to use on one batch, default is 5 [test=double] percent (between 0-1) of train examples to set aside for test, default is 0 ##################### ****************Parallel version******************** Usage: java pj2 LogRegPar is a real-valued data file ##################### The following are additional options available: [chunked=boolean] to specify the data was chunked into sub-files on cluster nodes, default is true [seed=long] value to seed the prng, default is 1L [testFile=file] specify test set, default is [alpha=double] specify learning rate, default is 0.05 [sep=char] specify separator, default is whitespace [class=int] specify class location, default is last column in row [eps=double] specify convergence threshold, defaults to 0.000001 [steps=long] specify number of iterations instead of eps convergence [thr=double] specify threhold of class decision, default is 0.5 [lambda=double] specify regularization parameter, default is 1.0 [snap=int] snapshot interval, default is 0 [records=int] number of records to process in the data, default is all data [batched=int] number of batches to use, default is none [gamma=double] gamma used for batch cost, default is 100 [biter=int] iterations to use on one batch, default is 5 [test=double] percent (between 0-1) of train examples to set aside for test, default is 0 ##################### ****************SGD Sequential version******************** Usage: java LogRegSGD is a real-valued data file ##################### The following are additional options available: [seed=long] value to seed the prng, default is 1L [testFile=file] specify test set, default is [alpha=double] specify learning rate, default is 0.05 [sep=char] specify separator, default is whitespace [class=int] specify class location, default is last column in row [eps=double] specify convergence threshold, defaults to 0.000001 [steps=long] specify number of iterations instead of eps convergence [thr=double] specify threhold of class decision, default is 0.5 [snap=int] snapshot interval, default is 0 [records=int] number of records to process in the data, default is all data [converge=int] number of steps before checking convergence, default is 500 [test=double] percent (between 0-1) of train examples to set aside for test, default is 0 ##################### ****************SGD Parallel version******************** Usage: java LogRegParSGD is a real-valued data file ##################### The following are additional options available: [seed=long] value to seed the prng, default is 1L [testFile=file] specify test set, default is [alpha=double] specify learning rate, default is 0.05 [sep=char] specify separator, default is whitespace [class=int] specify class location, default is last column in row [eps=double] specify convergence threshold, defaults to 0.000001 [steps=long] specify number of iterations instead of eps convergence [thr=double] specify threhold of class decision, default is 0.5 [snap=int] snapshot interval, default is 0 [records=int] number of records to process in the data, default is all data [converge=int] number of steps before checking convergence, default is 500 [test=double] percent (between 0-1) of train examples to set aside for test, default is 0 #####################