Reproduce PSVM performance results

pgrew commented 7 years ago

Use PSVM to classify an ML data set. PSVM only supports binary classification, so either the iris data set must be reduced to two classes or another data set should be chosen.
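If we go the "modify iris" route, one option is to collapse the classes into one-vs-rest labels. Below is a minimal sketch, assuming the data is in LIBSVM text format (`<label> <index>:<value> ...`) and that PSVM accepts `+1`/`-1` labels; `to_binary` is a hypothetical helper name, not part of PSVM.

```shell
# Hypothetical helper: relabel a multi-class LIBSVM-format file as binary.
# The class given as the first argument becomes +1, every other class -1.
# Assumes whitespace-separated fields with the label in the first column.
to_binary() {
    positive_class="$1"
    input_file="$2"
    awk -v pos="$positive_class" \
        '{ $1 = ($1 == pos) ? "+1" : "-1"; print }' "$input_file"
}
```

For example, `to_binary 1 data/iris > data/iris.binary` would treat class 1 as the positive class (the file names here are placeholders).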

- PSVM paper
  - Binary datasets used in paper
- splice dataset origin

ben-albrecht commented 7 years ago

From the PSVM wiki:

Step 0: Compile PSVM

On a Cray XC, change the following lines in the Makefile:

CC=CC # changed from mpicxx

# C-Compiler flags
CFLAGS=-O3 -Wall

# linker
LD=CC # changed from mpicxx
LFLAGS=-O3 -Wall

Load the GNU programming environment:

> module load PrgEnv-gnu

... and build:

> make

Note: This compiles for MPI. I am not sure how to compile for serial execution.

Step 0.5: Download some test data

> wget -P data/ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/splice.t
> wget -P data/ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/splice

Step 1: Run PSVM

# Train
> mpiexec -n 4 ./svm_train -rank_ratio 0.1 -kernel_type 2 -hyper_parm 1 -gamma 0.01 \
                           -model_path $HOME/psvm/model data/splice

# Predict
> mpiexec -n 4 ./svm_predict -model_path $HOME/psvm/model data/splice.t
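To get at the performance numbers, the training command above can be repeated for several process counts and timed. This is only a rough sketch: `train_cmd` and `run_scaling` are hypothetical helper names, the flags are copied unchanged from Step 1, and timing uses whole seconds from `date`, which is coarse.

```shell
# Print the Step 1 training command for a given number of MPI processes.
train_cmd() {
    nprocs="$1"
    echo "mpiexec -n $nprocs ./svm_train -rank_ratio 0.1 -kernel_type 2 -hyper_parm 1 -gamma 0.01 -model_path $HOME/psvm/model data/splice"
}

# Run the training command once per process count given as an argument
# and report the elapsed wall-clock time in whole seconds.
run_scaling() {
    for nprocs in "$@"; do
        start=$(date +%s)
        sh -c "$(train_cmd "$nprocs")"
        end=$(date +%s)
        echo "$nprocs processes: $((end - start)) s"
    done
}
```

Usage would be something like `run_scaling 1 2 4 8`. On some systems the launcher may not be `mpiexec`, in which case `train_cmd` needs adjusting.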

ben-albrecht commented 7 years ago

References

You can add these into the issue description if you'd like.

- PSVM: Parallelizing Support Vector Machine on Distributed Computers
- Binary datasets including:
  - CoverType
  - RCV
  - Still need the Image dataset...
- splice dataset origin

pgrew commented 7 years ago

I am unable to find the Image dataset, and the RCV dataset has more than two classes, so I don't know which two classes the authors used (PSVM only handles binary classification). For now, I will proceed with reproducing the performance on the CoverType dataset only.

ben-albrecht commented 7 years ago

I am unable to find the Image dataset, and the RCV dataset has more than two classes, so I don't know which two classes the authors used (PSVM only handles binary classification). For now, I will proceed with reproducing the performance on the CoverType dataset only.

We could always contact the author...

pgrew commented 7 years ago

I think that is the right idea. I found this link http://groups.google.com/group/psvm?lnk=srg but I don't have my Google credentials with me at the moment.

ben-albrecht commented 7 years ago

Looking again, it looks like all of the datasets used in the paper are available here, except for Image.

Q: Should I check these into datasets?

A: I think I'll check in the smaller datasets, and include a script that will wget and unpack the larger datasets.
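A rough sketch of what that fetch script could look like. The base URL is the one used in Step 0.5; the archive names in `LARGE_DATASETS` are assumptions and should be checked against the dataset index page before use. `fetch_dataset` and `fetch_all` are hypothetical names. The script defaults to a dry run that only prints the commands.

```shell
# Base URL taken from the Step 0.5 wget commands.
BASE_URL="http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary"

# ASSUMED archive names for the larger datasets; verify before relying on them.
LARGE_DATASETS="covtype.libsvm.binary.bz2 rcv1_train.binary.bz2 rcv1_test.binary.bz2"

# Dry run by default: set DRY_RUN=0 to actually download and unpack.
DRY_RUN="${DRY_RUN:-1}"

# Download one archive into data/ and decompress it in place.
fetch_dataset() {
    archive="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "wget -P data/ $BASE_URL/$archive"
        echo "bunzip2 data/$archive"
    else
        wget -P data/ "$BASE_URL/$archive" && bunzip2 "data/$archive"
    fi
}

# Fetch every archive listed in LARGE_DATASETS.
fetch_all() {
    for archive in $LARGE_DATASETS; do
        fetch_dataset "$archive"
    done
}
```

Running `fetch_all` prints the planned commands; `DRY_RUN=0` set before sourcing the script performs the downloads.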

ben-albrecht commented 7 years ago

and the RCV dataset has more than two classes,

The source says:

# of classes: 2

What leads you to believe there are more than 2 classes? The ID fields in the data file?

pgrew commented 7 years ago

What leads you to believe there are more than 2 classes? The ID fields in the data file?

I was working with a different RCV dataset. I will use your link. The numbers of training/testing samples differ slightly from Table 1 in the paper, but it is likely the correct dataset.
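A quick way to compare sample counts against the paper is to count lines and labels in each file. A minimal sketch, assuming LIBSVM text format with one sample per line and the label in the first column; `summarize_dataset` is a hypothetical helper name.

```shell
# Print the total number of samples and the count per label
# for a LIBSVM-format file (one sample per line, label first).
# The order of the per-label lines is not guaranteed.
summarize_dataset() {
    awk '{ count[$1]++; total++ }
         END {
             printf "samples: %d\n", total
             for (label in count) printf "label %s: %d\n", label, count[label]
         }' "$1"
}
```

Running it on the training and test files also confirms that only two labels are present.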

pgrew commented 7 years ago

For completeness' sake, here is the RCV dataset I was looking at previously: https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection

pgrew / mbb

Reproduce PSVM performance results #6
