snowch / biginsight-examples

Example projects to help you quickly get started with BigInsights
Apache License 2.0

Add BigR LinearRegression example #54

Open snowch opened 8 years ago

snowch commented 8 years ago

For example, the linear regression section of the Big R documentation:

#################################################################
#3. Machine Learning example: building a Linear Regression model
#################################################################

# Remove files from previous executions (if any)
invisible(bigr.rmfs("/user/bigr/examples/airline.sample.* /user/bigr/examples/lm.airline*"))

# Project some relevant columns for modeling / statistical analysis
airlineFiltered <- air[, c("Month", "DayofMonth", "DayOfWeek", "CRSDepTime",
                                "Distance", "ArrDelay")]

# Create a bigr.matrix from the data
airlineMatrix <- bigr.transform(airlineFiltered,
                               outData="/user/bigr/examples/airline.sample.matrix",
                               transformPath="/user/bigr/examples/airline.sample.transform")

# Split the data into 70% for training and 30% for testing
samples <- bigr.sample(airlineMatrix, perc=c(0.7, 0.3))
train <- samples[[1]]
test <- samples[[2]]

# Create a linear regression model
lm <- bigr.lm(ArrDelay ~ ., data=train, directory="/user/bigr/examples/lm.airline")

# Get the coefficients of the regression
coef(lm)

# Calculate predictions for the testing set
pred <- predict(lm, test, "/user/bigr/examples/lm.airline.preds")

Code taken from here: https://www.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.bigr.doc/doc/intro.html?cp=SSPT3X_4.0.0%2F9-1

snowch commented 8 years ago

There is an issue running the example code:

Error: BigR[bigr.transform]: Unhandled missing values were found in the dataset. Missing values must must be handled through imputation (parameter missingAttrs) or omitted (parameter omit.na). To find out which columns contain missing values, use function bigr.which.na.cols().

pregazzoni commented 8 years ago

So there might be some missing values in the data set, and bigr.transform needs an imputation method to know what to replace those NA values with?

snowch commented 8 years ago

Yeah. I was a bit surprised by this failure because the example code is straight from the documentation and uses an example dataset provided with the cluster.
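
The error message names two ways out: omit the rows with missing values, or impute them. As a rough illustration of what those two strategies mean, here is a minimal sketch in plain base R on a small made-up data frame. It deliberately does not call Big R: the `flights` data frame and its values are hypothetical, and the exact arguments that `bigr.transform` expects for `omit.na` and `missingAttrs` are not shown in this thread, so they are not reproduced here.

```r
# Hypothetical stand-in for the airline sample; the values are made up.
flights <- data.frame(
  Month    = c(1, 2, 3, 4, 5),
  Distance = c(300, NA, 450, 1200, 800),
  ArrDelay = c(5, 12, NA, -3, 20)
)

# 1. Find which columns contain missing values
#    (the same question bigr.which.na.cols() is said to answer in Big R).
na_cols <- names(flights)[colSums(is.na(flights)) > 0]
print(na_cols)

# 2a. Omit: drop every row that has at least one NA.
omitted <- na.omit(flights)

# 2b. Impute: replace each NA with the mean of the observed values in its column.
imputed <- flights
for (col in na_cols) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}

# Both results are now free of NAs.
stopifnot(!anyNA(omitted), !anyNA(imputed))
```

Omitting is simpler but throws away rows; mean imputation keeps every row at the cost of shrinking the variance of the imputed columns. Which one is appropriate for the airline sample would depend on how many rows are affected.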