traitecoevo / rabbit

An R package to "rabbitly" process recorded triaxial accelerometer outputs

predict.kohonen is failing if the matrix is too big #13

Open wcornwell opened 2 months ago

wcornwell commented 2 months ago

from @jack-bilby:

12: aperm.default(X, c(s.call, s.ans))
11: aperm(X, c(s.call, s.ans))
10: apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction)
9: which(apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction))
8: FUN(X[[i]], ...)
7: lapply(data, function(x) which(apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction)))
6: check.data.na(newdata, maxNA.fraction = maxNA.fraction)
5: map.kohonen(object, newdata = newdata, whatmap = whatmap.new, ...)
4: map(object, newdata = newdata, whatmap = whatmap.new, ...)
3: predict.kohonen(MSOM, newdata = dd, whatmap = 1)
2: predict(MSOM, newdata = dd, whatmap = 1) at #10
1: classify_and_plot(1/60)
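
The traceback shows the failure is inside kohonen's NA check (check.data.na), where apply(x, 1, ...) triggers an aperm() over the whole matrix, which appears to be what chokes on very large inputs. A minimal hedged sketch of an equivalent but memory-lighter check, assuming x is the input matrix and maxNA.fraction the same threshold kohonen uses:

# Sketch only: same row-wise NA-fraction logic as in the traceback above,
# but rowMeans(is.na(x)) avoids the full-matrix aperm() copy that apply() makes.
rows_too_many_na <- function(x, maxNA.fraction) {
  which(rowMeans(is.na(x)) > maxNA.fraction)
}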

wcornwell commented 2 months ago

possible solution from @dfalster :

library(dplyr)  # for mutate() and row_number()

data |>
  mutate(
    chunk = ceiling(row_number() / 10)  # assign consecutive rows to chunks of 10
  ) |>
  split(~chunk) |>
  purrr::map(~ predict(MSOM, newdata = .x, whatmap = 1)) |>
  purrr::list_rbind()
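
One wrinkle with this sketch: predict() on a kohonen object returns a prediction object (a list), so the mapped results would likely need the relevant element pulled out before they can be combined. A hedged variant, assuming the "activity" layer name used later in this thread, and dropping the helper chunk column before prediction:

data |>
  mutate(chunk = ceiling(row_number() / 10)) |>
  split(~chunk) |>
  purrr::map(function(d) {
    m <- as.matrix(dplyr::select(d, -chunk))  # predict() wants a numeric matrix
    predict(MSOM, newdata = m, whatmap = 1)$predictions$activity
  }) |>
  unlist()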

wcornwell commented 2 months ago

possible solution here: 0ca8092f76fe60bdcea655674cc3dbb536478814

needs testing...

wcornwell commented 2 months ago

worked for me, @jack-bilby. Took a little over an hour for the file you sent me.

[Screenshot: 2024-09-05 at 8:56 AM]

dfalster commented 2 months ago

Just noting that the small MSOM object Will has used here won't give very meaningful results, and will likely run faster than the full object would.

If it's that slow, this could be an argument to parallelise this step?
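
For what it's worth, a hedged sketch of one way that parallelisation could look, using furrr (an assumption; any parallel map would do), with dd and MSOM as in the traceback above:

# Sketch only: split the row indices into chunks (chunk size and worker
# count are placeholders) and predict each chunk in a separate worker.
library(future)
library(furrr)
plan(multisession, workers = 4)

idx <- split(seq_len(nrow(dd)), ceiling(seq_len(nrow(dd)) / 1e4))
preds <- future_map(idx, function(i) {
  predict(MSOM, newdata = dd[i, , drop = FALSE], whatmap = 1)$predictions$activity
})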

jack-bilby commented 2 months ago

Will, is that output with that solution you suggested? Or with the original code?

I think it would be a good idea to at least have an option to chunk/parallelise the processes, especially for larger datasets.

wcornwell commented 2 months ago

Working on it

wcornwell commented 2 months ago

Looks like self-organizing map predictions are not row-by-row; they take some kind of complex window-type context into account. So when I chunk the input file, I get edge effects at the chunk boundaries.

# Function to compare full data prediction vs chunk predictions
test_predict_kohonen_behavior <- function(dat, MSOM, chunk_size = 100) {
  # Step 1: Full dataset prediction
  full_data_matrix <- as.matrix(dat[, -1])
  full_prediction <- kohonen:::predict.kohonen(MSOM, newdata = full_data_matrix, whatmap = 1)
  full_activity <- full_prediction$predictions$activity

  # Step 2: Large chunk from the middle of the dataset
  middle_start <- nrow(dat) %/% 2 - chunk_size %/% 2
  middle_chunk <- dat[seq(from = middle_start, length.out = chunk_size), -1]
  middle_chunk_matrix <- as.matrix(middle_chunk)
  middle_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = middle_chunk_matrix, whatmap = 1)
  middle_chunk_activity <- middle_chunk_prediction$predictions$activity

  # Step 3: Edge case - Small chunk from the beginning of the dataset
  first_chunk <- dat[1:chunk_size, -1]
  first_chunk_matrix <- as.matrix(first_chunk)
  first_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = first_chunk_matrix, whatmap = 1)
  first_chunk_activity <- first_chunk_prediction$predictions$activity

  # Step 4: Edge case - Small chunk from the end of the dataset
  last_chunk <- dat[(nrow(dat) - chunk_size + 1):nrow(dat), -1]
  last_chunk_matrix <- as.matrix(last_chunk)
  last_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = last_chunk_matrix, whatmap = 1)
  last_chunk_activity <- last_chunk_prediction$predictions$activity

  # Step 5: Comparison of full dataset predictions with chunk predictions

  # Middle chunk comparison
  middle_full_activity <- full_activity[middle_start:(middle_start + chunk_size - 1)]
  middle_comparison <- middle_chunk_activity == middle_full_activity
  middle_na <- is.na(middle_chunk_activity) | is.na(middle_full_activity)

  cat("Middle chunk comparison:\n")
  if (all(middle_comparison[!middle_na])) {
    cat("Middle chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in middle chunk predictions at indices: ", which(!middle_comparison[!middle_na]), "\n")
  }

  # First chunk comparison
  first_full_activity <- full_activity[1:chunk_size]
  first_comparison <- first_chunk_activity == first_full_activity
  first_na <- is.na(first_chunk_activity) | is.na(first_full_activity)

  cat("First chunk comparison:\n")
  if (all(first_comparison[!first_na])) {
    cat("First chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in first chunk predictions at indices: ", which(!first_comparison[!first_na]), "\n")
  }

  # Last chunk comparison
  last_full_activity <- full_activity[(nrow(dat) - chunk_size + 1):nrow(dat)]
  last_comparison <- last_chunk_activity == last_full_activity
  last_na <- is.na(last_chunk_activity) | is.na(last_full_activity)

  cat("Last chunk comparison:\n")
  if (all(last_comparison[!last_na])) {
    cat("Last chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in last chunk predictions at indices: ", which(!last_comparison[!last_na]), "\n")
  }

  return(list(
    full_activity = full_activity,
    middle_chunk_activity = middle_chunk_activity,
    first_chunk_activity = first_chunk_activity,
    last_chunk_activity = last_chunk_activity,
    middle_comparison = middle_comparison,
    first_comparison = first_comparison,
    last_comparison = last_comparison
  ))
}

test_results <- test_predict_kohonen_behavior(dat, MSOM, chunk_size = 1000)

wcornwell commented 2 months ago

not sure what to do about that behavior. @jack-bilby @dfalster ?

https://en.wikipedia.org/wiki/Self-organizing_map

wcornwell commented 2 months ago

[Screenshot: 2024-09-05 at 1:03 PM]

dfalster commented 2 months ago

Darn. Thanks for investigating. In that case, I can see these options:

  1. Get rid of the parallelisation and just run on a machine with more memory.
  2. Create chunks with overlaps to reduce edge effects, and use the middle section when reconstructing (risky, untested; see the sketch below).
  3. Rewrite the whole predictive step in C++ (time consuming).

I reckon #1 is the way forward.
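
And for the record, a minimal untested sketch of what option 2 could look like: each chunk is extended by a halo of rows on both sides, and only the central section is kept when stitching the predictions back together (chunk and halo sizes below are guesses):

# Untested sketch of option 2: predict on overlapping windows, keep only
# the central rows of each window. Halo width is a placeholder.
overlap_predict <- function(x, model, chunk = 1e4, halo = 500) {
  n <- nrow(x)
  starts <- seq(1, n, by = chunk)
  out <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    s <- starts[i]
    e <- min(s + chunk - 1, n)
    lo <- max(1, s - halo)   # extend the window on both sides
    hi <- min(n, e + halo)
    p <- predict(model, newdata = x[lo:hi, , drop = FALSE], whatmap = 1)
    a <- p$predictions$activity
    out[[i]] <- a[(s - lo + 1):(e - lo + 1)]  # keep only the central section
  }
  unlist(out)
}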

wcornwell commented 2 months ago

Yup.

Aside from the computational annoyance, @jack-bilby I think we can write the methods in a more informed way now.

dfalster commented 2 months ago

BTW - nice work implementing the parallelisation, and then testing for consistency. That was wise to check. Shame it wasn't so easily parallelisable.

wcornwell commented 2 months ago

I think it might be by design, this "neighborhood" prediction thing.

Surprisingly, it's not like random forest at all; it's more like a CNN.
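
A quick way to test that hypothesis (a sketch; m stands for the prediction matrix, e.g. as.matrix(dat[, -1]) from the test function above): shuffle the rows, predict, un-shuffle, and compare against the unshuffled predictions.

# If predictions were purely row-wise, shuffling the input rows and then
# un-shuffling the predictions should reproduce the original result.
set.seed(1)
m <- as.matrix(dat[, -1])
perm <- sample(nrow(m))
p_orig <- predict(MSOM, newdata = m, whatmap = 1)$predictions$activity
p_perm <- predict(MSOM, newdata = m[perm, ], whatmap = 1)$predictions$activity
identical(p_orig, p_perm[order(perm)])  # TRUE would indicate row-wise predictions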

wcornwell commented 2 months ago

Interesting for multiple projects, from ChatGPT:

Row-wise Prediction (Independent instances)

These methods predict based on individual rows, treating each instance as a separate feature vector without explicit context from neighboring rows:

  1. Support Vector Machines (SVM): Classifies each row independently based on the feature space.
  2. Random Forest (RF): Each row is classified independently using decision trees.
  3. Gradient Boosting Machines (XGBoost, LightGBM): Like Random Forests, they classify each row independently once trained on the features.
  4. K-Nearest Neighbors (KNN) (unless combined with DTW, see below): Each row is treated independently, but the prediction is based on the closest rows (neighbors) in the feature space, so it’s somewhat neighborhood-dependent but not sequential.

Neighborhood Prediction (Uses neighboring or temporal context)

These methods take the surrounding or neighboring data points into account when making predictions, making them better suited for sequential and time series data:

  1. Recurrent Neural Networks (RNNs) (including LSTMs and GRUs): Predict by maintaining hidden states that store information from previous time steps, taking neighboring data points into account.

  2. Convolutional Neural Networks (CNNs): Can predict using a "receptive field" that captures information from a window or neighborhood of time steps, making them neighborhood-aware.

  3. Hidden Markov Models (HMMs): Each prediction depends on both the current observation and the hidden state, which is influenced by previous states, effectively using neighboring information.

  4. Dynamic Time Warping (DTW) + KNN: Measures similarity between entire time series or subsequences (neighborhood) instead of individual rows. DTW accounts for time shifts between sequences, and KNN uses the most similar neighbors.

  5. Autoencoders (especially temporal variants): Although often used for feature extraction, temporal autoencoders take neighboring time steps into account.

  6. Self-Organizing Maps (SOMs): Neurons in the map represent clusters of similar data points. The proximity between neurons reflects the similarity of the input data, so neighboring neurons influence predictions.

Summary

Neighborhood methods are generally better suited for time series because they naturally capture the temporal structure and dependencies in the data.