Open wcornwell opened 2 months ago
possible solution from @dfalster :
data |>
mutate(
chunk = (row_number() - 1) %/% 10  # one chunk id per row: consecutive blocks of 10 rows
) |>
split(~chunk) |>
purrr::map(~predict(MSOM, newdata = .x, whatmap = 1)) |> purrr::list_rbind()
possible solution here: 0ca8092f76fe60bdcea655674cc3dbb536478814
needs testing...
worked for me, @jack-bilby . Took a little over an hour for the file you sent me
Just noting the small MSOM object Will has used here won't give very meaningful results, and will likely be faster than using the full object.
If it's that slow, this could be an argument to parallelise this step?
Will, is that output with that solution you suggested? Or with the original code?
I think it would be a good idea to at least have an option to chunk/parallelise the processes, especially for larger datasets.
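A minimal sketch of what such an option could look like, in base R. `predict_in_chunks` and `toy_predict` are hypothetical names, not part of this package or of kohonen, and the approach only gives correct results if the prediction function treats each row independently:

```r
# Hypothetical helper: predict on consecutive row chunks, optionally in parallel.
# Assumes predict_fn returns one value per input row and is row-independent.
predict_in_chunks <- function(newdata, predict_fn, chunk_size = 1000, cores = 1) {
  n <- nrow(newdata)
  # Chunk id for every row: 1, 1, ..., 2, 2, ...
  chunk_id <- (seq_len(n) - 1) %/% chunk_size + 1
  idx <- split(seq_len(n), chunk_id)
  # mclapply forks on Unix-alikes; on Windows keep cores = 1
  results <- parallel::mclapply(
    idx,
    function(rows) predict_fn(newdata[rows, , drop = FALSE]),
    mc.cores = cores
  )
  unlist(results, use.names = FALSE)
}

# Toy stand-in for a row-wise model (not a SOM): row sums
toy_predict <- function(x) rowSums(x)
m <- matrix(seq_len(50), ncol = 5)
chunked <- predict_in_chunks(m, toy_predict, chunk_size = 3)
```

For the real model, `predict_fn` would wrap the `predict(MSOM, newdata = ..., whatmap = 1)` call, and the chunked output would need to be checked against a full-data run before being trusted.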
Working on it
It looks like self-organizing map predictions are not row-by-row; they seem to use some kind of complex window. So when I chunk the input file, I get edge effects on the chunks.
# Function to compare full data prediction vs chunk predictions
test_predict_kohonen_behavior <- function(dat, MSOM, chunk_size = 100) {
# Step 1: Full dataset prediction
full_data_matrix <- as.matrix(dat[, -1])
full_prediction <- kohonen:::predict.kohonen(MSOM, newdata = full_data_matrix, whatmap = 1)
full_activity <- full_prediction$predictions$activity
# Step 2: Large chunk from the middle of the dataset
middle_start <- nrow(dat) %/% 2 - chunk_size %/% 2
middle_chunk <- dat[seq(from = middle_start, length.out = chunk_size), -1]
middle_chunk_matrix <- as.matrix(middle_chunk)
middle_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = middle_chunk_matrix, whatmap = 1)
middle_chunk_activity <- middle_chunk_prediction$predictions$activity
# Step 3: Edge case - Small chunk from the beginning of the dataset
first_chunk <- dat[1:chunk_size, -1]
first_chunk_matrix <- as.matrix(first_chunk)
first_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = first_chunk_matrix, whatmap = 1)
first_chunk_activity <- first_chunk_prediction$predictions$activity
# Step 4: Edge case - Small chunk from the end of the dataset
last_chunk <- dat[(nrow(dat) - chunk_size + 1):nrow(dat), -1]
last_chunk_matrix <- as.matrix(last_chunk)
last_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = last_chunk_matrix, whatmap = 1)
last_chunk_activity <- last_chunk_prediction$predictions$activity
# Step 5: Comparison of full dataset predictions with chunk predictions
# Middle chunk comparison
middle_full_activity <- full_activity[middle_start:(middle_start + chunk_size - 1)]
middle_comparison <- middle_chunk_activity == middle_full_activity
middle_na <- is.na(middle_chunk_activity) | is.na(middle_full_activity)
cat("Middle chunk comparison:\n")
if (all(middle_comparison[!middle_na])) {
cat("Middle chunk predictions match full data.\n")
} else {
cat("Discrepancies found in middle chunk predictions at indices: ", which(!middle_comparison & !middle_na), "\n")  # indices relative to the chunk
}
# First chunk comparison
first_full_activity <- full_activity[1:chunk_size]
first_comparison <- first_chunk_activity == first_full_activity
first_na <- is.na(first_chunk_activity) | is.na(first_full_activity)
cat("First chunk comparison:\n")
if (all(first_comparison[!first_na])) {
cat("First chunk predictions match full data.\n")
} else {
cat("Discrepancies found in first chunk predictions at indices: ", which(!first_comparison & !first_na), "\n")  # indices relative to the chunk
}
# Last chunk comparison
last_full_activity <- full_activity[(nrow(dat) - chunk_size + 1):nrow(dat)]
last_comparison <- last_chunk_activity == last_full_activity
last_na <- is.na(last_chunk_activity) | is.na(last_full_activity)
cat("Last chunk comparison:\n")
if (all(last_comparison[!last_na])) {
cat("Last chunk predictions match full data.\n")
} else {
cat("Discrepancies found in last chunk predictions at indices: ", which(!last_comparison & !last_na), "\n")  # indices relative to the chunk
}
return(list(
full_activity = full_activity,
middle_chunk_activity = middle_chunk_activity,
first_chunk_activity = first_chunk_activity,
last_chunk_activity = last_chunk_activity,
middle_comparison = middle_comparison,
first_comparison = first_comparison,
last_comparison = last_comparison
))
}
test_results <- test_predict_kohonen_behavior(dat, MSOM, chunk_size = 1000)
not sure what to do about that behavior. @jack-bilby @dfalster ?
Darn. Thanks for investigating. In that case, I can see these options
I reckon #1 is the way forward.
Yup.
Aside from the computational annoyance, @jack-bilby I think we can write the methods in a more informed way now.
BTW - nice work implementing the parallelisation, and then testing for consistency. That was wise to check. Shame it wasn't so easily parallelisable.
I think this "neighborhood" prediction thing might be by design.
surprisingly, it's not like random forest at all, more like CNN.
interesting for multiple projects, from ChatGPT:
These methods predict based on individual rows, treating each instance as a separate feature vector without explicit context from neighboring rows:
These methods take the surrounding or neighboring data points into account when making predictions, making them better suited for sequential and time series data:
Recurrent Neural Networks (RNNs) (including LSTMs and GRUs): Predict by maintaining hidden states that store information from previous time steps, taking neighboring data points into account.
Convolutional Neural Networks (CNNs): Can predict using a "receptive field" that captures information from a window or neighborhood of time steps, making it neighborhood-aware.
Hidden Markov Models (HMMs): Each prediction depends on both the current observation and the hidden state, which is influenced by previous states, effectively using neighboring information.
Dynamic Time Warping (DTW) + KNN: Measures similarity between entire time series or subsequences (neighborhood) instead of individual rows. DTW accounts for time shifts between sequences, and KNN uses the most similar neighbors.
Autoencoders (especially temporal variants): Although often used for feature extraction, temporal autoencoders take neighboring time steps into account.
Self-Organizing Maps (SOMs): Neurons in the map represent clusters of similar data points. The proximity between neurons reflects the similarity of the input data, so neighboring neurons influence predictions.
Neighborhood methods are generally better suited for time series because they naturally capture the temporal structure and dependencies in the data.
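To make the row-wise versus neighborhood distinction concrete, here is a toy sketch in base R. The two "models" are stand-ins chosen for illustration (a squared value and a 3-point moving average), not kohonen's actual prediction code; `apply_chunked` is a hypothetical helper. It shows why chunking is harmless for a row-wise method but produces edge effects for a windowed one:

```r
# Row-wise "model": the output depends only on the current value
row_wise <- function(x) x^2

# Neighborhood-aware "model": centred 3-point moving average
# (stats::filter returns NA where the window runs off either end)
windowed <- function(x) as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 2))

# Apply f separately to consecutive chunks and glue the results together
apply_chunked <- function(x, f, chunk_size) {
  chunk_id <- (seq_along(x) - 1) %/% chunk_size
  unlist(lapply(split(x, chunk_id), f), use.names = FALSE)
}

x <- c(1, 4, 2, 8, 5, 7, 3, 9, 6, 10, 12, 11)

# Row-wise: chunked and full results are identical
row_same <- identical(apply_chunked(x, row_wise, 4), row_wise(x))

# Windowed: the chunked run loses values at every internal chunk boundary
win_full <- windowed(x)
win_chunk <- apply_chunked(x, windowed, 4)
edge_positions <- which(is.na(win_full) != is.na(win_chunk))
print(edge_positions)  # 4 5 8 9
```

With chunks of four, the windowed output differs from the full run only at the rows next to internal chunk boundaries, which is the same kind of pattern a full-versus-chunk comparison would reveal for a model that looks at neighboring rows.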
from: @jack-bilby :