jkrijthe / Rtsne

R wrapper for Van der Maaten's Barnes-Hut implementation of t-Distributed Stochastic Neighbor Embedding

Error: protect(): protection stack overflow #23

Closed ChenMorSays closed 5 years ago

ChenMorSays commented 7 years ago

Hey there!

I'm getting the error in the title when trying to run t-SNE on a large matrix of about 3 million records. I tried increasing R's --max-ppsize option, but unfortunately no luck there. Any ideas how to fix this?

Thanks a bunch! Chen

jkrijthe commented 7 years ago

Thanks for the bug report. Do you have a reproducible example to make it easier to debug the issue?

ChenMorSays commented 7 years ago

I do. The following script runs on a matrix of 3 million columns and about 150 rows:

################################## Install packages if required ##################################
list.of.packages <- c("ggplot2", "Rtsne")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(ggplot2)
library(Rtsne)
##################################################################################################

#1. load and transpose the example matrix
exampleMatrix = read.csv('example.matrix', sep='\t', na.strings="NA", header=TRUE, row.names=1)

##we must normalize the example matrix and then transpose
example_t = t(scale(exampleMatrix, center = TRUE, scale = TRUE));
srs <- read.csv("all-available.txt", sep='\t',na.strings="NA", header=TRUE, row.names=1)

filter_example_matrix <- function (use_case_srs) {
  sr_annotations_with_example_t <- which( row.names(example_t) %in% row.names(use_case_srs) )
  use_case_samples <- example_t[sr_annotations_with_example_t,]

  return(use_case_samples)
}

use_case_samples <- filter_example_matrix(srs)
example_df <- as.data.frame(use_case_samples)## essential for K means
example_df.matrix <-data.matrix(example_df)

## Curating the database for analysis with both t-SNE and PCA
Labels <- row.names(example_df)

## for plotting
colors = rainbow(length(unique(Labels)))
names(colors) = unique(Labels)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~ WILL CRASH AFTER THE FOLLOWING LINE! ~~~~~~~~~~~~~~~~~~~~~~~~~~

## Executing the algorithm on curated data
d_tsne_1 <- Rtsne(example_df[,-1], dims = 3, perplexity=30, verbose=TRUE, max_iter = 500)
#exeTimeTsne <- system.time(Rtsne(example_df[,-1], dims = 3, perplexity=30, verbose=TRUE, max_iter = 500))

## keeping the t-SNE coordinates as a data frame (columns V1, V2, V3)
d_tsne_1_original=as.data.frame(d_tsne_1$Y)

print("Executing k-means")

## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1$Y), 3)

## Export clusters into a CSV file for verification purposes
tsne_clusters <- fit_cluster_kmeans$cluster
write.csv(tsne_clusters, file="tsne_clusters.csv")

d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)

## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1$Y)))

## setting 3 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=3))

#Plotting the cluster models onto t-SNE output
#Now time to plot the result of each cluster model, based on the t-SNE map.

plot_cluster=function(data, var_cluster, palette)  
{
  ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
    geom_point(size=0.25) +
    guides(colour=guide_legend(override.aes=list(size=6))) +
    xlab("") + ylab("") +
    ggtitle("") +
    theme_light(base_size=20) +
    theme(axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          legend.direction = "horizontal", 
          legend.position = "bottom",
          legend.box = "horizontal") + 
    scale_colour_brewer(palette = palette) 
}

plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Accent")  
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")

## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)
grid.arrange(plot_k, plot_h, ncol=2)

## Export the plot into a PDF file for further analysis
## (the device has to be open while drawing, otherwise the PDF stays empty)
pdf("tsne_quality_control_plot.pdf", width=7, height=5)
grid.arrange(plot_k, plot_h, ncol=2)
dev.off()

jkrijthe commented 7 years ago

Since I do not have the data, I cannot use the script directly. I have also been unable to reproduce the behavior on my machine so far. My first guess would be that something goes wrong in the conversion of the data.frame to the matrix used in Rtsne. Have you tried running Rtsne with example_df.matrix instead of example_df? In any case, a simpler example script that uses generated data and triggers the same bug would help me reproduce the problem.
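A minimal sketch of what such a script could look like, assuming random Gaussian data is an acceptable stand-in for the real matrix. The sizes, the seed, and the name example_matrix are illustrative assumptions, not taken from the reporter's data, and this sketch is not known to trigger the error. The Rtsne call is guarded so the data-preparation part runs even when the package is not installed.

```r
# Hypothetical stand-in for the reporter's data: few rows, many columns.
set.seed(42)
n_rows <- 150
n_cols <- 5000  # increase towards the reported size to probe the error
example_df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

# Convert explicitly so that Rtsne receives a plain numeric matrix
# rather than a data.frame.
example_matrix <- data.matrix(example_df)
stopifnot(is.matrix(example_matrix), is.numeric(example_matrix))

# Same t-SNE settings as in the script above; skipped if Rtsne is missing.
if (requireNamespace("Rtsne", quietly = TRUE)) {
  fit <- Rtsne::Rtsne(example_matrix, dims = 3, perplexity = 30,
                      verbose = TRUE, max_iter = 500)
  str(fit$Y)  # embedding coordinates, one row per input row
}
```

Running the same call once with example_df and once with example_matrix, while increasing n_cols, would show whether the data.frame conversion is where the problem starts.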

jkrijthe commented 5 years ago

Unless we have more information/a reproducible example, I'm afraid I can't be of much help. Thanks for reporting the error.

swati051 commented 4 years ago

Same issue. Please help!

jkrijthe commented 4 years ago

Do you have a reproducible example or more details that would help us track down the issue?

Troelsmou commented 2 years ago

I had the same issue as this one. I was able to fix it by converting the data to a numeric matrix, that is, by removing all columns that are not integer or double and then converting the rest with as.matrix.
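The workaround described above might be sketched as follows. The helper name to_numeric_matrix and the toy data frame are hypothetical and only serve as illustration; the actual columns that had to be dropped are not known from this thread.

```r
# Hypothetical helper: keep only integer/double columns of a data.frame
# and return them as a double matrix.
to_numeric_matrix <- function(df) {
  # is.numeric() is TRUE for integer and double columns, FALSE for
  # character and factor columns.
  keep <- vapply(df, is.numeric, logical(1))
  m <- as.matrix(df[, keep, drop = FALSE])
  storage.mode(m) <- "double"
  m
}

# Toy example with one non-numeric column that gets dropped.
toy_df <- data.frame(a = 1:3,
                     b = c("x", "y", "z"),
                     c = c(0.5, 1.5, 2.5),
                     stringsAsFactors = FALSE)
toy_matrix <- to_numeric_matrix(toy_df)
print(dim(toy_matrix))
print(colnames(toy_matrix))
```

The resulting matrix can then be passed to Rtsne in place of the data.frame.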

jkrijthe commented 2 years ago

Thanks for the report and solution. Do you happen to have a small reproducible example for when this behaviour occurs, or was it specific to the dataset you were working on?

Troelsmou commented 2 years ago

I think it's specific to the dataset. The dataset consists of 157 rows and 848259 integer columns with values of 0, 1 or 2.

SamGG commented 2 years ago

I am quite dubious about using t-SNE to represent a small number of points. OK, it works on iris, but still.
a) I would try running t-SNE with a couple of seeds and a lower perplexity to check how stable the output is.
b) When the number of dimensions is higher than 50, a PCA is applied by default. I would compute it myself and pass the result to Rtsne, which probably avoids all those strange behaviors that I have never encountered.
c) When pre-computing the PCA, I would use the first two components to set the Y_init parameter and check whether early exaggeration needs to be restored, as setting Y_init skips it. Check the docs about this.
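Suggestions b) and c) could look roughly like the sketch below. It assumes base R's prcomp is used for the PCA and uses generated 0/1/2 data with far fewer columns than the dataset described above; the sizes, seed, and variable names are illustrative assumptions. The Rtsne call is guarded so the PCA part runs without the package installed.

```r
# Hypothetical stand-in: 157 rows of 0/1/2 values, far fewer columns
# than the real dataset.
set.seed(1)
x <- matrix(sample(0:2, 157 * 500, replace = TRUE), nrow = 157)
storage.mode(x) <- "double"

# b) Pre-compute the PCA and keep the first 50 components.
pca <- prcomp(x, center = TRUE, scale. = FALSE, rank. = 50)
scores <- pca$x  # 157 x 50 matrix of principal component scores

# c) Use the first two components as the initial embedding.
y_init <- scores[, 1:2]

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # pca = FALSE because the input is already PCA-reduced.
  # With Y_init set, the early exaggeration phase is skipped by default;
  # see the Rtsne documentation (stop_lying_iter, mom_switch_iter) if it
  # should be restored.
  fit <- Rtsne::Rtsne(scores, dims = 2, perplexity = 30,
                      pca = FALSE, Y_init = y_init)
  str(fit$Y)
}
```

For a), the same call can be repeated without Y_init after different set.seed() values and with a lower perplexity, and the resulting maps compared by eye.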