kassambara / factoextra

Extract and Visualize the Results of Multivariate Data Analyses

add loading colors in legend (feature request) #27

Closed ginolhac closed 7 years ago

ginolhac commented 7 years ago

Me again,

Now that the invisible option is fixed in #26 (thanks again!), my goal is to color the quanti.sup while hiding the variables (or loadings). This works fine, but it would be great to add them to the legend. In my case, the quanti.sup names are experiments and the colors should indicate the treated cells.

The ellipses are filled, so they take the fill legend. Great. The remaining issue is the color of the individuals, which should be, let's say, black; otherwise I cannot get a legend for the quanti.sup.

A plot explains the problem better:

library(FactoMineR)  # PCA()
library(factoextra)  # fviz_pca_biplot() and the decathlon2 data set

pca_deca <- PCA(decathlon2, scale.unit = TRUE, graph = FALSE, quanti.sup = 11:12, quali.sup = c(13))
fviz_pca_biplot(pca_deca, invisible = "var", habillage = "Competition",
                addEllipses = TRUE, col.ind = "black", pointshape = 19,
                col.quanti.sup = c("purple", "darkblue"))


You can see that the quanti.sup are properly colored but don't show up in the legend. And my attempt to use "black" for the individuals was a bit naive.

Since I am not sure how to solve this, here is a toy example of what should be achieved:

# example adapted from this answer
# http://stackoverflow.com/a/20291006/1395352
library(FactoMineR)
library(tidyverse)  # %>%, rownames_to_column(), separate(), ggplot()

pca    <- prcomp(iris[, 1:4], retx = TRUE, scale. = TRUE) # scaled pca [exclude species col]
scores <- pca$x[, 1:3]                        # scores for first three PC's

pca_iris <- PCA(iris[, 1:4], graph = FALSE)
# variable coordinates, with names split into flower part and measure
var_iris <- pca_iris$var$coord %>%
  as.data.frame() %>%
  rownames_to_column(var = "var") %>%
  separate(var, into = c("flower", "measure"), sep = "\\.")

# k-means clustering [assume 3 clusters]
km     <- kmeans(scores, centers = 3, nstart = 5)
ggdata <- data.frame(scores, Cluster = km$cluster, Species = iris$Species)
# get some custom colors
my_col_var <- ggsci::pal_npg("nrc")(4)
my_col_ell <- ggsci::pal_uchicago()(3)

ggplot(ggdata) +
  geom_point(aes(x = PC1, y = PC2, shape = factor(Cluster)), size = 2) +
  stat_ellipse(aes(x = PC1, y = PC2, fill = factor(Cluster)),
               geom = "polygon", level = 0.95, alpha = 0.4) +
  geom_segment(data = var_iris, aes(x = 0, xend = Dim.1 * 2, colour = flower,
                                    y = 0, yend = Dim.2 * 2), size = 1.2, arrow = arrow(length = unit(0.03, "npc"))) +
  geom_text(data = var_iris, aes(x = Dim.1 * 2, colour = flower, label = measure,
                                 y = Dim.2 * 2), nudge_x = 0.2, nudge_y = 0.3, show.legend = FALSE) +
  scale_fill_manual(values = my_col_ell) +
  scale_colour_manual(values = my_col_var) +
  labs(fill = "cluster",
       shape = "cluster",
       colour = "loadings")


You can see that this makes it possible to add more information and reduce the amount of text. The shape mapping is not mandatory, I think.

ginolhac commented 7 years ago

I wrote loadings, but I should have written quanti.sup for my specific need. However, I guess it would be useful for both.

kassambara commented 7 years ago

I really appreciate this very well written request.

The idea is to be able to color variables (active and supplementary) by groups so that they will appear in the legends.

I think that this is an interesting feature and I will implement it as soon as possible.

Let me know if you have any other suggestions.

Have a great day, /A

ginolhac commented 7 years ago

Thanks a lot! You summarized the (long) request very well. I have more ideas but will open separate issues later. After watching François Husson talk about PCA, the real diagnostic power of PCA became clear to me! For example, if you see genes that belong to one group but are found in another one, you can investigate further. The same goes for genes that clearly belong to one group but were not included. A great tool. Have a great day too!

ginolhac commented 7 years ago

Hello @kassambara, any chance you have time to look into this feature request?

kassambara commented 7 years ago

I think that the current development version of factoextra already includes a quick solution to your question.

Please install the latest development version and try this:

library(factoextra)

res.pca <- prcomp(iris[, -5], scale. = TRUE)
fviz_pca_biplot(res.pca, label = "var",
                col.ind = iris$Species,
                col.var = c("sepal", "sepal", "petal", "petal"),
                repel = TRUE,
                palette = "jco",
                legend.title = "Group")

What do you think about that?

ginolhac commented 7 years ago

That is nice indeed, but I'd like col.var to have a legend of its own. A trick I used before is to use shape = 21 for the points, so that the fill aesthetic colors the individuals and the colour aesthetic is left for the loadings.

Building on the example above:

ggplot(ggdata) +
  geom_point(aes(x = PC1, y = PC2, fill = factor(Species)), size = 2, shape = 21, colour = "grey90") +
  geom_segment(data = var_iris, aes(x = 0, xend = Dim.1 * 2, colour = flower,
                                    y = 0, yend = Dim.2 * 2), size = 1.2, arrow = arrow(length = unit(0.03, "npc"))) +
  geom_text(data = var_iris, aes(x = Dim.1 * 2, colour = flower, label = measure,
                                 y = Dim.2 * 2), nudge_x = 0.2, nudge_y = 0.3, show.legend = FALSE) +
  scale_fill_manual(values = my_col_ell) +
  scale_colour_manual(values = my_col_var) +
  labs(fill = "species",
       colour = "loadings")
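
For readers who want to try the shape = 21 trick without FactoMineR or the objects defined earlier in this thread, here is a minimal sketch that uses only ggplot2 and prcomp. The names `ind`, `loadings` and `arrow_scale` are made up for illustration, the arrows use the raw prcomp rotation rather than FactoMineR's variable coordinates, and the stretch factor is arbitrary. It is an illustration of the idea, not how factoextra implements it.

```r
library(ggplot2)

# PCA on the four numeric iris columns (scaled)
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Individuals: scores on the first two components plus the grouping factor
ind <- data.frame(pca$x[, 1:2], Species = iris$Species)

# Variables: raw loadings, grouped by flower part ("Sepal" / "Petal"),
# taken from the part of the variable name before the dot
loadings <- data.frame(pca$rotation[, 1:2],
                       group = factor(sub("\\..*$", "", rownames(pca$rotation))))

# Arbitrary stretch factor so the arrows are visible next to the scores
arrow_scale <- 2

p <- ggplot() +
  # shape 21 has a border (colour) and an interior (fill): map groups to fill
  geom_point(data = ind, aes(x = PC1, y = PC2, fill = Species),
             shape = 21, colour = "white", size = 2) +
  # colour is not mapped for the points, so the arrows get their own legend
  geom_segment(data = loadings,
               aes(x = 0, y = 0,
                   xend = PC1 * arrow_scale, yend = PC2 * arrow_scale,
                   colour = group),
               arrow = arrow(length = unit(0.03, "npc"))) +
  labs(fill = "Species", colour = "Loadings")

# print(p) draws the plot
```

Because the points only map `fill` and the arrows only map `colour`, ggplot2 builds two separate legends.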


kassambara commented 7 years ago

New arguments fill.var and fill.ind added.

The following R code should work:

library(factoextra)

res.pca <- prcomp(iris[, -5], scale. = TRUE)
fviz_pca_biplot(res.pca,
                # Fill individuals by groups
                geom.ind = "point",
                pointshape = 21,
                pointsize = 2,
                fill.ind = iris$Species,
                col.ind = "white",

                # Color variable by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal")),

                repel = TRUE) +
  labs(fill = "Species", color = "Clusters")


kassambara commented 7 years ago

After installing the latest development versions of ggpubr and factoextra, the following R code should work:

library(factoextra)

res.pca <- prcomp(iris[, -5], scale. = TRUE)
fviz_pca_biplot(res.pca,
                # Individuals
                geom.ind = "point",
                fill.ind = iris$Species, col.ind = "white",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,
                # Variables
                alpha.var = "contrib", col.var = "contrib",
                gradient.cols = "RdBu") +
  labs(fill = "Species", color = "Contrib", alpha = "Contrib") # Change legend title


ginolhac commented 7 years ago

Looks superb! Thanks a lot for your much appreciated efforts.