feedback on structure_plot

stephens999 commented 3 years ago

Since I tried this out today, I thought I'd give some feedback:

I had some issues with the grouping, I think because I used strings instead of a factor. Maybe check for this, or use as.factor to convert strings to factors?
I also had some issues with misspecifying the topics (e.g., more than were in the fit) and the wrong number of colors; it errored out, but only after spending some time doing calculations. It would probably be better to fail on these things before doing any calculations.
It isn't clear to a novice user what all those calculations are for. Presumably it is computing an ordering? I wonder if there are faster ways for those who just want a quick plot?
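
To make the first two points concrete, here is a minimal sketch of the kind of up-front checking meant here. The helper check_structure_plot_inputs and its arguments are hypothetical and not part of fastTopics; the sketch also assumes one color is needed per topic in the fit.

```r
# Hypothetical helper, not part of fastTopics: validate the plotting
# inputs up front so that mistakes are reported before any expensive
# computation (such as an embedding) is started.
check_structure_plot_inputs <- function (L, grouping, topics, colors) {
  n <- nrow(L)
  k <- ncol(L)

  # Accept character vectors by converting them to a factor.
  if (!is.factor(grouping))
    grouping <- as.factor(grouping)
  if (length(grouping) != n)
    stop("\"grouping\" should have one entry per row of the loadings matrix")

  # Every requested topic must exist in the fit.
  if (!all(topics %in% seq_len(k)))
    stop("\"topics\" should only contain values between 1 and ",k)

  # Assumption of this sketch: one color is needed per topic in the fit.
  if (length(colors) < k)
    stop("\"colors\" should contain at least ",k," colors")

  list(grouping = grouping,topics = topics,colors = colors)
}

# Toy loadings matrix with 6 samples and 3 topics.
L <- matrix(1/3,6,3)
out <- check_structure_plot_inputs(L,grouping = c("a","a","b","b","c","c"),
                                   topics = 1:3,
                                   colors = c("gold","skyblue","tomato"))
is.factor(out$grouping)
```

With checks like these at the top of the function, asking for a topic that is not in the fit (say, topics = 1:4 here) stops immediately rather than after the ordering has been computed.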

pcarbo commented 3 years ago

Thanks for the suggested improvements. I admit that the existing interface is a bit rough and requires some rethinking.

pcarbo commented 3 years ago

@stephens999 I've overhauled the interface and implementation of the Structure plots. This new version should address all the issues you mentioned above. Please close the issue if you feel that it satisfactorily addresses these points; otherwise, let me know if you would like additional improvements.

Here are some examples illustrating several variations on the Structure plot:

library(fastTopics)
set.seed(1)
data(pbmc_facs)

# Get the multinomial topic model fitted to the
# PBMC data.
fit <- pbmc_facs$fit

# Create a Structure plot without labels. The samples (rows) are
# automatically arranged along the x-axis using t-SNE to highlight
# the structure in the data.
p1 <- structure_plot(fit)

# Create a Structure plot with the FACS cell-type labels. Within each
# group (cell-type), the cells (rows) are automatically arranged using
# t-SNE.
subpop <- pbmc_facs$samples$subpop
p2 <- structure_plot(fit,grouping = subpop)

# Next, we apply some customizations to improve the plot: (1) use the
# "topics" argument to specify the order in which the topic
# proportions are stacked on top of each other; (2) use the "gap"
# argument to increase the whitespace between the groups; (3) use "n"
# to decrease the number of rows included in the Structure plot; and
# (4) use "colors" to change the colors used to draw the topic
# proportions.
topic_colors <- c("skyblue","forestgreen","darkmagenta",
                  "dodgerblue","gold","darkorange")
p3 <- structure_plot(fit,grouping = pbmc_facs$samples$subpop,gap = 20,
                     n = 1500,topics = c(5,6,1,4,2,3),colors = topic_colors)

# In this final example, we use UMAP instead of t-SNE to arrange all
# 3,744 cells in the Structure plot. Note that this can be
# accomplished in a different way by overriding the default setting of
# "embed_method".
y <- drop(umap_from_topics(fit,dims = 1))
p4 <- structure_plot(fit,rows = order(y),grouping = subpop,gap = 40,
                     colors = topic_colors)

pcarbo commented 3 years ago

@stephens999 @kevinlkx Reminding you again I would appreciate some feedback on the new improvements. Even basic feedback would be welcome.

stephens999 commented 3 years ago

this is a great improvement!

I think I have just one main suggestion. I think I understand the rows argument, but it took me a little while. This is partly because it is doing two things: specifying which rows to plot and also the order. I think it would be more powerful (and perhaps clearer) if the default behavior is to order according to the "embed_method", and to have an embed_method = "none" option that just uses the order of the samples (which is determined by rows if specified).

This way the "default" behavior of rows is just to say which rows to include in the plot -- they would be ordered as usual by the grouping and the embed method. And the current behaviour would be achieved by combining rows with embed_method = "none".

I guess in principle one might like to order the topics differently in each group? I'll leave that as a potential "enhancement"....
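
A rough sketch of the semantics I have in mind is below. It is only an illustration in base R, not the fastTopics implementation: the function order_samples is hypothetical, grouping is left out for brevity, and a 1-d PCA projection stands in for t-SNE to keep the example fast.

```r
# Hypothetical sketch, not the fastTopics implementation. "rows" only
# selects the samples; "embed_method" decides how they are ordered.
# A 1-d PCA projection stands in for t-SNE to keep the example fast.
order_samples <- function (L, rows = seq_len(nrow(L)),
                           embed_method = c("pca","none")) {
  embed_method <- match.arg(embed_method)

  # With "none", keep the order given by "rows".
  if (embed_method == "none")
    return(rows)

  # Otherwise, return the same rows ordered by the 1-d embedding.
  y <- prcomp(L[rows,,drop = FALSE])$x[,1]
  rows[order(y)]
}

# Toy topic proportions: 10 samples, 4 topics.
set.seed(1)
L <- matrix(runif(40),10,4)
L <- L/rowSums(L)
rows <- c(7,2,9,4)
order_samples(L,rows,embed_method = "none")  # Rows in the order supplied.
order_samples(L,rows)                        # Same four rows, reordered.
```

The point of the design is that rows never changes which ordering method is used; it only restricts the set of samples, and embed_method = "none" recovers the current behaviour.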

stephens999 commented 3 years ago

also is n equivalent to rows= sample(1:nrow(fit),replace=FALSE, n) ? I guess not... but which n rows are plotted if n is supplied?

pcarbo commented 3 years ago

Thanks for the feedback @stephens999. I think these issues you raise stem from the fact that there are 3 inputs (rows, n, embed_method), and the interface might be more clear if I collapse them into 2 inputs.

Yes I agree one might like to order the topics differently for each group, but ggplot2 does not allow this (or not easily, at least).

I'll wait to get thoughts from @kevinlkx before making changes.

kevinlkx commented 3 years ago

I tried this on my datasets. It works well in general.

My major question/comment is about the rows. If I just want to plot a subset of the samples (e.g. first 2000 samples), shall I set rows = 1:2000? I tried that in your p4 example, as below, and the results look strange to me.

p4 <- structure_plot(fit,rows = 1:2000,grouping = subpop,gap = 40,
                     colors = topic_colors)

Similarly, in my mouse scATAC-seq data, we have many samples, so I just want to show a random subset of samples. In our earlier version, we had a select function for the poisson_nmf_fit object, but it seems that the select function is no longer available. It seems setting rows to a subset of the samples may cause conflict with the grouping (for my data). Can you provide an example for people who just want to show a subset of the samples?
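
To illustrate what I mean by keeping the subset and the grouping consistent, here is a small base R sketch. It is not the fastTopics API: the helper subsample_by_group is hypothetical, and the loadings matrix is a toy example. The idea is to draw the random subset within each group, then subset the loadings and the grouping with the same indices.

```r
# Hypothetical sketch in base R, not the fastTopics API: draw a random
# subset of samples within each group, then subset the loadings and
# the grouping with the same indices so that they stay aligned.
subsample_by_group <- function (grouping, n_per_group) {
  idx  <- split(seq_along(grouping),grouping)
  rows <- lapply(idx,function (i) {
    # Index via sample.int to avoid the behaviour of sample() when
    # "i" has length 1.
    i[sample.int(length(i),min(length(i),n_per_group))]
  })
  sort(unlist(rows,use.names = FALSE))
}

# Toy topic proportions: 20 samples, 3 topics, two unequal groups.
set.seed(1)
L <- matrix(runif(60),20,3)
L <- L/rowSums(L)
grouping <- factor(rep(c("a","b"),times = c(15,5)))

rows         <- subsample_by_group(grouping,n_per_group = 4)
L_sub        <- L[rows,,drop = FALSE]
grouping_sub <- grouping[rows]
table(grouping_sub)
```

Sampling within groups, rather than from all rows at once, keeps small groups from disappearing from the plot.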

Also, I think the name is a bit confusing because "rows" refers to the rows (orders) in the sample data matrix, but they are shown as the "columns" in the structure plot. Maybe use a different name, such as samples or sample_order.

It seems the t-SNE results look better than UMAP for my datasets, and the umap_from_topics function takes longer to run on my scATAC-seq data. I agree with the suggestions from @stephens999 about the default ordering of the samples.
I would prefer the default order of the topics in the legend to be from 1 to the number of topics.

pcarbo commented 3 years ago

Thanks for the comments @kevinlkx!

> It seems setting rows to a subset of the samples may cause conflict with the grouping (for my data).

The "grouping" argument should always specify the groups for all the samples. I'm not sure what you mean by "cause conflict with grouping"?

> If I just want to plot a subset of the samples (e.g. first 2000 samples), shall I set rows = 1:2000?

No, you would use the "n" argument, e.g.,

p5 <- structure_plot(fit,n = 2000,grouping = subpop)

Also note that there is still a "select" function (and "select_loadings", which does the same thing).

I agree with the confusion in the "rows" argument. I will change the interface to make it less confusing.

pcarbo commented 3 years ago

@kevinlkx @stephens999 Following your suggestions I made a few changes to structure_plot. The main change is that I've removed the rows argument, and there is a new argument loadings_order = "embed" ("embed" is the default to make it clear that the ordering is determined by the 1-d t-SNE embedding).

I also made some improvements to help(structure_plot) to clarify some of the points of confusion you raised.

stephenslab / fastTopics

feedback on structure_plot #12