rgcca-factory / RGCCA

https://rgcca-factory.github.io/RGCCA/

Limit the number of variables to plot on correlation circle ? #78

Open PFRoux opened 5 months ago

PFRoux commented 5 months ago

Thanks a lot for this really awesome package.

Is there a way to set a cutoff, in terms of correlation or projection, to select only a subset of the variables when using plot(fit, type = "cor_circle")?

Thanks a lot.

Pierre-François

GFabien commented 5 months ago

Hi Pierre-François,

I am glad you like the package!

You are right. There is currently no way of limiting the number of variables shown in the correlation circles. For other types of plots, we chose to have an n_mark parameter that limits the number of objects displayed. Would you like a similar feature for correlation circles, or would you prefer to give a cutoff value? Would you like both? I guess a cutoff would also work in all plots except the sample plot.

In the case of the n_mark solution, you decide if you want to sort the variables in descending order of correlations with the display_order argument and show only the first n_mark variables.
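To make the n_mark behaviour concrete, here is a minimal sketch on a toy data frame rather than on an RGCCA fit. The helper `keep_top()` and the column names are hypothetical, and ranking by absolute value is an assumption about what "descending order of correlations" means here.

```r
# Toy table: one value (e.g. a correlation) per variable.
df <- data.frame(
  name  = c("v1", "v2", "v3", "v4", "v5"),
  value = c(0.10, -0.85, 0.40, 0.92, -0.05),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of RGCCA.
keep_top <- function(df, n_mark, display_order = TRUE) {
  if (display_order) {
    # Assumption: rank by absolute value, largest first.
    df <- df[order(abs(df$value), decreasing = TRUE), , drop = FALSE]
  }
  # Keep at most n_mark rows.
  df[seq_len(min(n_mark, NROW(df))), , drop = FALSE]
}

top <- keep_top(df, n_mark = 2)
print(top$name)
```

With display_order = FALSE the same helper simply keeps the first n_mark rows in their original order.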

Fabien

PFRoux commented 5 months ago

Of course I love it! It is just awesome, and I am only just starting to play with it :-)

It would be great to be able to set thresholds on the correlation and/or the quality of projection (cos2) for the correlation circle and the biplot, the way the author of factoextra implemented it for FactoMineR objects.

I guess I could try to implement something by forking the current version on GitHub. But I am really not proficient at developing R packages.

Do you think this would be too time-consuming or difficult to do?

All the best and thanks again !

Pef

GFabien commented 5 months ago

Ok, I see that they have the select.var argument, which is a list where you can specify names, a cutoff value, or the number of variables. It makes sense to implement something like this.

I can do it, but feel free if you want to do it! The only file to change is plot.rgcca.R. I would probably remove the n_mark argument and replace it with the two arguments sample_select and var_select, which would be the equivalents of the arguments select.ind and select.var. Then, we need to implement a strategy for selecting the right samples/variables similarly to what is done in factoextra and replace the lines like

df <- df[seq(min(n_mark, NROW(df))), ]

with the new selection procedure. I would use sample_select when impacting the number of samples (type = "samples", type = "biplot", and type = "both"), and var_select when impacting the number of variables (all types except "samples" and "ave").

So please tell me what you prefer and if you agree with my proposition.

Best, Fabien

PFRoux commented 5 months ago

That sounds great !

I need to think a bit about it but I think this is exactly matching what I would love.

Another thing I thought about this afternoon is to implement an equivalent of the cim function in the mixOmics package, which draws a heatmap with biclustering for selected variables while showing the class each sample belongs to and the block each variable belongs to. That would be super interesting for SGCCA. A circos plot might be useful as well.

I would be happy to give a hand on this.

What do you think ?

Thanks a lot !

Pef

PFRoux commented 5 months ago

Hi Fabien,

I hope you're doing well.

I've started digging a bit into the code, and I think it's a little beyond my skills. I would definitely spend hours trying to develop what we've been thinking about without really knowing the full structure of the package.

Do you think you can do it in a reasonable amount of time? That would be awesome and so helpful.

Thanks a lot and have a great day.

++

Pef

GFabien commented 5 months ago

Hi Pef,

Yes, I could do it this week.

I looked at the cim function from mixOmics but had trouble visualizing the results on my laptop screen. It would require much more thinking and work, so we will wait to add it.

Best, Fabien

PFRoux commented 5 months ago

Thank you so much Fabien.

Here is a screenshot of the output you can obtain with the cim function of mixOmics.

[image: screenshot of the output of the mixOmics cim function]

Thanks again and have a great day. Let me know if I can help in any way.

Best,

Pef

GFabien commented 4 months ago

Hi @PFRoux

I'm sorry I didn't take the time last week. I've worked on it today. I thought of different modifications compared to what we discussed and would like your opinion on that.

First, instead of the names "name", "cos2", and "contrib", I propose to use "name", "value", and "number". "name" takes precedence over "value", which itself takes precedence over "number". Setting "name" allows the user to get the variables or samples with matching names. "value" gives all the variables/samples with values higher than the given value (it can be a correlation value, but also something else depending on the plot involved). "number" provides the first n variables/samples.
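The precedence described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the RGCCA branch: `select_rows()` and the `name`/`value` columns are hypothetical, the threshold is compared to absolute values, and the data frame is assumed to be already sorted in display order.

```r
# Hypothetical helper illustrating the proposed precedence:
# "name" > "value" > "number". Not part of RGCCA.
select_rows <- function(df, select = list()) {
  if (!is.null(select[["name"]])) {
    # Keep only rows whose name matches one of the requested names.
    return(df[df$name %in% select[["name"]], , drop = FALSE])
  }
  if (!is.null(select[["value"]])) {
    # Assumption: compare absolute values to the threshold.
    return(df[abs(df$value) > select[["value"]], , drop = FALSE])
  }
  if (!is.null(select[["number"]])) {
    # Assumption: df is already sorted in the desired display order.
    n <- min(select[["number"]], NROW(df))
    return(df[seq_len(n), , drop = FALSE])
  }
  df
}

df <- data.frame(
  name  = c("a", "b", "c", "d"),
  value = c(0.9, -0.7, 0.3, 0.1),
  stringsAsFactors = FALSE
)

by_name   <- select_rows(df, list(name = c("c", "a"), value = 0.5))
by_value  <- select_rows(df, list(value = 0.5, number = 1))
by_number <- select_rows(df, list(number = 3))
```

In this sketch, `by_name` ignores the `value` entry and `by_value` ignores the `number` entry, which is the precedence rule in action.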

Now, it leads to issues we did not have to consider before. Previously, we had a single value per variable/sample in the plots where n_mark could be used, so it was easy to select based on a value or choose to take the n highest values. With correlation circles, we have two values. What would be the most meaningful way to get a single value to base the selection? Simple solutions can be:

What do you think? @Tenenhaus, don't hesitate to let us know if you have any ideas.

Best, Fabien

PFRoux commented 4 months ago

Thank you so much for your feedback, and sorry for the delay: I was on vacation for a couple of days. I really like your suggestion regarding the way to implement the selection of variables.

Regarding the selection process in 2D plots, what about offering 2 options: 1) putting a correlation threshold on either of the 2 selected components; 2) putting a projection threshold in the same flavor as the cos2 cutoff in PCA.

Do you think it would be difficult to implement ?

Thanks a lot for your help.

Pef

GFabien commented 4 months ago

I am trying to understand what their cos2 represents. It is a normalization of the weight associated with each variable in each component, but I did not see the formula for how it is calculated. Because of the constraints in our models, the weights are already between -1 and 1 (or 0 and 1 if you square them), so we do not need an extra normalization. Do you have a specific interpretation of this cos2 measure?

There would not be too many differences between putting a correlation threshold on either of the two components or putting a threshold on the distance to the center. So, proposition 1 makes perfect sense.
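A small sketch comparing the two rules on made-up numbers may help. The correlations, the threshold, and the column names are all invented for illustration; nothing here comes from the package.

```r
# Toy correlations of five variables with the two plotted components.
cors <- data.frame(
  name  = c("v1", "v2", "v3", "v4", "v5"),
  comp1 = c(0.90, 0.10, 0.55, -0.20, 0.05),
  comp2 = c(0.05, -0.80, 0.55, 0.30, 0.10),
  stringsAsFactors = FALSE
)
threshold <- 0.6

# Rule 1: absolute correlation above the threshold on either component.
either <- pmax(abs(cors$comp1), abs(cors$comp2)) > threshold

# Rule 2: distance to the center of the circle above the threshold.
radius <- sqrt(cors$comp1^2 + cors$comp2^2)
distance <- radius > threshold

sel_either   <- cors$name[either]
sel_distance <- cors$name[distance]
```

Since the distance to the center is never smaller than the larger of the two absolute correlations, every variable kept by rule 1 is also kept by rule 2. The rules only disagree for variables such as v3 here, which are moderately correlated with both components.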

I do not think it would be difficult to implement, but I do not see the point in implementing both propositions. What would be the benefit of having both instead of just proposition 1?

Best, Fabien

PFRoux commented 4 months ago

Hi Fabien,

I am facing the same problem as you in finding out what the cos2 really corresponds to: I cannot find the formula. But as you mention, putting a threshold on the correlation with either of the 2 components should already be really effective. So implementing solution 1 only should do the trick :-)

Best,

Pef

GFabien commented 4 months ago

Hi @PFRoux you can try out this branch https://github.com/rgcca-factory/RGCCA/tree/limit_number_of_var_in_plots to see if it suits your needs!

Best, Fabien

PFRoux commented 4 months ago

Thank you so much @GFabien for your help. I'll give it a try and keep you posted.

Best,

Pef

GFabien commented 1 month ago

Hi @PFRoux, How are you? Did you have the opportunity to try the new implementation? Best, Fabien