Clustering order supersedes manual groups

micwij commented 1 year ago

I was trying to use tidyHeatmap to make heatmaps of metabolomics data when I noticed some strange behaviour: rows escaped their manually assigned grouping and ended up in the wrong group. It is a bit tricky to explain, so here is a small example:

library(tidyverse)
library(tidyHeatmap)

example <- tribble(
  ~Compound_Name,   ~Compound_Class, ~col, ~log2fc,
  "L-homoserineAA", "AA",            1,     2.93,
  "cellobioseCH",   "CH",            1,     2.09,
  "D-maltoseCH",    "CH",            1,     3.08,
  "pectinCH",       "CH",            1,    -3.04,
  "raffinoseCH",    "CH",            1,    -2.10
)

example %>% group_by(Compound_Class) %>% heatmap(.row = Compound_Name, .col = col, .value = log2fc)

example2 <- example %>% mutate(Compound_Name = as_factor(Compound_Name))

example2 %>% group_by(Compound_Class) %>% heatmap(.row = Compound_Name, .col = col, .value = log2fc)

AA stands for amino acid and CH stands for carbohydrate (this is not important for the understanding of the issue, just to provide some context). I also added the compound class to the end of the compound name.

When the .row variable is just a character vector, D-maltoseCH is switched with L-homoserineAA and both show up in the wrong group (presumably due to the clustering by value?).

[Image: example]

When Compound_Name is mutated into a factor, both are assigned correctly.

[Image: example2]

I don't know whether this is an issue with tidyHeatmap or with the underlying ComplexHeatmap package, but I think it would be important to find out and fix this behavior. Transforming the .row variable into a factor seems to work, but I am not sure whether that is how this variable is most commonly supplied.

Let me know if something is unclear.

Sorry for this somewhat strange example. I tried to recreate it with mtcars or diamonds, but I wasn't able to reproduce the behavior.

stemangiola commented 1 year ago

Thanks for the heads up. Could you check whether the behaviour also occurs when:

- having two columns
- factoring and grouping (rather than the other way around)

micwij commented 1 year ago

> Thanks for the heads up. If you could check if the behaviour occurs
>
> having two columns

Yes, this also occurs with two or more columns (my original data has more than 10 columns). Here is a replacement for the example above, where I added a second column and modified the values slightly.

example <- tribble(
  ~Compound_Name,   ~Compound_Class, ~col, ~log2fc,
  "L-homoserineAA", "AA",            1,     2.93,
  "cellobioseCH",   "CH",            1,     2.09,
  "D-maltoseCH",    "CH",            1,     1.08,
  "pectinCH",       "CH",            1,    -3.04,
  "raffinoseCH",    "CH",            1,    -2.10,
  "L-homoserineAA", "AA",            2,    -2.10,
  "cellobioseCH",   "CH",            2,    -3.04,
  "D-maltoseCH",    "CH",            2,     1.08,
  "pectinCH",       "CH",            2,     2.09,
  "raffinoseCH",    "CH",            2,     2.93
)

Upon modifying the values, it seems that the issue might not stem from the clustering after all, so maybe it is related to the names?

> factoring and grouping (rather the other way around)

I think this is what I did in example2 above, or what do you mean? Indeed, in that case the behavior does not occur. When grouping by the variable as a character vector and then mutating it into a factor, the behavior still occurs, e.g.:

example %>% group_by(Compound_Class) %>% mutate(Compound_Class = as_factor(Compound_Class)) %>% heatmap(.row = Compound_Name, .col = col, .value = log2fc)

stemangiola commented 1 year ago

Puzzling.. Right now, I don't have the throughput to debug the function. I will put it on the to-do list. If you happen to want to give it a shot, you might be able to fix the bug in a short time and become part of the tidy* family ;)

micwij commented 1 year ago

Small addition: I just removed the "D-" and "L-" from "D-maltoseCH" and "L-homoserineAA" and indeed the behavior does not appear. Hope this info helps in finding the issue.
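
The renaming described above could be done roughly as follows. This is a minimal sketch, assuming the stringr package and a one-column copy of the example data from this thread; the regular expression and the name `example_stripped` are illustrative and not taken from the original report.

```r
library(dplyr)
library(stringr)
library(tibble)

# Reduced copy of the example from this thread (first column only).
example <- tribble(
  ~Compound_Name,   ~Compound_Class, ~col, ~log2fc,
  "L-homoserineAA", "AA",            1,     2.93,
  "cellobioseCH",   "CH",            1,     2.09,
  "D-maltoseCH",    "CH",            1,     1.08,
  "pectinCH",       "CH",            1,    -3.04,
  "raffinoseCH",    "CH",            1,    -2.10
)

# Drop a leading "D-" or "L-" from each compound name (assumed pattern).
example_stripped <- example %>%
  mutate(Compound_Name = str_remove(Compound_Name, "^[DL]-"))

example_stripped$Compound_Name
```

Passing `example_stripped` to the same `group_by()` and `heatmap()` calls as before is then enough to compare the two plots.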

Of course, these are not the most common names in general, but they are quite common in metabolomics, and I could imagine similar names for e.g. cell lines or strains, so I think this is still worth looking into.

micwij commented 1 year ago

> Puzzling.. Right now, I don't have the throughput to debug the function. I will put it on the to-do list. If you happen to want to give it a shot, you might be able to fix the bug in a short time and become part of the tidy* family ;)

Sure. No worries and no hurry! I might try to look into it but I am not sure if I am experienced enough to solve it. I will report it here if I find anything.

AleksZakirov commented 6 months ago

I can confirm that the issue can be fixed by converting the variable into a factor. I tried replacing all dots, spaces and dash characters with underscores, thinking that it could somehow be related to that, but this made no difference. But converting to factor works for now.
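
For reference, the two approaches mentioned here might look like this. It is a sketch only, assuming stringr and forcats and using three names from the example earlier in the thread; the character class and variable names are illustrative.

```r
library(stringr)
library(forcats)

compound_names <- c("L-homoserineAA", "cellobioseCH", "D-maltoseCH")

# Attempt reported above as making no difference:
# replace every dot, space, or dash with an underscore.
sanitised <- str_replace_all(compound_names, "[. -]", "_")

# Workaround reported above as effective: convert to a factor.
# as_factor() keeps the levels in order of first appearance.
as_factored <- as_factor(compound_names)

sanitised
levels(as_factored)
```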

stemangiola commented 6 months ago

Can you please send me the list of variables, in their simplest form, where they fail if not transformed into factors? This bit puzzles me a lot.

If you can get them in the simplest form, and in the smallest number where the error still appears, we might be able to identify the cause. We need to fix this.
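
One way to narrow down a minimal set of names might be to look at which names change position under different sort orders. The sketch below is only a diagnostic aid in base R. Whether sort order has anything to do with this bug is an assumption that has not been confirmed anywhere in this thread, and the variable names are illustrative.

```r
compound_names <- c(
  "L-homoserineAA", "cellobioseCH", "D-maltoseCH", "pectinCH", "raffinoseCH"
)

# C-locale (byte-wise) sort: uppercase letters sort before lowercase ones.
c_sorted <- sort(compound_names, method = "radix")

# Locale-aware sort: the result depends on the session's collation settings.
locale_sorted <- sort(compound_names)

# Names whose position differs between the two orderings
# (empty if the session itself uses the C locale).
differing <- c_sorted[c_sorted != locale_sorted]

c_sorted
locale_sorted
differing
```

If `differing` is non-empty, those names would be natural candidates for a reduced example.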

stemangiola commented 6 months ago

Hello all, thanks for bringing this to our attention. We will have a dedicated person for tidyomics who will also maintain tidyHeatmap.

Hopefully, this will happen soon.

stemangiola commented 5 months ago

on it..

stemangiola commented 5 months ago

> I can confirm that the issue can be fixed by converting the variable into a factor. I tried replacing all dots, spaces and dash characters with underscores, thinking that it could somehow be related to that, but this made no difference. But converting to factor works for now.

Just to clarify: I fixed this by converting the row names into a factor. I am still going to fix the underlying problem, though.

stemangiola / tidyHeatmap

Clustering order supersedes manual groups #116