slowkow / harmonypy

🎼 Integrate multiple high-dimensional datasets with fuzzy k-means and locally linear adjustments.
https://portals.broadinstitute.org/harmony/
GNU General Public License v3.0

Running multiple instances of the same variable in vars_use. #16

Closed FFB-Lab closed 2 years ago

FFB-Lab commented 2 years ago

Hello, I have noticed that if I list the same .obs column more than once in vars_use, I obtain different results, as if that column were being corrected for multiple times. The order of the entries also affects the results. For example:

Also, note that this is not the same as running harmony several times for the same variable, where each separate run removes the top PC from the observations. E.g.:

My questions are: what is harmonypy actually doing when I add multiple instances of the same variable? Also, why is the top PC removed on each successive harmonypy run?

slowkow commented 2 years ago

We should never set vars_use = ['batch', 'batch'].

Please consider sharing a reproducible example if you have questions about your own code.

FFB-Lab commented 2 years ago

Thank you for the quick answer. However, my questions aren't really about my own code; they are about how harmonypy works in general. In practical terms, what does passing the same variable twice in vars_use do? I do observe some (arguably positive) changes, and from looking at the source code in harmony.py I don't really understand why.

Thank you in advance.

slowkow commented 2 years ago

Have you tried reading the code and running each line, one by one?

If you run this line, you might understand why we should not be using vars_use = ['batch', 'batch']:

    phi = pd.get_dummies(meta_data[vars_use]).to_numpy().T

Don't forget to check out the documentation pages for functions like get_dummies().
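
To make the hint concrete, here is a minimal sketch of what that line returns when a variable is listed twice. The two-batch `meta_data` table and the `design` helper are hypothetical, made up for illustration; only the `pd.get_dummies(...).to_numpy().T` expression comes from the line quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata: six cells from two batches (not real data).
meta_data = pd.DataFrame({"batch": ["a", "a", "a", "b", "b", "b"]})

def design(vars_use):
    # Same expression as the quoted line; cast to float so the result
    # looks the same regardless of the dummy dtype pandas chooses.
    return pd.get_dummies(meta_data[vars_use]).to_numpy().T.astype(float)

phi_once = design(["batch"])            # one row per batch level
phi_twice = design(["batch", "batch"])  # every row appears twice

print(phi_once.shape, phi_twice.shape)
# Each cell belongs to exactly one batch level, so columns of phi_once sum to 1.
print(phi_once.sum(axis=0))
# With the duplicated variable, each cell is counted twice.
print(phi_twice.sum(axis=0))
```

With the duplicate entry, the one-hot matrix has twice as many rows and every cell's column sums to 2 instead of 1. Any later step that sums over or regresses on the rows of `phi` would presumably see each batch level twice, which is consistent with results changing rather than staying the same.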

FFB-Lab commented 2 years ago

I have run that line, and all I get is duplicate dummy variable entries for those variables. Unfortunately, I do not have the time to go over every single line to see how and why it affects the output, which is why I came here to ask what happens when I do this: I am seeing changes in the result, whereas I would expect none. I am sorry for taking up your time, but could you please clarify?

slowkow commented 2 years ago

This walkthrough might be helpful to build an understanding of each step in the Harmony algorithm: https://portals.broadinstitute.org/harmony/advanced.html

We shouldn't be losing PCs after running Harmony, but I can't help you with your code without seeing a reproducible example.

Regarding giving the same variable multiple times, consider the linear model:

    lm(Sepal.Width ~ 1 + Sepal.Length + Sepal.Length, iris)

Does it make sense to give the same variable twice in a linear regression? If not, then why would we give the same variable twice for Harmony?
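
The regression analogy can be sketched numerically. The example below uses synthetic data and a hand-written `ridge_fit` helper (both hypothetical, not part of harmonypy), and it assumes that a ridge-style penalized regression is a fair stand-in for the correction step. A duplicated column adds no information to the design matrix, yet under a penalty it still changes the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=n)  # synthetic response

X_once = np.column_stack([np.ones(n), x])       # intercept + x
X_twice = np.column_stack([np.ones(n), x, x])   # intercept + x + x again

# The extra column is redundant: three columns, but still rank 2.
rank_once = np.linalg.matrix_rank(X_once)
rank_twice = np.linalg.matrix_rank(X_twice)

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept (illustrative helper)."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    beta = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    return X @ beta

fit_once = ridge_fit(X_once, y, lam=10.0)
fit_twice = ridge_fit(X_twice, y, lam=10.0)

# The duplicate splits the coefficient across two columns, which weakens
# the effective penalty, so the fitted values are no longer identical.
max_diff = np.max(np.abs(fit_once - fit_twice))
print(rank_once, rank_twice, max_diff > 0)
```

In an unpenalized least-squares fit the duplicate would be harmless but meaningless. With a penalty, splitting the coefficient across two identical columns halves the effective shrinkage, so listing a variable twice quietly changes the model instead of being ignored.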