FFB-Lab closed this issue 2 years ago
We should never set vars_use = ['batch', 'batch'].
Please consider sharing a reproducible example if you have questions about your own code.
Thank you for the quick answer. However, my questions aren't necessarily about my own code; they are more about how harmonypy works in general. In practical terms, what does passing the same variable twice in vars_use do? I do observe some (arguably positive) changes, and after looking at the source code in harmony.py I don't really understand why.
Thank you in advance.
Have you tried reading the code and running each line, one by one?
If you run this line, you might understand why we should not be using vars_use = ['batch', 'batch']:
phi = pd.get_dummies(meta_data[vars_use]).to_numpy().T
Don't forget to check out the documentation pages for functions like get_dummies().
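To make the effect of that line concrete, here is a minimal sketch on a toy table. The `meta_data` frame, its values, and the `design` helper are hypothetical stand-ins, not harmonypy code; only the `pd.get_dummies(...).to_numpy().T` expression is taken from the comment above.

```python
import pandas as pd

# Toy metadata (hypothetical): four cells from two batches.
meta_data = pd.DataFrame({"batch": ["A", "A", "B", "B"]})


def design(vars_use):
    """Build the one-hot matrix with the same expression quoted in the thread."""
    return pd.get_dummies(meta_data[vars_use]).to_numpy().T


phi_once = design(["batch"])
phi_twice = design(["batch", "batch"])

# Compare the number of rows (indicator variables) in each matrix.
print(phi_once.shape)
print(phi_twice.shape)

# Check whether the extra rows are exact copies of the first block.
rows = phi_once.shape[0]
print((phi_twice[:rows] == phi_twice[rows:]).all())
```

With `['batch', 'batch']` the matrix should contain every batch indicator twice. Any downstream step that sums or penalizes over the rows of `phi` may then weight the batch variable more heavily. Whether that is what drives the differences reported in this thread is an assumption, not something confirmed here.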
I have run that line, and all I get is duplicate dummy variable entries for those variables. Unfortunately, I do not have the time to go over every single line to see how and why it affects the output. That is why I came here to ask what happens when I do this: I am seeing changes in the result, whereas I would expect none. I am sorry for wasting your precious time, but could you please clarify?
This walkthrough might be helpful to build an understanding of each step in the Harmony algorithm: https://portals.broadinstitute.org/harmony/advanced.html
We shouldn't be losing PCs after running Harmony, but I can't help you with your code without seeing a reproducible example.
Regarding giving the same variable multiple times, consider the linear model:
lm(Sepal.Width ~ 1 + Sepal.Length + Sepal.Length, iris)
Does it make sense to give the same variable twice in a linear regression? If not, then why would we give the same variable twice for Harmony?
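The regression analogy can be checked numerically. The sketch below uses NumPy on simulated data; all names and values are illustrative, and it is not harmonypy's implementation. It compares a plain least-squares fit with a ridge-penalized fit. In ordinary least squares a duplicated column adds no information, so the fitted values should not change. Under a ridge penalty the duplicate behaves like a weaker penalty on that variable, which is one plausible reason results could change when a variable is repeated in a penalized method.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

X_single = x[:, None]               # the covariate given once
X_double = np.column_stack([x, x])  # the same covariate given twice

# Ordinary least squares: the duplicated design is rank deficient,
# and lstsq returns the minimum-norm solution for it.
rank_double = np.linalg.matrix_rank(X_double)
fit_single = X_single @ np.linalg.lstsq(X_single, y, rcond=None)[0]
fit_double = X_double @ np.linalg.lstsq(X_double, y, rcond=None)[0]


def ridge_fit(X, y, lam):
    """Fitted values of ridge regression with penalty lam on every column."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ beta


lam = 10.0
ridge_single = ridge_fit(X_single, y, lam)
ridge_double = ridge_fit(X_double, y, lam)
ridge_half = ridge_fit(X_single, y, lam / 2)  # single column, half the penalty

print("rank of duplicated design:", rank_double)
print("OLS fits identical:", np.allclose(fit_single, fit_double))
print("ridge fits identical:", np.allclose(ridge_single, ridge_double))
print("duplicate equals half penalty:", np.allclose(ridge_double, ridge_half))
```

The last comparison follows from symmetry: the two identical columns share the coefficient equally, so the penalty on their combined effect is halved.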
Hello, I have noticed that if I list the same .obs column multiple times in vars_use, I obtain different results, as if that column were being corrected for multiple times. The order of the entries also affects the results. For example:
vars_use = ['phase', 'batch', 'batch', 'batch'] will seemingly correct for phase once and then for batch three times, whereas
vars_use = ['batch', 'batch', 'batch', 'phase'] will seemingly correct for batch three times and then for phase. The results of these two runs are different.
Also, note that this is not the same as running harmony several times for the same variable, in which each separate run will remove the top PC from the observations. E.g.:
My questions are: what is harmonypy actually doing when I add multiple instances of the same .obs column to vars_use? Also, why is the top PC removed on each successive harmonypy run?