reptalex / phylofactor


Choice between var and F #39

Closed chris-krohn closed 2 years ago

chris-krohn commented 2 years ago

Hi Alex, I really enjoy your package and even created a tutorial for students or anyone interested. https://chrismitbiz.github.io/ABlab-workflows/phylofactor.html#phylofactor

This is not an issue but more of a method question. My stats background is limited, although I understand the concepts and have done a lot of regression modelling. One of the questions I sometimes get is how to decide between choice = "var" or "F". In the tutorial you state: "The two default options for regression-based phylofactorization are choice='var', which maximizes the explained variance, and choice='F', which maximizes the F-statistic from regression (the ratio of explained to unexplained variance)."

Can you help to explain in simple terms, under what circumstances it is appropriate to choose either "var" or "F"?

Thank you! Cheers, Chris

reptalex commented 2 years ago

Hey Chris!

I'm touched to read your cool tutorial, and loved the figures where you overlaid taxonomies on the tree - so cool!

I tried to drop an image onto this comment box (fingers crossed it works!).

Big picture, choice="var" picks the lineage where the regression explains the most variance in the data, whereas choice="F" picks the lineage with the highest F-statistic, i.e. the highest ratio of explained to unexplained variance. I would use "var" to study differences in community composition (big changes in abundance!), whereas I always use "F" if I'm looking for bioindicator lineages (high signal/noise ratio).

In case the image below doesn't render, here's a visual of the difference between the kind of pattern picked up by "var" versus "F"

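library(data.table)
b = data.table(expand.grid('x'=rnorm(100),'stat'=c('var','F')))
# 'var' panel: steep slope with a lot of noise (large explained variance)
b[stat=='var',y:=3*x+2*rnorm(100)]
# 'F' panel: shallow slope with very little noise (high signal/noise ratio)
b[stat=='F',y:=x+.2*rnorm(100)]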

library(ggplot2)
ggplot(b,aes(x,y))+
  geom_point()+
  geom_smooth(method='glm')+
  facet_wrap(.~stat)

Under the hood, we'll run a glm y~x, and the function getStats will compute "ExplainedVar" (choice='var') and "F".

fit_var=glm(y~x,data=b[stat=='var'])
fit_F=glm(y~x,data=b[stat=='F'])
phylofactor::getStats(fit_var)
phylofactor::getStats(fit_F)
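If you want to see the two quantities without phylofactor installed, here's a rough base-R sketch of the same idea. It is not how getStats is implemented; it assumes "explained variance" means the explained sum of squares from the regression and "F" means the usual regression F-statistic. The helper name summarise_fit and the seed are made up for the illustration.

```r
set.seed(1)  # arbitrary seed, only so the illustration is reproducible
n <- 100
x <- rnorm(n)

# 'var'-like pattern: steep slope, lots of noise
y_var <- 3 * x + 2 * rnorm(n)
# 'F'-like pattern: shallow slope, very little noise
y_F <- x + 0.2 * rnorm(n)

# Hypothetical helper (not part of phylofactor): explained sum of squares
# and F-statistic for a simple linear regression y ~ x
summarise_fit <- function(x, y) {
  fit <- lm(y ~ x)
  ss_resid <- sum(residuals(fit)^2)      # unexplained sum of squares
  ss_total <- sum((y - mean(y))^2)       # total sum of squares
  ss_expl <- ss_total - ss_resid         # explained sum of squares
  # F = (explained SS / 1 df) / (residual SS / residual df)
  f_stat <- (ss_expl / 1) / (ss_resid / df.residual(fit))
  c(explained = ss_expl, F = f_stat)
}

res_var <- summarise_fit(x, y_var)
res_F <- summarise_fit(x, y_F)
print(rbind(var_pattern = res_var, F_pattern = res_F))
```

The steep, noisy pattern should come out ahead on explained sum of squares, while the shallow, tight pattern should come out ahead on F, which is the trade-off between the two choices.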

image

chris-krohn commented 2 years ago

Thanks Alex. Glad you like it :)

And thanks for the explanation and image. That really helped.