lamho86 / phylolm

GNU General Public License v2.0
30 stars 12 forks source link

Error running phyloglm, number of species #18

Closed diogoprov closed 5 years ago

diogoprov commented 5 years ago

Dear @lamho86 and @cecileane , I'm trying to use phylolm::phyloglmto run a phylogenetic logistic regression since yesterday, but I keep getting these error messages saying that the number of rows (species) in the data and in the phylogeny don't match. However, they do have the same number of species, and I was careful enough to get the rows of the data frame in the same order of the tip.labels, to match the function example. I really don't know what can be wrong. Please, see below the code chunk:

> phylogeny

Phylogenetic tree with 800 tips and 799 internal nodes.

Tip labels:
    Neoromicia_roseveari, Neoromicia_brunneus, Neoromicia_tenuipinnis, Neoromicia_rendalli, Neoromicia_nanus, Laephotis_wintoni, ...

Rooted; includes branch lengths.

> dim(dados)
[1] 800  11

> head(dados)
                                Familia Presenca.ecolocalizacao Ref.ecol
Neoromicia_roseveari   Vespertilionidae                     SIM       NA
Neoromicia_brunneus    Vespertilionidae                     SIM       NA
Neoromicia_tenuipinnis Vespertilionidae                     SIM       NA
Neoromicia_rendalli    Vespertilionidae                     SIM       NA
Neoromicia_nanus       Vespertilionidae                     SIM       NA
Laephotis_wintoni      Vespertilionidae                     SIM       NA
                       Atvidade.Circadiana             Ref.Atv
Neoromicia_roseveari                  <NA>                <NA>
Neoromicia_brunneus                NOTURNO                <NA>
Neoromicia_tenuipinnis             NOTURNO EltonTraitsDataBase
Neoromicia_rendalli                NOTURNO EltonTraitsDataBase
Neoromicia_nanus                   NOTURNO                <NA>
Laephotis_wintoni                  NOTURNO EltonTraitsDataBase
                            Dieta            Ref.Diet
Neoromicia_roseveari   INSETIVORO                <NA>
Neoromicia_brunneus    INSETIVORO                <NA>
Neoromicia_tenuipinnis INSETIVORO EltonTraitsDataBase
Neoromicia_rendalli    INSETIVORO EltonTraitsDataBase
Neoromicia_nanus       INSETIVORO                <NA>
Laephotis_wintoni      INSETIVORO EltonTraitsDataBase
                       Tipo.ecolocalizacao Ref.tipoEco Diet_frug_disc
Neoromicia_roseveari               LARINGE          NA   NAOFRUGIVORO
Neoromicia_brunneus                LARINGE          NA   NAOFRUGIVORO
Neoromicia_tenuipinnis             LARINGE          NA   NAOFRUGIVORO
Neoromicia_rendalli                LARINGE          NA   NAOFRUGIVORO
Neoromicia_nanus                   LARINGE          NA   NAOFRUGIVORO
Laephotis_wintoni                  LARINGE          NA   NAOFRUGIVORO
                       Ati_circad_disc
Neoromicia_roseveari              <NA>
Neoromicia_brunneus            NOTURNO
Neoromicia_tenuipinnis         NOTURNO
Neoromicia_rendalli            NOTURNO
Neoromicia_nanus               NOTURNO
Laephotis_wintoni              NOTURNO

> eco_bat <- phyloglm(Presenca.ecolocalizacao~Diet_frug_disc, data=dados, phy=phylogeny, method = "logistic_IG10", boot=100)
Error in phyloglm(Presenca.ecolocalizacao ~ Diet_frug_disc, data = dados,  : 
  the number of rows in the data does not match the number of tips in the tree.

Thank you in advance Merci beaucoup pour votre attention

Diogo

diogoprov commented 5 years ago

P.S. My phylogeny is not ultrametric

> is.ultrametric(tre)
[1] FALSE

But the one from the example isn't either. I really don't know what's happening.

cecileane commented 5 years ago

It might be from some non-ASCII character that creeped into a taxon name, either in the trait data, or in the tree. The difference could be invisible by eye, but R would still see it...

cecileane commented 5 years ago

Hi Diogo, please feel free to send your files offline (tree file and and part of trait file --just 1 column, say). I suspect that it's caused by space or non-ASCII characters, but I can't really help without being able to reproduce the error. Another possibility is that some species name is spelled differently in the tree and in the data file. To check, you could do this:

setdiff(rownames(dados), phylogeny$tip.label)

to see which taxa have trait data but are not in your tree.
And to see which taxa are in your tree but are not in your trait data:

setdiff(phylogeny$tip.label, rownames(dados))

Regardless of where the problem comes from, knowing which taxa do not match could be helpful.

diogoprov commented 5 years ago

Hi @cecileane , I'm sorry for getting so late to this. Thank you very much for the suggestion. I'll try it here and if it doesn't work properly, I'll send you a shortened version of the data and the code I'm using.

All the best, Diogo

diogoprov commented 5 years ago

Yes, we just checked for any difference in species names in the phylogeny and trait data table, but everything looked fine. The code

setdiff(phylogeny$tip.label, rownames(dados))

returned

NULL

We did other analysis with the same dataset (e.g., ancestral state estimation) and nothing happened. We'll arrange the table to send you a shortened version of it.

Thanks again

cecileane commented 5 years ago

this is strange: getting NULL suggests that the first vector (or both vectors) is (are) empty. examples:

> setdiff(c(1,4,5), c(5,1,4,3))
numeric(0)
> setdiff(c(), c())
NULL
> setdiff(NULL, NULL)
NULL
> setdiff(c(1), NULL)
[1] 1
> setdiff(NULL, c(1))
NULL
cecileane commented 5 years ago

Thank you @diogoprov for diagnosing this: the problem was due to some taxa missing data for some of the predictors. The entire data frame has the same taxa (as row names) as in the tree, but after ignoring taxa that were missing some data (response or predictor), the data frame reduced to taxa with no missing entry had a smaller taxon set than the tree. For others running into the same problem, here is a small example to see if this is an issue, and a quick fix.

first, set-up a small data set with no missing data, and everything goes well

tre1 = read.tree(text="(((((t19:0.952,t17:0.872):0.0803,(t18:0.605,(t16:0.74,t15:0.896):0.225):0.712):0.824,((t3:0.435,t5:0.575):0.515,((t11:0.362,(t2:0.749,t1:0.507):0.335):0.678,(t12:0.76,t10:0.964):0.862):0.741):0.294):0.119,((t4:0.267,(t14:0.196,t6:0.0196):0.176):0.837,t7:0.528):0.834):0.231,(((t8:0.00299,t13:0.361):0.591,t9:0.129):0.148,t20:0.0146):0.182);")
dat1 = data.frame(y = c(0,1,0,0,1,1,1,1,0,0,1,1,0,1,1,1,1,1,1,1),
                  x = c(0,1,0,0,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1))
rownames(dat1) = c("t19","t17","t18","t16","t15","t3","t5","t11","t2","t1","t12","t10","t4","t14","t6","t7","t8","t13","t9","t20")
fit = phyloglm(y~x, phy=tre1, data=dat1) # all good
summary(fit)                             # all good

next: mask some data points to introduce missing data.

dat2 = dat1
dat2$x[3] = NA # t18 will need to be ignored
phyloglm(y~x, phy=tre1, data=dat2)

which reproduces the error message that @diogoprov had:

Error in phyloglm(y ~ x, phy = tre2, data = dat2) : 
  the number of rows in the data does not match the number of tips in the tree.

Quick bug fix:

  1. first obtain the list of species that have one or more missing values
    mf = model.frame(x~y, data=dat2)
    # model.frame removes rows with missing data, converts factors to dummy variables, etc.
    taxa_without_data = setdiff(tre1$tip.label, rownames(mf))

    now taxa_without_data has the list of problematic taxa, that will need to be pruned off the tree:

    > taxa_without_data
    [1] "t18"
  2. second: take the problematic taxa off the tree.
    tre2 = tre1 # good if there are no missing data
    if (length(taxa_without_data)>0) {
      tre2 = drop.tip(tre1, taxa_without_data) # drop taxa with missing data from the tree
    }

    now phyloglm should run fine, if the input tree has the problematic taxa pruned off:

    phyloglm(y~x, phy=tre2, data=dat2)      # all good, even though dat2 still has a row for "t18"
    phyloglm(y~x, phy=tre2, data=dat2[-3,]) # all good, same results
diogoprov commented 5 years ago

Thanks a lot for confirming this.

Best, Diogo