nicolaroberts / hdp

R pkg for Hierarchical Dirichlet Process
Other
73 stars 31 forks source link

Can HDP Handle Mixed Input of Count Data and Binary Categorical Data? #7

Closed lincj1994 closed 4 weeks ago

lincj1994 commented 1 month ago

Hi @NickWilliamsSanger.

I would like to confirm if HDP can accept a data frame containing both count data and binary categorical data as input.

Here’s an example of my data:

Sample data:

> head(df[, 1:6])
        TMB    GeneA   GeneB    GeneC GeneD GeneE
ID1        9            1            0            0    0           0
ID2        2            0            0            0    1           0
ID3        4            1            0            0    0           0
ID4        4            1            0            0    0           0
ID5        4            0            0            1    0           0
ID6       10            0            1            0    0           0

My dataset consists of over 6,000 samples and 400 genes, structured similarly to the example above. However, after running the following code, the output resulted in only two categories:

n <- ncol(df)
genes <- colnames(df)    # ~ 400
nSample <- nrow(df)       # ~ 4k
shape <- rep(1, 2) 
invscale <-rep(1, 2)
ppindex <- c(0, rep(1, nSample))
cpindex <- c(1, rep(2, nSample))

hdp <- NULL
hdp <- hdp_init(ppindex, cpindex, hh=rep(1/n,n), alphaa=shape, alphab=invscale)

hdp <- hdp_setdata(hdp = hdp, 2:numdp(hdp), df)

burnin <- 30000
postsamples <- 4000
spacebw <- 20
cpsamples <- 10

chlist <- vector("list", 4)
for (i in 1:4){
  activated_hdp <- dp_activate(hdp, dpindex = 1:numdp(hdp), initcc = 2, seed = i*1e2)
  chlist[[i]] <- hdp_posterior(activated_hdp, 
                               burnin = burnin,
                               n = postsamples,
                               space = spacebw,
                               cpiter = cpsamples, 
                               seed = i*1e4)
}

output_multi <- hdp_multi_chain(chlist)
output_multi2 <- hdp_extract_components(output_multi)
output_multi2

Object of class hdpSampleMulti 
 Number of chains: 4 
 Total posterior samples: 16000 
 Components: YES. Prop of data explained = 0.942  Merged if cosine sim > 0.9 
 Component number: 1 
 ----------
 Final hdpState from first chain: 
Object of class hdpState 
 Number of DP nodes: 6506 
 Index of parent DP: 0 1 1 1 1 1 1 1 1 1 ...
 Number of data items per DP: 0 20 6 9 10 10 22 14 7 10 ...
 Index of conparam per DP: 1 2 2 2 2 2 2 2 2 2 ...
 Conparam hyperparameters and current value:
           Shape Rate     Value
Conparam 1     1    1 0.2044487
Conparam 2     1    1 3.2352546
 Number of data categories: 456 
 Number of clusters: 3 
 Initialised with 2 clusters, using random seed 100 

Is this behavior expected? Does HDP support this kind of mixed data input, or are there any code errors I should be aware of?

Thank you for your help!

NickWilliamsSanger commented 4 weeks ago

Hi @lincj1994

I think only counts data is supported. The usage of the "data" argument is specified in ?hdp_setdata as: "A data.frame or matrix of counts with one row for every sample (same order as dpindex) and one column for every data category."

Note, I've only made very superficial changes to "hdp" so I'm not really in a position to support it.

Nick