jaspershen-lab / ipop_aging

Code for aging project
MIT License

Efficiency of calculating optimize_loess_span for new_expression_data #5

Open domjacri opened 2 months ago

domjacri commented 2 months ago

There is a chunk of code (below) that appears many times in the repository. It appears to be one of the slower steps in reproducing the data. Is there any reason why we could not split the features into subsets (1000 at a time?) and process them in parallel? I am considering doing this for a dataset of my own with whole-transcriptome data (upwards of 20k features).

new_expression_data <-
  vector(mode = "list", length = nrow(variable_info))

for (i in seq_along(variable_info$variable_id)) {
  temp_variable_id <- variable_info$variable_id[i]
  cat(i, " ")
  temp_data <-
    data.frame(value = as.numeric(expression_data[temp_variable_id,]),
               sample_info)

  optimize_span <-
    optimize_loess_span(
      x = temp_data$adjusted_age,
      y = temp_data$value,
      span_range = c(0.4, 0.5, 0.6)
    )

  span <-
    optimize_span[[1]]$span[which.min(optimize_span[[1]]$rmse)]

  value <- temp_data$value
  adjusted_age <- temp_data$adjusted_age

  ls_reg <-
    loess(value ~ adjusted_age,
          span = span)

  prediction_value =
    predict(ls_reg,
            newdata = data.frame(adjusted_age = seq(30, 75, by = 0.5)))
  new_expression_data[[i]] <- as.numeric(prediction_value)
}

jaspershen commented 2 months ago

Thank you. I didn't optimize this, but you can do it in parallel, I think. You could divide your features into several blocks and then do them simultaneously.
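
The block-wise approach suggested above could look roughly like the sketch below. It is self-contained and runs on synthetic data, so several things in it are assumptions rather than the repository's code: `pick_span_cv()` is a hypothetical stand-in for `optimize_loess_span()` (whose internals are not shown in this thread), `fit_one_feature()` and `fit_block()` are illustrative helper names, and the block size and worker count are arbitrary. It uses a PSOCK cluster from base R's `parallel` package, and each block receives only its own rows of `expression_data`.

```r
library(parallel)

# Synthetic stand-ins for the objects used in the issue (shapes are assumed).
n_samples <- 60
n_features <- 40
set.seed(1)
sample_info <- data.frame(
  sample_id = paste0("s", seq_len(n_samples)),
  adjusted_age = seq(25, 80, length.out = n_samples)
)
expression_data <- matrix(
  rnorm(n_features * n_samples),
  nrow = n_features,
  dimnames = list(paste0("f", seq_len(n_features)), sample_info$sample_id)
)
variable_info <- data.frame(variable_id = rownames(expression_data))
age_grid <- seq(30, 75, by = 0.5)

# Hypothetical stand-in for optimize_loess_span(): returns the span with the
# lowest k-fold cross-validated RMSE. Not the repository's implementation.
pick_span_cv <- function(age, value, span_range = c(0.4, 0.5, 0.6), k = 5) {
  fold <- rep_len(seq_len(k), length(age))
  rmse <- vapply(span_range, function(s) {
    sq_err <- unlist(lapply(seq_len(k), function(f) {
      train <- data.frame(age = age[fold != f], value = value[fold != f])
      fit <- loess(value ~ age, data = train, span = s)
      pred <- predict(fit, newdata = data.frame(age = age[fold == f]))
      (value[fold == f] - pred)^2
    }))
    # Held-out points outside the training range predict as NA; ignore them.
    sqrt(mean(sq_err, na.rm = TRUE))
  }, numeric(1))
  span_range[which.min(rmse)]
}

# Fit one feature and predict on the age grid (mirrors the loop body above).
fit_one_feature <- function(value, age, grid, pick_span) {
  span <- pick_span(age, value)
  fit <- loess(value ~ age, data = data.frame(age = age, value = value),
               span = span)
  as.numeric(predict(fit, newdata = data.frame(age = grid)))
}

# Process every row of one block; all inputs arrive as arguments, so nothing
# has to be exported to the workers separately.
fit_block <- function(block, age, grid, fit_one, pick_span) {
  lapply(seq_len(nrow(block)), function(i) {
    fit_one(as.numeric(block[i, ]), age, grid, pick_span)
  })
}

# Split feature indices into blocks (10 here; ~1000 for a real dataset).
block_size <- 10
feature_idx <- seq_len(nrow(variable_info))
blocks <- split(feature_idx, ceiling(feature_idx / block_size))
block_data <- lapply(blocks, function(idx) {
  expression_data[variable_info$variable_id[idx], , drop = FALSE]
})

n_workers <- 2  # arbitrary; adjust to the machine
cl <- makeCluster(n_workers)
block_results <- parLapply(
  cl, block_data, fit_block,
  age = sample_info$adjusted_age, grid = age_grid,
  fit_one = fit_one_feature, pick_span = pick_span_cv
)
stopCluster(cl)

# Flatten back to one numeric vector per feature, in the original order.
new_expression_data <- unlist(block_results, recursive = FALSE, use.names = FALSE)
names(new_expression_data) <- variable_info$variable_id
```

Because each feature is fitted independently, the result should match what the serial loop produces for the same span-selection function. On Linux or macOS, `parallel::mclapply()` over `block_data` would be a shorter alternative, though it relies on forking and does not run in parallel on Windows.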