grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation

Running in parallel or multithreading option #290

Open susheelbhanu opened 4 years ago

susheelbhanu commented 4 years ago

Hey there..

Thanks for the nice tool. Is there a way to run the final matrix plot in a parallel or multi-threaded manner? I have the following table with a lot of rows (1,248,624):

> print(obj$data$diff_table)
# A tibble: 1,248,624 x 7
   taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
   <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
 1 aab      gp_1_Early  gp_2_Early              0         0        0.0278            0.803
 2 aac      gp_1_Early  gp_2_Early              0         0        0               NaN    
 3 aad      gp_1_Early  gp_2_Early              0.517     0.277    0.206             0.183
 4 aae      gp_1_Early  gp_2_Early              1.12      0.0107  -0.0715            0.832
 5 aaf      gp_1_Early  gp_2_Early              0         0       -0.000119          0.516
 6 aag      gp_1_Early  gp_2_Early              0         0       -0.00519           0.191
 7 aah      gp_1_Early  gp_2_Early              0         0       -0.00531           0.167
 8 aai      gp_1_Early  gp_2_Early              0         0       -0.0452            0.146
 9 aaj      gp_1_Early  gp_2_Early              0.721     0.00421 -0.0647            0.964
10 aak      gp_1_Early  gp_2_Early           -Inf        -0.0300   0.0313            0.256
# … with 1,248,614 more rows

I'm trying to plot the final figure using the code below:

heat_tree_matrix(obj,
                 data = "diff_table",
                 node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
                 node_label = taxon_names,
                 node_color = log2_median_ratio, # A column from `obj$data$diff_table`
                 node_color_range = diverging_palette(), # The built-in palette for diverging data
                 node_color_trans = "linear", # The default is scaled by circle area
                 node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                 edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                 node_size_axis_label = "Number of ASVs",
                 node_color_axis_label = "Log2 ratio median proportions",
                 layout = "davidson-harel", # The primary layout algorithm
                 initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
                 output_file = "differential_heat_tree.pdf", # Saves the plot as a pdf file      
                 key_size = 0.4, # adjust the size of the "key" tree with respect to the plot
                 row_label_size = 16, col_label_size = 16,
                 node_label_size_range = c(0.01,0.05) # node font size, see here:https://github.com/grunwaldlab/metacoder/issues/245
                  )

It's been running for 69h 51m so far with no end in sight, so a multi-threading option might help with the run time.

Thanks a lot, Susheel

zachary-foster commented 4 years ago

Hello, sorry for the delay. There is no such option, but it would be a useful addition. I will look into it when I get a chance. 70h sounds really long, even for a large dataset. I would recommend the following ways to optimize it:

susheelbhanu commented 4 years ago

Hey @zachary-foster, thanks a lot for the pointers. Yeah, it did run for longer than that and eventually finished, so all in all a good exercise. I did trim a lot of them out, but it was one of those datasets with a lot of samples and groups to begin with.

Will certainly try the multiple plots option. It was mostly running in RAM, so that was not an issue. Given the nearly 1 million lines to parse, I figured that must have been the case.

While I'm here, though: it would also be helpful to have a way to adjust the legend outside of the plotting function itself, especially w.r.t. font sizes etc. If I wanted to adjust the legend size, I'd have to run the full plot and wait to see whether it worked, right? Or is there an alternative?

Thanks again!
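
For anyone landing here with a similarly large `diff_table`: a minimal base-R sketch of the kind of trimming mentioned above, which is not necessarily what was recommended in this thread. It counts how many treatment pairs the table contains (my understanding is that each pair becomes one tree in the matrix) and keeps only the rows for a chosen subset of treatments. The toy data frame stands in for `obj$data$diff_table`, and the `keep` vector is a hypothetical choice; nothing here calls metacoder itself.

```r
# Toy stand-in for obj$data$diff_table (the real table has ~1.2 million rows).
diff_table <- data.frame(
  taxon_id          = rep(c("aab", "aac", "aad"), times = 3),
  treatment_1       = rep(c("gp_1_Early", "gp_1_Early", "gp_2_Early"), each = 3),
  treatment_2       = rep(c("gp_2_Early", "gp_3_Early", "gp_3_Early"), each = 3),
  log2_median_ratio = c(0, 0, 0.517, 1.12, 0, 0, 0, 0.721, -Inf),
  stringsAsFactors  = FALSE
)

# Count the unique treatment pairs; assumed to equal the number of
# pairwise trees the matrix plot has to draw.
pairs   <- unique(diff_table[, c("treatment_1", "treatment_2")])
n_pairs <- nrow(pairs)

# Hypothetical subset of treatments to keep for one smaller plot.
keep <- c("gp_1_Early", "gp_2_Early")

# Keep only comparisons where both treatments are in the subset.
small_table <- diff_table[diff_table$treatment_1 %in% keep &
                          diff_table$treatment_2 %in% keep, ]

print(n_pairs)
print(nrow(small_table))
```

The reduced table would then presumably be stored back in `obj$data` under a new name and passed through the `data =` argument, in the same way `"diff_table"` is passed in the call above; I have not verified that step against the package.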