Data: 300 samples, 10,000 gt rows, 10,000 mt rows
| Mode | Constant calculation (seconds) | Loop over methylation loci (seconds) |
| --- | --- | --- |
| CUDA | 1.3899 | 22.233 |
| CPU (1 thread) | 0.0876 | 216.3914 |
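For context, here is a minimal sketch of how wall-clock timings like these might be collected with torch. The helper name and tensor shapes are illustrative, not tecpg's actual code; the key detail is that `torch.cuda.synchronize()` is needed so asynchronous kernel launches are included in the measurement.

```python
import time
import torch

def time_fn(fn, device):
    """Time a callable, synchronizing around it if it runs on CUDA."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # flush pending kernels before timing
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for async kernels to finish
    return result, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Illustrative shapes only: 300 samples, 10,000 mt rows.
mt = torch.rand(10_000, 300, device=device)
_, seconds = time_fn(lambda: mt.mean(dim=1), device)
print(f"{device.type}: {seconds:.4f} s")
```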
Testing regression_full with line_profiler yields the following timings for 100 samples, 1,000 methylation loci, and 1,000 gene expression loci, without chunking or p-value filtration. The size of the input gives a variety of different analysis results, and p-value filtration would reduce the creation of the output dataframe, which is a major consumer of time.
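As a reminder of the line_profiler workflow (this sketch is not tecpg-specific; the function and shapes are stand-ins), per-line timings come from decorating the hot function and running the script under kernprof:

```python
# profile_demo.py -- run with: kernprof -l -v profile_demo.py
import torch

@profile  # noqa: F821 -- `profile` is injected into builtins by kernprof
def regression_demo():
    # Stand-in for regression_full's hot path; shapes are illustrative.
    x = torch.rand(100, 1_000)
    y = torch.rand(100, 1_000)
    xty = x.T @ y  # line_profiler reports time spent on each statement
    return xty

if __name__ == "__main__":
    regression_demo()
```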
By "The size of the input gives a variety of different analysis results", do you mean differences in performance (e.g., time to completion) and resources (e.g., memory or output file size) or the results of the analyses themselves (e.g., statistic and p-values)?
By that, I mean that the ratio between the times spent per operation changes a lot depending on the sample count and loci counts. For example, smaller inputs spend more time setting up the CUDA device and kernels, whereas a large input with strict p-value filtration spends more relative time creating the filtration indices and mask.
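For illustration, the filtration mask and indices mentioned here are themselves tensor operations whose cost scales with the output size. A minimal sketch of how such a mask might be built in torch (the shapes and threshold are illustrative, not tecpg's internals):

```python
import torch

p_thresh = 0.4  # illustrative threshold
# Hypothetical p-value matrix: one row per methylation locus,
# one column per gene expression locus (small shapes for the sketch).
p_values = torch.rand(1_000, 1_000)

mask = p_values < p_thresh   # boolean filtration mask
indices = mask.nonzero()     # (row, col) index pairs that pass the filter
filtered = p_values[mask]    # flat tensor of surviving p-values
print(indices.shape, filtered.shape)
```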
Tests of regression_full.py with dummy data of 300 samples, 10,000 methylation loci, and 10,000 gene expression loci, with no chunking and a p-value threshold of 0.4 (which filters all results, so none are saved). This is about 1/202.195 of the GTP dataset.
- CPU (i7-7700K, 4 threads): 120.8198 s
- GPU (RTX 2070 Super): 23.699 s
For small inputs, the CPU tends to perform better than the GPU, but at a certain point, the GPU becomes highly superior. I suspect that after a break-even point, the GPU becomes faster and faster relative to the CPU as the input size increases.
This test covered the computation only, not the saving, as the p-value threshold of 0.4 is higher than any mt p-value. It seems that the maximum mt p-value is around 3.9.
Thanks for the clarification. Once we've stabilized the code for the mlr and evaluated the reproducibility of the Kennedy 2018 analysis, we'll evaluate the performance. This work will include an evaluation of scaling performance (i.e., how quickly the analyses complete with different numbers of samples and loci).
Is the new mlr code now the default?
It is not yet the default, as it does not have region mapping. After that is complete, it will replace `tecpg run mlr`.
Now that output inclusion is controlled in regression_full (from the last few commits), it should have all of the major functionality of regression_single, and it can replace the `tecpg run mlr` command. Here are the minor changes as a result of this:

- `tecpg run mlr-full` is now `tecpg run mlr`.
- The old `tecpg run mlr` command is now `tecpg run mlr-single`.
- Instead of `--no-est`, `--no-err`, `--no-t`, and `--no-p` for controlling the type of regression results to be included in the output of `tecpg run mlr`, use `--p-only` or `-P` to include only p-values in the output. Otherwise, all regression result types will be included.
- Instead of `--regressions-per-chunk` or `-r` to control the number of regressions per chunk, use `-l` or `--loci-per-chunk` to control the number of methylation loci included per chunk. For each methylation locus, all gene expression loci will be compared. For $l$ loci per chunk with $g$ total gene expression loci, $l \times g$ regressions will run per chunk, because regression_full operates on each methylation locus with all of the gene expression loci in parallel (see the sketch below).

regression_full is now the default as of 986619e.
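A minimal sketch of the chunking arithmetic above. The names and shapes are illustrative, not tecpg's internals, and a simple correlation stands in for the full per-pair regression; the point is that each chunk takes $l$ methylation loci and compares every one against all $g$ gene expression loci at once.

```python
import torch

samples, m, g = 300, 10_000, 10_000
loci_per_chunk = 500  # the -l / --loci-per-chunk value (illustrative)

mt = torch.rand(m, samples)  # methylation matrix (dummy data)
gt = torch.rand(g, samples)  # gene expression matrix (dummy data)

gt_c = gt - gt.mean(dim=1, keepdim=True)  # center once, reused per chunk
for start in range(0, m, loci_per_chunk):
    chunk = mt[start:start + loci_per_chunk]          # (l, samples)
    chunk_c = chunk - chunk.mean(dim=1, keepdim=True)
    # Broadcast: every methylation locus in the chunk against all g
    # expression loci in parallel -> l * g regressions per chunk.
    cov = chunk_c @ gt_c.T                            # (l, g)
    corr = cov / (chunk_c.norm(dim=1)[:, None] * gt_c.norm(dim=1)[None, :])
    # ... compute statistics/p-values and apply filtration here ...
```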
Excellent!
Given that users will have different GPU (and CPU) environments, and that chunking will be essential for users to run the analyses, can we provide guidelines for selecting the chunk size based on any one or a combination of characteristics (e.g., number of samples, number of methylation loci, number of gx loci)? We don't need to be precise; rather, just a general suggestion as a place for them to start and adapt to their own dataset.
Yes. The most important purpose of chunking is to avoid running out of memory. On CUDA GPUs, torch raises `RuntimeError: CUDA out of memory` if a chunk is too large.
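For reference, the memory figures that such chunking decisions work from can be queried directly from torch:

```python
import torch

# Query CUDA memory state (real torch APIs; requires a CUDA device).
if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory  # bytes on device
    allocated = torch.cuda.memory_allocated(0)  # bytes held by torch tensors
    reserved = torch.cuda.memory_reserved(0)    # bytes in torch's caching allocator
    target = int(total * 0.8)  # e.g. the 80% default target mentioned below
    print(f"total={total} allocated={allocated} reserved={reserved} target={target}")
```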
The limitations of this algorithm are that the user needs to provide the filtration coefficient, and that it does not account for CPU memory usage outside of torch (such as from numpy or pandas).
Run `tecpg chunks` to get the maximum loci per chunk for a given target torch memory usage (default: 80% of total memory). Use `--filtration` to specify what portion of the data remains after region and p-value filtration, or use `-r false -p false` to specify that no region filtration or p-value filtration will occur. Use `-s [samples] -m [mt_count] -g [gt_count] -c 2` to estimate with that input size, or omit these options to use the size of the data in the current working directory. Use `-f [true/false]` and `-P [true/false]` to filter which output modes are included (full output and p-value only, respectively); omit these options to show estimates for all combinations of output modes.
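As a rough illustration only, the kind of back-of-the-envelope estimate that `tecpg chunks` automates might look like the following. The memory model here (bytes per locus and the overhead factor) is an assumption made for the sketch, not tecpg's actual formula:

```python
def max_loci_per_chunk(samples, gt_count, target_bytes,
                       bytes_per_value=4, overhead=3.0):
    """Assumed model (not tecpg's real formula): each methylation locus
    in a chunk produces intermediates of roughly gt_count * samples
    float32 values, padded by an `overhead` factor for temporaries."""
    per_locus = gt_count * samples * bytes_per_value * overhead
    return max(1, int(target_bytes // per_locus))

# 300 samples, 10,000 gene expression loci, 8 GB GPU at an 80% target:
print(max_loci_per_chunk(300, 10_000, int(8e9 * 0.8)))
```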
Optimize the multiple linear regression to reduce algorithm run time and to utilize the CUDA device better.