kordk / torch-ecpg

(GPU accelerated) eCpG mapper
BSD 3-Clause "New" or "Revised" License

Implement save chunking for MLR #13

Closed liamgd closed 1 year ago

liamgd commented 1 year ago

Currently, in the regression_full function that runs the MLR, the compute time per methylation site and gene site pair increases dramatically as the function runs. This is mainly due to the concatenation that appends each new row to the rest of the output dataframe. The concatenation runs once for each methylation and gene site pair, so the number of concatenations grows with the size of the input dataframes, and as the output dataframe grows, each concatenation becomes slower. Also, once the output dataframe is too large to hold in RAM, it must be stored on disk, which takes more time. To solve this, chunks of output can be saved separately, so that the output dataframe for any one chunk never exceeds memory limits.
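A minimal sketch of the idea, not the code in torch-ecpg: rows are buffered in a plain list and each full chunk is written to its own file, so no growing table is ever concatenated. The function name save_in_chunks, the chunk_N.csv file naming, and the row fields are hypothetical, and the standard-library csv module stands in for dataframes to keep the example dependency-free.

```python
import csv
import os


def save_in_chunks(rows, output_dir, chunk_size):
    """Write rows (dicts with identical keys) to numbered CSV files.

    Each file holds at most chunk_size rows, so only one chunk is
    kept in memory at a time. Returns the list of written paths.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    buffer = []

    def flush():
        # Nothing to write if the buffer is empty.
        if not buffer:
            return
        path = os.path.join(output_dir, f"chunk_{len(paths)}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(buffer[0].keys()))
            writer.writeheader()
            writer.writerows(buffer)
        paths.append(path)
        buffer.clear()

    for row in rows:
        # Appending to a list is cheap, unlike concatenating onto a
        # large table for every new row.
        buffer.append(row)
        if len(buffer) >= chunk_size:
            flush()

    # Write any remaining rows that did not fill a whole chunk.
    flush()
    return paths
```

Passing a generator as rows keeps the full result set from ever being materialized. The chunk size would need tuning against available memory.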

liamgd commented 1 year ago

Implemented in 72b7808.