HiCBricks offers user-friendly and efficient solutions for handling large high-resolution Hi-C datasets. The package provides an R/Bioconductor framework with the bricks to build more complex data analysis pipelines and algorithms.
Moving away from a multiple matrices in one hdf approach towards a single matrix in one hdf approach #9
Overview of changes included in this pull request for the HiCBricks development branch
This development branch was forked from the HiCBricks master branch, version 1.3.3 as per the HiCBricks DESCRIPTION file. The objective for this development branch was to create an interface to add and manipulate resolutions through HiCBricks. The current changes will be released as HiCBricks version 1.6.0, signalling the major changes in the package.
The implementation of this particular objective required circumventing some major flaws in the Brick data structures and the HiCBricks workflow. Primarily, these are as follows:
I/O time scales with the size of an HDF file. Therefore, we had to make sure this did not become a bottleneck.
This is also why resolutions were not implemented earlier.
A possible change in the Brick data structure would have been to store only the non-zero upper triangle, as is common in the field. Such an implementation would not allow the selection of diagonals or the complex selection of sub-matrices from Brick stores. The latter is still very much a goal.
To circumvent this issue and to try to remove the dependency between I/O time and HDF file size, I brought forward a planned future objective: storing each Hi-C data matrix as a single HDF file.
Store each Hi-C data matrix as a single HDF file.
Previously, we stored both the upper and lower triangles. Now, we store only the upper triangle (chr2 >= chr1) in each HDF file.
I have also implemented a new S4 class, BrickContainer, which gives users easy access to multiple resolutions without the slowdown that comes from accessing larger and larger HDF files.
Users can now make parallel calls to the HDF files, as each chromosome pair exists within its own HDF file.
Since we are only using the upper triangle, I had to rethink how the different matrix metrics are computed. Row metrics and column metrics are now computed separately.
This entire development has been complemented by a new set of functions, the BrickContainer set, which implements various functionalities.
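The upper-triangle layout and the split between row and column metrics can be illustrated with a small sketch. This is not HiCBricks code: it is plain Python, the function names and the file naming scheme are hypothetical, and simple sums stand in for whatever metrics the package actually computes. The assumption is that, once only the chr1-vs-chr2 block is stored, values describing chr2 bins have to come from the columns of that block, because no mirrored chr2-vs-chr1 block exists to read rows from.

```python
from itertools import combinations_with_replacement


def upper_triangle_pairs(chromosomes):
    """Return (chr1, chr2) pairs where chr2 is at or after chr1 in the given order."""
    return list(combinations_with_replacement(chromosomes, 2))


def pair_filename(chr1, chr2, resolution):
    """Hypothetical naming scheme: one file per chromosome pair and resolution."""
    return f"{chr1}_vs_{chr2}_{resolution}.brick"


def row_and_column_sums(block):
    """Compute row sums and column sums of a dense block separately."""
    row_sums = [sum(row) for row in block]
    col_sums = [sum(col) for col in zip(*block)]
    return row_sums, col_sums


chromosomes = ["chr1", "chr2", "chr3"]
pairs = upper_triangle_pairs(chromosomes)
files = [pair_filename(a, b, 100000) for a, b in pairs]

# Toy chr1-vs-chr2 block: 2 bins on chr1 (rows), 3 bins on chr2 (columns).
block = [
    [1, 0, 2],
    [0, 3, 4],
]
row_sums, col_sums = row_and_column_sums(block)
```

Because every pair maps to its own file in this sketch, each file can be handed to a separate worker without any shared state, which is the property that makes parallel calls straightforward.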
Other changes
Export to sparse format has been added.
The exec parameter has been removed. It was causing more problems than it was solving.
Sparse table loading functions have been added but are not exported, since this code has not been tested.
BrickParallel functions have been added, but given my unresolved questions regarding output formats, I have decided against exporting these functions.
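As a rough illustration of what a sparse export involves, the sketch below converts a dense block into (row, column, value) triples and writes them as tab-separated text. The three-column layout, the 1-based indices and the function names are assumptions made for this example; they are not taken from the HiCBricks export code.

```python
import io


def dense_to_sparse(block):
    """Return (row, column, value) triples for non-zero cells, using 1-based indices."""
    return [
        (i, j, value)
        for i, row in enumerate(block, start=1)
        for j, value in enumerate(row, start=1)
        if value != 0
    ]


def write_sparse(triples, handle, sep="\t"):
    """Write one triple per line to an open text handle."""
    for i, j, value in triples:
        handle.write(f"{i}{sep}{j}{sep}{value}\n")


block = [
    [5, 0, 2],
    [0, 7, 0],
]
triples = dense_to_sparse(block)

buffer = io.StringIO()
write_sparse(triples, buffer)
sparse_text = buffer.getvalue()
```

Only non-zero cells are written, so the output grows with the number of observed contacts rather than with the full matrix dimensions.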