hackseq / 2017_project_5

Developing advanced R tutorials for genomic data analysis
https://hackseq.github.io/2017_project_5/
MIT License

Showing how to use R & memory-mapping to analyze data encoded as large matrices #4

Open privefl opened 7 years ago

privefl commented 7 years ago

For many types of genomic data, most of the information can be stored as matrices. The most striking example is SNP data, which can be stored as matrices with thousands to hundreds of thousands of rows (samples) and hundreds of thousands to tens of millions of columns (SNPs) (Bycroft et al. 2017). This results in datasets ranging from gigabytes to terabytes.
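To make these orders of magnitude concrete, here is a rough back-of-the-envelope calculation in base R. The dimensions below (500,000 samples and 800,000 SNPs) are illustrative assumptions, not figures taken from the reference, and the storage costs assume 1 byte per genotype versus 8 bytes per value for a standard R numeric matrix.

```r
# Hypothetical dimensions, chosen only for illustration
n_samples <- 5e5   # rows (samples)
n_snps    <- 8e5   # columns (SNPs)

# Storage cost per value: 1 byte if each genotype is coded on one byte,
# 8 bytes if stored as a standard R numeric (double) matrix
size_raw_gb    <- n_samples * n_snps * 1 / 1e9    # in gigabytes
size_double_tb <- n_samples * n_snps * 8 / 1e12   # in terabytes

cat("One byte per genotype:", size_raw_gb, "GB\n")
cat("Standard double matrix:", size_double_tb, "TB\n")
```

Either figure is far beyond the RAM of a typical workstation, which is why keeping the whole matrix in memory is not an option here.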

Other fields, such as proteomics or gene expression analysis, also use data stored as matrices that can be larger than the available memory.

To handle large datasets in R, we can use memory-mapping to access large matrices stored on disk instead of in RAM. This has been possible in R for several years thanks to the bigmemory package (Kane, Emerson, and Weston 2013).

More recently, two packages which use the same principle as bigmemory have been developed: bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). Package bigstatsr implements many statistical tools for several types of Filebacked Big Matrices (FBMs), making it usable for any type of genomic data that can be encoded as a matrix. The statistical tools in bigstatsr include implementations of multivariate sparse linear models, Principal Component Analysis (PCA), matrix operations, and numerical summaries. Package bigsnpr implements algorithms which are specific to the analysis of SNP arrays, building on features already implemented in bigstatsr.
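As a minimal sketch of what working with an FBM can look like, the example below creates a small file-backed matrix, computes column summaries, and runs a partial SVD for PCA. It assumes a reasonably recent version of bigstatsr is installed; the matrix dimensions and the random data are purely illustrative, and this is not necessarily how the tutorial itself proceeds.

```r
library(bigstatsr)

set.seed(1)
n <- 200  # number of rows (illustrative)
m <- 50   # number of columns (illustrative)

# Create a Filebacked Big Matrix: the data live in a binary file on disk
# (here a temporary file) rather than in RAM
X <- FBM(n, m, type = "double", backingfile = tempfile())

# Fill the matrix; on real data you would fill it block by block
X[] <- rnorm(n * m)

# Indexing reads only the requested part of the file into memory
X[1:3, 1:5]

# Numerical summaries: column sums and variances
stats <- big_colstats(X)
head(stats)

# PCA via a partial SVD, centering and scaling each column
obj_svd <- big_SVD(X, fun.scaling = big_scale(), k = 5)
scores <- predict(obj_svd)  # PC scores, one row per sample
dim(scores)
```

With a toy matrix this small there is no benefit over a standard R matrix; the point is that the same calls apply unchanged when the backing file is much larger than the available RAM.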

In this small tutorial, we’ll see the potential benefits of using memory-mapping instead of standard R matrices in memory, by using bigstatsr and bigsnpr.


You can find the first version of the tutorial there.

zhenyisong commented 7 years ago

I think this should definitely include imaging data. R has interfaces for parsing image data in whatever format it comes. Imaging data from confocal equipment (or other modern microscopes) are huge and complex to process.

privefl commented 7 years ago

Do you have some example data? Is it stored as matrices?

zhenyisong commented 7 years ago

No. I plan to use R to process imaging data, including fMRI, in the near future. Imaging data are certainly matrices, and we can use our linear algebra knowledge to deal with them. But our current task does not seem to mention this type of analysis. And here is an interesting link.

zhenyisong commented 7 years ago

Great. I learned a lot from your elegant code. I know Paris from the work of an American chef, and I learned about social etiquette in Paris from his book, The Sweet Life in Paris.