hackseq / 2017_project_5

Developing advanced R tutorials for genomic data analysis
https://hackseq.github.io/2017_project_5/
MIT License

Integration of Genomic Analytic Results as a Multi-tab Excel Sheet with Package 'xlsx' #7

Open 5dPZ opened 7 years ago

5dPZ commented 7 years ago

What topic (e.g. R package) will you showcase?
Mostly the package 'xlsx'. I will show a way of integrating results from multiple genomic analyses or pipelines and producing a multi-tab excel sheet that stores all the data tables and figures in a single excel file (or looping the pipeline to generate multiple excel files).

Why do you think it's worthwhile learning about this topic/package?
It is common, at the end stage of a genomic study, to arrive at a list of, say, 50 genes of interest and ask: what do we know about these genes? What were the results of the SNV, expression, or pathway analyses I performed for each of these genes?
My goal is to generate a single excel file for each gene, containing all the analysis results for that gene (data tables, figures, literature text mining, etc.) as an integrated report.
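A rough sketch of the one-file-per-gene idea, not the author's actual script. The gene names, the two result tables, and `out_dir` are hypothetical placeholders; the only library call used is `write.xlsx()` from the xlsx package, whose `append = TRUE` option adds a new tab to an existing file.

```r
library(xlsx)  # needs a working Java installation (via rJava)

# Hypothetical results from two analyses, one row per gene
results <- list(
  snv        = data.frame(gene = c("GENE_A", "GENE_B"), n_variants = c(3, 1)),
  expression = data.frame(gene = c("GENE_A", "GENE_B"), log2_fc = c(1.8, -0.7))
)
genes   <- c("GENE_A", "GENE_B")
out_dir <- tempdir()

for (g in genes) {
  out_file <- file.path(out_dir, paste0(g, ".xlsx"))
  for (i in seq_along(results)) {
    tab <- results[[i]]
    write.xlsx(tab[tab$gene == g, ], out_file,
               sheetName = names(results)[i],
               row.names = FALSE,
               append = i > 1)  # first tab creates the file, later tabs are appended
  }
}
```

Each gene ends up with its own workbook containing one tab per analysis.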

It's important to state the motivation so it's clear to readers why they should read further.
What dataset will you use? Preferably, pick a dataset that is relatively small. If the dataset is large, you can subset it to make it smaller (e.g. subset on chromosomes 20-22). Try to leverage datasets that are used elsewhere in this set of tutorials. It would be nice if there was a permalink for downloading the dataset. If a custom dataset is created (e.g. subsetting an existing large dataset), it would be great if the custom version was hosted somewhere (e.g. FigShare) so you can provide a permalink.
I will be using the results generated by all the analyses we are doing, or a list of genes found significant in some of the analyses we showcased.

What software dependencies need to be installed? R packages are usually easy to install, so it's okay to install a few R packages. Other command-line tools might be harder to set up on certain systems (e.g. Windows), so try to limit the number of external tool dependencies.
Mostly the package 'xlsx'; currently, there is no very useful tutorial on this package on the web. The package 'animation' is sometimes needed when an R process automatically stores a figure in .PDF format, which has to be converted to .PNG before it can be imported into an excel sheet.

What will you cover in your tutorial? This roughly corresponds to an outline of what you will accomplish in your tutorial using the dataset you picked.
How to create a workbook and worksheets, and how to export results (data tables, figures, text) into the .xlsx file.
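The outline above could look roughly like this with the xlsx package. This is a minimal sketch under assumptions, not the tutorial's code: the gene, the two toy tables, and the file names are made up for illustration, and the figure is drawn directly to PNG so that no PDF conversion is needed.

```r
library(xlsx)  # needs a working Java installation (via rJava)

# Hypothetical toy results standing in for real analysis output
gene_info  <- data.frame(symbol = "GENE_A", chromosome = "20")
expression <- data.frame(sample = paste0("S", 1:6),
                         log2_expr = c(5.1, 4.8, 6.2, 5.9, 4.4, 5.5))

# Save a figure as PNG
plot_file <- tempfile(fileext = ".png")
png(plot_file, width = 600, height = 400)
barplot(expression$log2_expr, names.arg = expression$sample,
        ylab = "log2 expression")
invisible(dev.off())

# Workbook -> worksheets -> tables and picture -> file
wb <- createWorkbook(type = "xlsx")

info_sheet <- createSheet(wb, sheetName = "gene_info")
addDataFrame(gene_info, info_sheet, row.names = FALSE)

expr_sheet <- createSheet(wb, sheetName = "expression")
addDataFrame(expression, expr_sheet, row.names = FALSE)
# Place the figure to the right of the table
addPicture(plot_file, expr_sheet, scale = 1, startRow = 1, startColumn = 4)

out_file <- file.path(tempdir(), "GENE_A_report.xlsx")
saveWorkbook(wb, out_file)
```

The same four steps repeat for every additional analysis: create a sheet, add its table, add its figure, and save once at the end.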

zhenyisong commented 7 years ago

I do not like to use the Excel export. Excel can introduce numerous data-transformation problems that are difficult to detect. However, most biologists are used to using it as the default. I gave up using the xlsx package. Instead I use openxlsx, which exports data much faster on my Windows PC (personal experience). However, I have a great interest in your efforts to 'stores all the data tables and figures in a single excel file '. This will attract more biologists like me.

5dPZ commented 7 years ago

However, I have a great interest in your efforts to 'stores all the data tables and figures in a single excel file '. This will attract more biologists like me.

This is exactly the feature my tutorial is about. For results from a single analysis, .csv and .pdf are always superior to .xlsx. But the beauty of the xlsx package is that it can store any number of tables and figures from a pipeline in a multi-tab excel sheet.

zhenyisong commented 7 years ago

It seems that the code you pushed to your folder is not very readable. Could you tidy up your Rmd file a little? Thanks.

5dPZ commented 7 years ago

That is what I will work on for the next two days. Currently it is just the raw R script I wrote for my own use. I will change it to an Rmd-format tutorial and use results from our other analyses as data. Say you guys made 4 tutorials on SNV or RNA-seq data analyses; my tutorial will store the results from your analyses in a .xlsx file. Best,

5dPZ commented 7 years ago

I have pushed a minimal tutorial on GitHub. My original report contained more than 15 analyses. I trimmed it down to 3 analyses: a gene info card, a simple expression analysis, and an example of text mining. So the final output of the script is a .xlsx file with 3 tabs (4 actually, one is a stub).

zhenyisong commented 7 years ago

Marvelous! I used to use this package; now I use openxlsx instead. I think the xlsx package calls a Java library (via rJava), and this always leads to huge memory consumption. I hope openxlsx can perform the same task as your demo code. Your approach rekindles my curiosity about showing fully reproducible work to traditional biologists. Thanks.
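For comparison, a sketch of how a multi-tab report with a table and a figure might be written with openxlsx, which does not depend on Java. The toy data, sheet names, and output path are hypothetical and are not taken from the demo code; whether openxlsx covers everything in the full tutorial has not been checked here.

```r
library(openxlsx)

# Hypothetical toy results standing in for real analysis output
gene_info  <- data.frame(symbol = "GENE_A", chromosome = "20")
expression <- data.frame(sample = paste0("S", 1:6),
                         log2_expr = c(5.1, 4.8, 6.2, 5.9, 4.4, 5.5))

# Save a figure as PNG
plot_file <- tempfile(fileext = ".png")
png(plot_file, width = 600, height = 400)
barplot(expression$log2_expr, names.arg = expression$sample,
        ylab = "log2 expression")
invisible(dev.off())

wb <- createWorkbook()

addWorksheet(wb, "gene_info")
writeData(wb, sheet = "gene_info", x = gene_info)

addWorksheet(wb, "expression")
writeData(wb, sheet = "expression", x = expression)
# Width and height are in inches by default
insertImage(wb, sheet = "expression", file = plot_file,
            width = 6, height = 4, startRow = 1, startCol = 4)

out_file <- file.path(tempdir(), "GENE_A_report_openxlsx.xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)
```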

privefl commented 7 years ago

https://ropensci.org/technotes/2017/09/08/writexl-release/
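A minimal sketch of the writexl route for the tables-only case: passing a named list of data frames to `write_xlsx()` produces one tab per list element. The toy tables and file name are hypothetical. As far as I know writexl writes data frames only, so figures would still need xlsx or openxlsx.

```r
library(writexl)

# Hypothetical per-gene result tables; list names become tab names
tabs <- list(
  gene_info  = data.frame(symbol = "GENE_A", chromosome = "20"),
  expression = data.frame(sample = paste0("S", 1:3),
                          log2_expr = c(5.1, 4.8, 6.2))
)

out_file <- file.path(tempdir(), "GENE_A_tables.xlsx")
write_xlsx(tabs, path = out_file)
```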