reneshbedre / bioinfokit

Bioinformatics data analysis and visualization toolkit
MIT License

Gene expression correlation analysis question #35

Closed by ehinderer 3 years ago

ehinderer commented 3 years ago

I'm testing out a simple proof-of-concept where I want to use bioinfokit's stat.corr_mat to calculate gene correlations from raw expression counts. My dataframe has ~11K columns, each representing a gene, and ~11K rows, each representing a patient, with raw RNA-seq read count values for every gene. With the dataframe set up this way, could I expect to see correlation values between genes across patient samples?

I understand that I may need to normalize the data for such cross-sample comparisons, but I wanted to make sure that I understood the basic operation first. This example differs from the worked example in that I would like to visualize gene-gene expression correlations rather than fold change-treatment correlations. Is this an appropriate use of corr_mat? If not, is there another function in bioinfokit that may do what I'm trying to visualize?

Thank you in advance!

reneshbedre commented 3 years ago

@ehinderer

To get the correlation values for genes across the patient samples using stat.corr_mat, the matrix needs to have the genes as columns and the patients as rows; transpose it first if it is laid out the other way. Once the matrix is in that orientation, you can run the stat.corr_mat function to get the correlation matrix for the genes (gene-gene correlations).
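
As a rough illustration of why the orientation matters, here is a minimal sketch using plain pandas rather than bioinfokit itself. The gene names, patient names, and counts are made up for the example. It relies only on the fact that `DataFrame.corr()` correlates columns pairwise, so genes must be the columns to get a gene-by-gene matrix.

```python
import pandas as pd

# Hypothetical toy data: rows are patients, columns are genes.
counts = pd.DataFrame(
    {
        "geneA": [10, 20, 30, 40],
        "geneB": [20, 40, 60, 80],  # rises together with geneA
        "geneC": [40, 30, 20, 10],  # falls as geneA rises
    },
    index=["patient1", "patient2", "patient3", "patient4"],
)

# DataFrame.corr() works column against column, so with genes as columns
# the result is a gene-by-gene correlation matrix.
gene_corr = counts.corr(method="pearson")

# If the table had been stored with genes as rows instead,
# it would need to be transposed back before calling corr().
genes_as_rows = counts.T
gene_corr_from_transposed = genes_as_rows.T.corr(method="pearson")

print(gene_corr.round(2))
```

Calling `corr()` on the genes-as-rows table without transposing would instead give a patient-by-patient matrix, which answers a different question.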

Data normalization is optional and depends on the hypothesis you are evaluating.
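
If normalization does turn out to be needed, one common option (an assumption here, not something this thread prescribes) is to scale each patient's counts to counts per million and take a log. The sketch below uses plain pandas and NumPy with made-up gene names, patient names, and counts, and keeps the patients-as-rows layout.

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts: rows are patients, columns are genes.
# The three patients have different sequencing depths.
counts = pd.DataFrame(
    {
        "geneA": [100, 300, 50],
        "geneB": [900, 700, 150],
        "geneC": [1000, 1000, 300],
    },
    index=["patient1", "patient2", "patient3"],
)

# Library size = total counts per patient (sum across each row).
library_size = counts.sum(axis=1)

# Counts per million: divide each row by its library size, then scale.
cpm = counts.div(library_size, axis=0) * 1e6

# log2 with a pseudocount of 1 to avoid taking the log of zero.
log_cpm = np.log2(cpm + 1)

print(log_cpm.round(2))
```

The resulting `log_cpm` table has the same shape and orientation as the raw counts, so it can be passed on to the correlation step in place of the raw table.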

If you are looking for gene co-expression network analysis, you can also use the WGCNA R package.

ehinderer commented 3 years ago

Thank you, this was mainly a sanity check for me. Appreciate it!