This script analyzes bulk RNA-seq data. It merges RNA-seq count files from two separate runs, optionally renames samples, and applies two dimensionality reduction methods: Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). These methods make high-dimensional gene expression data easier to visualize and interpret.
The script consists of three main steps: merging the count data (with optional sample renaming), PCA, and UMAP.
Make sure you have created a Conda environment with the required packages. You can set it up using the environment.yml file included in the repository.
To create the environment, run:
conda env create -f environment.yml
Input files:
- Two Excel files (run1.xlsx and run2.xlsx) containing RNA-seq counts with gene symbols and sample information.
- A metadata file (Samples_metadata.xlsx) containing metadata for each sample, such as disease type, differentiation state, and RNA-seq run.

The script begins by merging the two RNA-seq count files (run1.xlsx and run2.xlsx) on their common gene symbols. Missing values are filled with zeros. After merging, the script optionally renames specific sample columns, as defined by a dictionary (rename_dict). Both the merged dataset and its renamed version are saved as Excel files for further analysis.
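The merge and rename step can be sketched roughly as follows. This is a minimal illustration on toy data, not the script's actual code: the gene_symbol column name, the sample names, and the contents of rename_dict are hypothetical, and the outer join is an assumption made so that genes missing from one run appear as zeros.

```python
import pandas as pd

# Toy stand-ins for run1.xlsx and run2.xlsx (column names are assumptions).
run1 = pd.DataFrame({"gene_symbol": ["GAPDH", "SOX2", "NES"],
                     "S1": [120, 35, 4],
                     "S2": [98, 40, 7]})
run2 = pd.DataFrame({"gene_symbol": ["GAPDH", "NES", "MAP2"],
                     "S3": [110, 60, 22]})

# Assumed outer merge: keep genes found in either run. Genes absent from
# one run produce NaN, which is then replaced with a zero count.
combined = pd.merge(run1, run2, on="gene_symbol", how="outer").fillna(0)

# Optional renaming; this mapping is a hypothetical example of rename_dict.
rename_dict = {"S1": "IPS_rep1", "S3": "NRN_rep1"}
renamed = combined.rename(columns=rename_dict)

print(renamed)
```

With real data, the two tables would come from pd.read_excel() and each result would be written with DataFrame.to_excel(), which requires an Excel engine such as openpyxl.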
Key Functions:
- Merging uses the pd.merge() function.
- Results are written to Excel files (combined_run1run2.xlsx and renamed_combined_run1run2.xlsx).

PCA is a linear dimensionality reduction technique used to identify major sources of variation in gene expression data. This step includes standardizing the data with StandardScaler.
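As a rough sketch of the PCA step, the example below standardizes a simulated count matrix and projects the samples onto two principal components. The random data, the sample and gene names, and the choice of two components are assumptions for illustration; the script's own settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated counts: 200 genes (rows) x 6 samples (columns), hypothetical names.
rng = np.random.default_rng(0)
counts = pd.DataFrame(rng.poisson(50, size=(200, 6)),
                      index=[f"gene{i}" for i in range(200)],
                      columns=[f"sample{i}" for i in range(1, 7)])

# scikit-learn expects samples as rows, so transpose before scaling.
X = counts.T
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per gene

pca = PCA(n_components=2)
pcs = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(pcs, columns=["PC1", "PC2"], index=X.index)
explained = pca.explained_variance_ratio_  # fraction of variance per component
print(pca_df)
print(explained)
```

Reporting explained_variance_ratio_ alongside the plot helps judge how much of the variation the first two components actually capture.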
Key Functions:
- PCA() from sklearn.decomposition to compute principal components.
- sns.scatterplot() from seaborn to visualize PCA results.

UMAP is a non-linear dimensionality reduction method that preserves complex relationships in the data. The script applies UMAP to the RNA-seq data, generating a 2D embedding (UMAP1 and UMAP2), which is then merged with the sample metadata for visualization.
Key Functions:
- umap.UMAP() to create a UMAP model and transform the RNA-seq data into lower dimensions.
- sns.scatterplot() to visualize UMAP results.

Both PCA and UMAP results are visualized as scatter plots, with different colors and markers representing disease types, differentiation states, and RNA-seq runs. Sample labels are added to each point for easier identification.
The resulting plots are saved as JPEG images (PCA_plot.jpeg and UMAP_plot.jpeg) and displayed when the script runs.
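A labeled scatter plot of this kind could be produced along the following lines. The coordinates and the Disease and State column names are made up for illustration, and the non-interactive Agg backend is used only so the sketch runs without a display; remove that line and call plt.show() to view the figure interactively.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption for this sketch
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical PCA coordinates already merged with sample metadata.
plot_df = pd.DataFrame({
    "Sample": ["A1", "A2", "B1", "B2"],
    "PC1": [-2.1, -1.8, 1.9, 2.3],
    "PC2": [0.4, -0.3, 0.8, -0.6],
    "Disease": ["Control", "Control", "Patient", "Patient"],
    "State": ["IPS", "NPC", "IPS", "NPC"],
})

fig, ax = plt.subplots(figsize=(6, 5))
# Color encodes disease type, marker shape encodes differentiation state.
sns.scatterplot(data=plot_df, x="PC1", y="PC2",
                hue="Disease", style="State", s=80, ax=ax)

# Label each point with its sample name, offset slightly from the marker.
for _, row in plot_df.iterrows():
    ax.annotate(row["Sample"], (row["PC1"], row["PC2"]),
                xytext=(4, 4), textcoords="offset points", fontsize=8)

ax.set_title("PCA of RNA-seq samples")
fig.savefig("PCA_plot.jpeg", dpi=300, bbox_inches="tight")
```

The same pattern applies to the UMAP plot by swapping in the UMAP1 and UMAP2 columns and the UMAP_plot.jpeg file name.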
Output files:
- combined_run1run2.xlsx (before renaming)
- renamed_combined_run1run2.xlsx (after renaming)
- PCA_plot.jpeg (PCA result)
- UMAP_plot.jpeg (UMAP result)

To run the script, make sure the input files (run1.xlsx, run2.xlsx, and Samples_metadata.xlsx) are in the same directory as the script, then run:
python RNA_seq_analysis_final.py
The script will output two Excel files and two visualizations, saved in the current working directory.
In the test data provided, we expect samples in the IPS state to group together, while samples in the NPC and NRN states group together with each other.
To customize sample renaming, edit the rename_dict dictionary.