parvanehnikpour2024 / Parvaneh_course_Project

Python course-Fall 2024, Course project repository
GNU General Public License v3.0

RNA-seq Data Analysis Script

Overview

This script analyzes bulk RNA-seq data. It merges RNA-seq count files from two separate runs, optionally renames samples, and then applies two dimensionality reduction methods: Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Both methods help visualize and interpret high-dimensional gene expression data.

The script consists of three main steps:

  1. Data Preparation: Merging and optional renaming of RNA-seq data files.
  2. Principal Component Analysis (PCA): A linear dimensionality reduction method to explore sample variability.
  3. Uniform Manifold Approximation and Projection (UMAP): A non-linear dimensionality reduction method that uncovers complex patterns in gene expression data.

Prerequisites

Software Requirements

Conda Environment

Make sure you have created a Conda environment with the required packages. You can set it up using the environment.yml file included in the repository.

To create the environment, run:

conda env create -f environment.yml

Input Files

Script Workflow

1. Data Preparation: Merging and Renaming

The script begins by merging two separate RNA-seq count data files (from run1.xlsx and run2.xlsx) based on their common gene symbols. Missing values are filled with zeros. After merging, the script optionally renames specific sample columns, as defined by a dictionary (rename_dict). The final merged and renamed dataset is saved to an Excel file for further analysis.
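The merge step described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the function name `merge_runs`, the gene column name `Gene_Symbol`, and the sample names are hypothetical, and an outer join is assumed because the description mentions filling missing values with zeros. Only `rename_dict` is taken from the text above.

```python
import pandas as pd


def merge_runs(run1, run2, gene_col="Gene_Symbol", rename_dict=None):
    """Merge two count tables on the gene column and fill gaps with 0.

    gene_col and the outer join are assumptions made for this sketch.
    """
    # Keep every gene seen in either run; genes missing from one run become NaN.
    merged = pd.merge(run1, run2, on=gene_col, how="outer")

    # Replace NaN counts with zeros in the sample columns only.
    sample_cols = [c for c in merged.columns if c != gene_col]
    merged[sample_cols] = merged[sample_cols].fillna(0)

    # Optional renaming of sample columns, e.g. {"old_name": "new_name"}.
    if rename_dict:
        merged = merged.rename(columns=rename_dict)
    return merged


# Tiny in-memory tables standing in for run1.xlsx and run2.xlsx (made-up values).
run1 = pd.DataFrame({"Gene_Symbol": ["ACTB", "GAPDH"], "S1": [10, 20]})
run2 = pd.DataFrame({"Gene_Symbol": ["GAPDH", "SOX2"], "S2": [5, 7]})
merged = merge_runs(run1, run2, rename_dict={"S1": "Control_1"})
print(merged)
```

With real data, the two tables would typically be loaded with `pd.read_excel("run1.xlsx")` and `pd.read_excel("run2.xlsx")`, and the result written with `merged.to_excel(...)` to an output path of your choice.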

Key Functions:

2. Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique used to identify the major sources of variation in gene expression data.
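A PCA step of this kind might look like the sketch below, which uses scikit-learn. The function name `run_pca`, the log2(count + 1) transform, and the per-gene standardization are assumptions made for illustration; the README does not state which preprocessing the script applies.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def run_pca(counts, n_components=2):
    """Run PCA on a genes x samples count table.

    Returns a samples x components DataFrame and the explained variance ratios.
    """
    # PCA expects samples as rows, so transpose; log-transform is an assumption.
    X = np.log2(counts.T.to_numpy(dtype=float) + 1.0)
    # Standardize each gene (also an assumption for this sketch).
    X = StandardScaler().fit_transform(X)

    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(X)

    columns = [f"PC{i + 1}" for i in range(n_components)]
    scores_df = pd.DataFrame(scores, index=counts.columns, columns=columns)
    return scores_df, pca.explained_variance_ratio_


# Synthetic counts: 100 genes x 6 samples (random values, for illustration only).
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(50, size=(100, 6)),
    index=[f"gene{i}" for i in range(100)],
    columns=[f"S{i}" for i in range(1, 7)],
)
scores, explained = run_pca(counts)
print(scores.head())
```

The explained variance ratios are commonly used to label the plot axes with the share of variance each component captures.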

Key Functions:

3. Uniform Manifold Approximation and Projection (UMAP)

UMAP is a non-linear dimensionality reduction method that preserves complex relationships in the data. The script applies UMAP to the RNA-seq data, generating a 2D embedding (UMAP1 and UMAP2), which is then merged with the sample metadata for visualization.
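The embedding-plus-metadata step can be sketched as shown below. The function names `make_umap_reducer` and `embed_samples` and the metadata column names `Sample` and `State` are hypothetical. The reducer is passed in as an argument, so the merge logic does not depend on a particular library; `umap.UMAP` from the umap-learn package is one such reducer, and it is assumed here that the script uses something similar.

```python
import pandas as pd


def make_umap_reducer(random_state=42):
    """Build a 2D UMAP reducer (requires the umap-learn package)."""
    import umap  # imported lazily so the rest of the sketch runs without it

    # With few samples, n_neighbors may need to be lowered from its default.
    return umap.UMAP(n_components=2, random_state=random_state)


def embed_samples(counts, metadata, reducer, sample_col="Sample"):
    """Embed samples in 2D and attach sample metadata.

    counts: genes x samples DataFrame; metadata: one row per sample.
    reducer: any object with fit_transform(X) returning at least 2 columns.
    """
    samples_by_genes = counts.T  # reducers expect samples as rows
    embedding = reducer.fit_transform(samples_by_genes.to_numpy(dtype=float))

    emb_df = pd.DataFrame(embedding[:, :2], columns=["UMAP1", "UMAP2"])
    emb_df.insert(0, sample_col, samples_by_genes.index.to_numpy())

    # Left join keeps every embedded sample, even if metadata is missing.
    return emb_df.merge(metadata, on=sample_col, how="left")
```

A call would then look like `embed_samples(counts, metadata, make_umap_reducer())`, where `metadata` is loaded from Samples_metadata.xlsx.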

Key Functions:

4. Visualization

Both PCA and UMAP results are visualized as scatter plots, with different colors and markers representing disease types, differentiation states, and RNA-seq runs. Sample labels are added to each point for easier identification.

The resulting plots are saved as JPEG images (PCA_plot.jpeg and UMAP_plot.jpeg) and are also displayed when the script runs.
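A labeled scatter plot of this kind could be produced with matplotlib roughly as follows. This is a simplified sketch: the function name `plot_embedding` and the column names are hypothetical, points are colored by a single metadata column only (per-run markers are omitted), and the non-interactive `Agg` backend is selected so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to show plots on screen
import matplotlib.pyplot as plt
import pandas as pd


def plot_embedding(df, x, y, color_col, label_col, out_path=None):
    """Scatter plot of two embedding columns, colored by group, with point labels."""
    fig, ax = plt.subplots(figsize=(6, 5))

    # One scatter call per group so each group gets its own color and legend entry.
    for group, sub in df.groupby(color_col):
        ax.scatter(sub[x], sub[y], label=str(group))

    # Write the sample name next to each point.
    for _, row in df.iterrows():
        ax.annotate(str(row[label_col]), (row[x], row[y]),
                    xytext=(3, 3), textcoords="offset points", fontsize=8)

    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=color_col)
    if out_path:
        fig.savefig(out_path, dpi=300)
    return fig, ax


# Made-up coordinates for three samples, for illustration only.
demo = pd.DataFrame({
    "Sample": ["A", "B", "C"],
    "PC1": [0.0, 1.0, 2.0],
    "PC2": [1.0, 0.5, -1.0],
    "State": ["IPS", "NPC", "NPC"],
})
fig, ax = plot_embedding(demo, "PC1", "PC2", "State", "Sample")
```

Passing `out_path="PCA_plot.jpeg"` would write the figure to disk under the file name mentioned above.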

Output Files

Usage Instructions

  1. Ensure the input files (run1.xlsx, run2.xlsx, and Samples_metadata.xlsx) are in the same directory as the script.
  2. Run the script in a Python environment using:
python RNA_seq_analysis_final.py

The script will output two Excel files and two visualizations, saved in the current working directory.
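Before running the script, it can be useful to confirm that the three input files are present. The helper below is an optional sketch and is not part of the script; the name `missing_inputs` is hypothetical, while the file names are the ones listed in step 1.

```python
from pathlib import Path

# Input files named in the usage instructions.
REQUIRED_INPUTS = ("run1.xlsx", "run2.xlsx", "Samples_metadata.xlsx")


def missing_inputs(directory=".", required=REQUIRED_INPUTS):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]
```

Calling `missing_inputs()` from the script's directory returns an empty list when all three files are in place.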

Expected Results

In the provided test data, we expect samples in the IPS state to cluster together, while samples in the NPC and NRN states cluster together in a separate group.

Example Output

Additional Notes

Additional Information