ma-compbio / scGHOST

single-cell Hi-C, scHi-C, Hi-C, 3D genome, nuclear organization, genome subcompartment
MIT License

Overview of scGHOST

scGHOST is an unsupervised single-cell subcompartment annotation method based on graph embedding with constrained random walk sampling. scGHOST is designed to be run on a single-cell Hi-C (scHi-C) dataset which has undergone imputation by Higashi (Zhang et al. 2022). scGHOST assigns embeddings to genomic loci in the genomes of individual cells by viewing scHi-C as graphs whose vertices are genomic loci and edges are the contact frequencies among loci. While scGHOST is developed for scHi-C data, it can also identify single-cell subcompartments in single-cell genome imaging data.
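To make the graph view concrete, here is a minimal sketch of random walk sampling on a contact map. It is not scGHOST's actual sampler: the toy matrix, the function name `sample_walk`, and the plain weighted walk (without the constraints scGHOST applies) are all illustrative assumptions. Each locus is a vertex, and the next locus is drawn with probability proportional to its contact frequency with the current one.

```python
import random

def sample_walk(contacts, start, length, rng):
    """Sample a weighted random walk over loci.

    contacts: square list of lists; contacts[i][j] is the contact
              frequency between locus i and locus j (toy data here).
    """
    walk = [start]
    current = start
    for _ in range(length - 1):
        weights = contacts[current]
        if sum(weights) == 0:
            break  # isolated locus: no edges to follow
        # Draw the next locus in proportion to contact frequency.
        current = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        walk.append(current)
    return walk

# Toy 4-locus contact map (symmetric, zero diagonal); values are made up.
toy_contacts = [
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
]

rng = random.Random(0)  # fixed seed so the walk is reproducible
walk = sample_walk(toy_contacts, start=0, length=6, rng=rng)
print(walk)
```

Walks like this are the usual input to graph embedding methods, which learn similar embeddings for loci that co-occur on the same walks.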

Running scGHOST

Input data

scGHOST uses the outputs from Higashi as its inputs. Specifically, it requires the scHi-C imputations (hdf5 format), per-cell embeddings (numpy format), sparse raw scHi-C adjacency maps (numpy format), the scA/B scores (hdf5 format), and the label info file (pickle format) describing the cell types corresponding to each cell in the dataset.

Installation

Before installing any Python packages, we strongly recommend using Anaconda (please refer to the Anaconda webpage for conda installation instructions) to create a Python 3.10 environment with the following command:

conda create --name scghost python=3.10

After creating the environment, activate it using:

conda activate scghost

Dependencies

Conda installations

Users can install scGHOST dependencies using the conda or pip commands following the specifications above.

Systems without a CUDA-capable GPU can also run scGHOST by installing the same dependencies with the CPU-only build of PyTorch. In that case, the source code in modules/clustering.py must be modified to use SKMeans instead of KMeans in the scghost_clustering_reworked function. We may add a flag in the config file to run on the CPU only, but in our experience CPU-only runs take far longer than GPU runs and are not recommended.
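Before choosing between the GPU and CPU-only setups, it can help to check whether PyTorch sees a CUDA device. The following is a small sketch and is not part of scGHOST; the helper name `cuda_is_available` is an assumption made for illustration. It relies only on `torch.cuda.is_available()` and returns False when PyTorch is not installed.

```python
import importlib.util

def cuda_is_available():
    """Return True only if PyTorch is installed and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch
    return bool(torch.cuda.is_available())

if cuda_is_available():
    print("CUDA-capable GPU detected.")
else:
    print("No CUDA device found; a CPU-only run needs the clustering change described above.")
```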

Hardware Requirements

scGHOST can use up to 40 GB of memory for a single-cell dataset of 4,238 cells. Considering operating system overhead, we recommend running scGHOST on a machine with at least 64 GB of memory to avoid poor performance or out-of-memory errors at runtime.

scGHOST was developed on a system with a 12-core 12th generation Intel CPU, an Nvidia RTX 3090 GPU with 24 GB of VRAM, and 64 GB of system memory. With GPU caching enabled, scGHOST uses a maximum of 15 GB of VRAM on the human prefrontal cortex (PFC) dataset. With GPU caching disabled, VRAM becomes less of a limiting factor and scGHOST should run on any CUDA-capable GPU with at least 4 GB of VRAM.

Usage

Users can run scGHOST using the following command:

python scghost.py --config <configuration.json>

Sample JSON config files for scGHOST have been provided.

Here, <configuration.json> is the file path to a custom configuration file that follows the JSON format for scGHOST. If no configuration file is given, scGHOST uses the included config.json file, which can be modified to the user's specifications.
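The sketch below illustrates the general pattern of a `--config` option that falls back to `config.json` and is then read as JSON. It is not the code in scghost.py; the function names `parse_args` and `load_config` and the key `example_key` are hypothetical and used only for the demonstration.

```python
import argparse
import json
import os
import tempfile

def parse_args(argv=None):
    """Parse a --config option that defaults to config.json."""
    parser = argparse.ArgumentParser(description="Illustrative config loader")
    parser.add_argument("--config", default="config.json",
                        help="path to a JSON configuration file")
    return parser.parse_args(argv)

def load_config(path):
    """Read a JSON configuration file into a Python dictionary."""
    with open(path) as handle:
        return json.load(handle)

# Demonstration with a throwaway file; the key below is hypothetical
# and does not correspond to a real scGHOST setting.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_config.json")
    with open(demo_path, "w") as handle:
        json.dump({"example_key": "example_value"}, handle)
    args = parse_args(["--config", demo_path])
    demo_config = load_config(args.config)

print(demo_config)
```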

Note: users may run into a RuntimeWarning after the clustering step. This is normal behavior and should not affect the accuracy of results.

Runtime

scGHOST was run on a machine with a 12-core 12th generation Intel CPU and an Nvidia RTX 3090 GPU with 24 GB of VRAM. From scratch, scGHOST takes about 2 hours to run on the sciHi-C GM12878 dataset and about 4 hours to run on the human prefrontal cortex dataset.

Configuration file

Tutorials

Please follow the tutorial notebooks in the root directory for examples of how to run scGHOST with and without first running Higashi. For a sample run of scGHOST, users can download the smaller WTC-11 dataset here. After downloading the sample data, edit the sample_configs/config_wtc.json configuration file so that its paths point to the downloaded files, then run the following command:

python scghost.py --config sample_configs/config_wtc.json
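A quick way to catch path mistakes before a long run is to scan the edited configuration for path-like values that do not exist. The helper below is a heuristic sketch, not part of scGHOST: the name `find_missing_paths`, the rule that any string containing a path separator is a path, and the demonstration keys are all assumptions.

```python
import os
import tempfile

def find_missing_paths(config, prefix=""):
    """Collect (key, value) pairs whose string value looks like a
    filesystem path (contains a separator) but does not exist."""
    missing = []
    if isinstance(config, dict):
        items = config.items()
    elif isinstance(config, list):
        items = enumerate(config)
    else:
        return missing
    for key, value in items:
        name = f"{prefix}{key}"
        if isinstance(value, (dict, list)):
            # Recurse into nested sections, building a dotted key name.
            missing.extend(find_missing_paths(value, prefix=name + "."))
        elif isinstance(value, str) and ("/" in value or os.sep in value):
            if not os.path.exists(value):
                missing.append((name, value))
    return missing

# Demonstration with hypothetical keys (not real scGHOST settings).
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "data.hdf5")
    open(existing, "w").close()
    demo_config = {
        "input_file": existing,
        "output_dir": os.path.join(tmp, "not_created_yet"),
        "nested": {"labels": os.path.join(tmp, "missing.pkl")},
        "n_clusters": 5,
    }
    missing = find_missing_paths(demo_config)

print(missing)
```

To apply it to the sample configuration, load sample_configs/config_wtc.json with `json.load` and pass the resulting dictionary to `find_missing_paths`. Output locations that have not been created yet will also be reported, so treat the result as a list of paths to review rather than a list of errors.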

Contact

Please email jianma@cs.cmu.edu or raise an issue in the GitHub repository with any questions about installation or usage, or to report any bugs.