- About Xenomnake
1.1. PDX Model
1.2. Spatial Integration
1.3. Pipeline Overview
1.4. Software Dependencies
1.5. Publication
- Installation
2.1. Create conda environment and build dependencies
- Running Xenomake
3.1. QuickStart
3.1.2 Memory Requirements and Runtimes
3.2. Step1: Xenomake Init
3.2.1. Description
3.2.2. Flags
3.2.3. Command Line Implementation
3.2.4. Output
3.3. Step2: Xenomake Species
3.3.1 Description
3.3.2 Flags
3.3.3. Command Line Implementation
3.3.4. Output
3.4 Step3: Xenomake Config
3.4.1 Decsription
3.4.2 Flags
3.4.3 Command Line Implementation
3.5. Step3: Run Snakemake Pipeline
3.5.1. Snakemake Description
3.5.2. Dry Run
3.5.3. Execute Snakemake Workflow
3.5.4. Snakemake Standard Outs
3.5.5. Xenomake Output Structure
- Downstream Cell-Cell Interactions
- QC
5.1. Scanpy QC Plots
5.2. Xengsort Metrics
5.3. STAR Metrics
5.4. UMI Metrics
- Test Dataset
6.1. Download Data
6.2. Run Xenomake Pipeline
- Optional Downloads
7.1. Dropseq
7.2. Picard
About Xenomake
Processing pipeline for Patient Derived Xenograft (PDX) Spatial Transcriptomics Datasets
PDX Model
Patient-derived xenografts are a widely used component of cancer research
These models consist of grafting a human tumor sample into mice to study interactions between a patient's tumor and stroma, screening compounds that target a patient's tumor, and simulating human tumor biology
Spatial Integration
With the rising popularity of spatially resolved transcriptomic technologies, there is a growing need for processing pipelines that can handle reads from PDX samples. Spatial processing has the potential to analyze tumor/stroma interactions and tumor microenvironments.
Pipeline
To facilitate the adoption of spatial transcriptomics for PDX studies, we thus have developed a pipeline "Xenomake", which is an end-to-end pipeline that includes read alignment and gene quantification steps for reads generated by spatial transcriptomic platforms and uses a xenograft sorting tool (Xengsort) to apportion xenograft reads to the host and graft genomes.
Xenomake's structure is based on Snakemake and includes scripts to handle multi-mapping, downstream analysis, and handling "ambiguous" reads
NEW: Addition of single cell compatibility with 10x workflow
Software Dependencies
Xenomake utilizes a number of pre-existing tools for our pipeline and would like to thank works from these groups. We encourage users to read through these tools to understand how our pipeline is built.
Publication
Xenomake: a pipeline for processing and sorting xenograft reads from spatial transcriptomic experiments
Benjamin S Strope, Katherine E Pendleton, William Z Bowie, Gloria V Echeverria, Qian Zhu
bioRxiv 2023.09.04.556109; doi: https://doi.org/10.1101/2023.09.04.556109
Installation
Create the conda environment and build dependencies
# Create conda environment with the provided environment file
conda env create -n xenomake -f environment.yaml
# Activate conda environment
conda activate xenomake
#install xenomake
pip install xenomake
Running Xenomake
Quickstart
Quickstart: Initialize Xenomake:
#Initialize Xenomake
xenomake init \
-r1 <read1> \
-r2 <read2> \
-s <sample name> \
-o <output directory> \
--spatial_mode <preset spatial mode [visium,dbit-seq,seq-scope,slide-seq,hdst] or custom> \
--run_mode <preset run mode [lenient,prude] or custom> \
Quickstart: Add Species Information:
# Print Help Function Call
xenomake species --help
# Create species folder
xenomake species \
-mr <mouse_reference_assembly.fa> \
-hr <human_reference_assemble.fa> \
-ma <mouse_annotation.gtf> \
-ha <human_annotation.gtf> \
Quickstart: Spatial Configuration:
If default run mode and spatial mode, this step can be skipped
If 'custom' run mode or spatial mode, you must input the following parameters
# Print Help Function Call
xenomake config --help
# Create configuration file
xenomake config \
--barcode_file <barcode whitelist> \
--umi <umi structure> \
--cell_barcode <cell barcode structure> \
--ambiguous <re-partition ambiguous reads from xenograft sorting> \
--downstream <perform downstream data processing> \
--genic_only <only count genic reads (i.e., no UTR, intergenic,intronic> \
--mm_reads <if read is mult-imapped, choose most likely position>
Quickstart: Run Snakemake Pipeline
# Dry Run
xenomake run -n
# Initialize Xenomake Pipeline (species dir and config.yaml need to be created)
xenomake run \
--cores <n_threads>
Memory Requirements and Runtimes
Process |
Threads |
Time |
RAM Recommended |
Xengsort Index |
8 |
33 min |
25 Gb |
STAR Index |
8 |
~90 min per genome |
128 Gb |
Xenomake Pipeline (Test Dataset: 8.4m reads) |
8 |
37 min |
35 Gb |
Xenomake Pipeline (Test Dataset: 8.4m reads) |
4 |
42 min |
35 Gb |
Xenomake Pipeline (Medulloblastoma: 88m reads) |
8 |
300 min |
35 Gb |
Step 1: Project Initialization:
This is a REQUIRED step to initialize your implementation of the xenomake pipeline
Description
xenomake init takes information about your reads, run modes, spatial chemistry, and sample/project names to create a configuration file for your project
Run Modes
===========
Run modes refer to how reads are handled during mapping and xenograft sorting portions of the pipeline. Options are the following:
1. lenient: xenograft ambiguous re-mapped (True), multi-mapped re-aligned (True), only genic reads (True)
2. stringent: xenograft ambiguous re-mapped (False), multi-mapped re-aligned (False), only genic reads (True)
3. custom: User Defined
***Spatial Modes***
===============
Spatial modes refer to the spatial technology used to generate reads
1. visium
2. slide-seq
3. dbit-seq
4. hdst
5. sc10x
6. seq-scope
7. custom: User Defined barcode structure, umi structure, spot size, slide size, barcode file, number of beads/spos
### Required Flags:
- **--read1:** paired-end read in fastq format
- **--read2:** second paired-end read
- **--outdir:** name of project directory where the output file will be written
- **--sample:** name of sample to prepend filenames and create sample directory within the project directory
- **--spatial_mode:** preset run modes based upon spatial method used [custom,visium,sc10x,slide-seq,hdst,seq-scope,dbit_seq]
- **--run_mode:** preset run modes based upon read processing. This controls multi-mapped, ambiguous, and non-genic read handling [custom, stringent, lenient]
### Optional Flags:
- **--root_dir**: directory of project execution (default: ./)
- **--temp_dir**: temporary directory (default: /tmp)
- **--picard**: path to user preferred picard.jar tool (default: tools/picard.jar)
- **--dropseq_tools**: path to user preferred download of dropseq tools directory (default: tools/Dropseq_2.5.1)
- **--download_species**: download mm10 and hg38 genomes and annotations to the current directory (default: False)
- **--download_index**: download star indices and xengosort index to the current directory (default: False)
### Command Line Implementation:
```
# Print Help Function Call
xenomake init --help
#Initialize Project
xenomake init \
-r1 \
-r2 \
-o \
-s \
--spatial_mode \
--run_mode
```
### Output Message
```
sample name: A1
spatial chemistry: visium
run mode: lenient
project "downsampled" initialized, proceed to species setup and project execution
```
### Example Configuration File
![config.yaml](https://github.com/istrope/Xenomake/blob/main/figures/config.png)
## Step 2: Xenomake Species
This is a **REQUIRED** step to initialize your implementation of the xenomake pipeline
### Description
*xenomake species* links all inputs in a new directory structure that is then easily referenced in project run
### Required Flags:
- **--mouse_ref:** Mouse Reference Assembly in fasta format
- **--human_ref:** Human Reference Assembly in fasta format
- **--mouse_annotation:** Mouse Genome Annotation File in .gtf format
- **--human_annotation:** Human Genome Annotation File in .gtf format
### Optional Flags:
Including optional flags for indices will skip parts in the pipeline such as *STAR --runMode CreateIndex* and *xengsort index*
These two rules can be time-consuming and if these files are already generated, it doesn't need to be done again.
- **--human_index:** STAR index directory for human genome reference
- **--mouse_index:** STAR index directory for mouse genome reference
- **--xengsort_hash:** Xengsort hash table filepath to be used in xengsort classify step
- **--xengsort_info:** Xengosrt info table filepath to be used in xengsort classify step
### Command Line Implementation:
```
# Print Help Function Call
xenomake species --help
# Create species folder
xenomake species \
--mouse_reference \
--human_reference \
--mouse_annotation \
--human_annotaion \
```
### Output Message
```
species directory successfully completed, keep project execution in the current directory
```
### Output Directory Structure
**├─ species**
│ ├── idx.hash
│ ├── idx.info
**│ ├── human**
│ │ ├── genome.fa
│ │ ├── annotation.gtf
│ │ ├── star_index/
**│ ├── mouse**
│ │ ├── genome.fa
│ │ ├── annotation.gtf
│ │ ├── star_index/
## Step 3: Xenomake Config (custom run)
### Description
Only necessary ***if you selected custom*** for either *run_mode* or *spatial_mode* as a part of xenomake init
### Spatial Flags
- **--barcode_file**:
- **--umi**:
- **--cell_barcode**:
### Run Flags
- **downstream**:
- **mm_reads**:
- **genic_only**:
- **--ambiguous**:
### Command Line Implementation:
if ***spatial_mode***: custom
```
xenomake config \
--barcode_file \
--umi \
--cell_barcode \
```
if ***run_mode*** : custom
```
xenomake config \
--downstream \
--mm_reads \
--genic_only \
--ambiguous
```
## Step 4: Execute Project
### Snakemake
Xenomake uses the Snakemake workflow management system to create reproducible and scalable data analyses
To get an understanding of snakemake, please visit: https://snakemake.github.io/
For additional information, read Snakemake's paper https://doi.org/10.12688/f1000research.29032.1
### Dry Run
It is recommended to first execute a dry-run with flag **--dry-run**, Xenomake will only show the execution plan instead of performing steps.
### Flags:
**-cores**: number of cores to use for project run
**--dry-run**: Dry-Run showing the workflow
**--rerun-incomplete**: Force to rerun incompletely generated files if project halted
**--keep-going**: If a job fails, keep executing independent jobs that don't rely on its output
```
#Dry Run
xenomake run --cores --dry-run
#Run Project
xenomake run --cores --keep-going
```
### Example Snakemake Standard Outs
Below shows an example of the messages printed by Snakemake. Briefly, they give an overview of the jobs in line for execution and print messages when a rule is being performed. This includes a few important features:
1. Start time
2. Finish time
3. Rule name
4. Input files
5. Output file destinations
6. Wildcard values
7. Additional params and log destinations
Snakemake will also print any standard-out messages from tools performed. We have saved some of these as log files for subsequent review
```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 DGE_Human
1 Expression_QC
...
1 Subset_Overlapped
1 Xengsort_Clasify
1 all
28
Select jobs to execute...
[Fri Sep 8 10:36:04 2023]
rule Preprocess:
input: fastq/sub1.fq, fastq/sub2.fq
output: out/treated/preprocess/unaligned_bc_umi_tagged.bam, out/treated/logs/Preprocess.summary
log: out/treated/logs/Preprocess.log
jobid: 6
wildcards: OUTDIR=out, sample=treated
[Fri Sep 8 10:36:06 2023]
Finished job 6.
1 of 28 steps (4%) done
Select jobs to execute...
...
...
...
[Fri Sep 8 10:47:41 2023]
localrule all:
input: out/treated/final/treated_human_counts.tsv.gz, out/treated/final/treated_mouse_counts.tsv.gz, out/treated/final/treated_human.hdf5, out/treated/final/treated_mouse.hdf5, out/treated/final/treated_human.h5ad, out/treated/final/treated_mouse.hdf5, out/treated/final/treated_mouse.h5ad, out/treated/final/treated_human_processed.h5ad, out/treated/final/treated_mouse_processed.h5ad, out/treated/qc/human_qc.png
jobid: 0
threads: 16
[Fri Sep 8 14:37:01 2023]
Finished job 0.
29 of 29 steps (100%) done
```
### Output Directory Structure
├─ **OUTDIR**
**│ ├── sample**
│ │ ├── final.bam: Mapped, Sorted, Xenograft Processed alignments in bam format
**│ │ ├── final**
│ │ │ ├── tissue_positions_list.csv
│ │ │ ├── {sample}_counts.tsv.gz: *Gene by Barcode umi count matrix generated by dropseq DigitalGeneExpression*
│ │ │ ├── {sample}_human.h5ad:*Spatial Anndata object*
│ │ │ ├── {sample}_human.hdf5: *10X genomics formatted output structure (can be used for downstream spatial processing)*
**│ ├── xengsort:** *Xenograft sorted fastq files generated using xengsort toolkit*
│ │ │ ├── ambiguous.fq.gz
│ │ │ ├── both.fq.gz
│ │ │ ├── graft.fq.gz
│ │ │ ├── host.fq.gz
│ │ │ ├── neither.fq.gz
**│ ├── mapping**
│ │ │ ├── STAR aligned genomes: *Aligned files without Xenograft Processing*
**│ ├── logs**
│ │ │ ├── *Tagging and trimming logs*
│ │ │ ├── *STAR log files*
│ │ │ ├── *BAM subsetting logs*
│ │ │ ├── *Xengsort log*
│ │ │ ├── *Multi-map filtering logs*
│ │ │ ├── *Digital Gene Expression logs*
**│ ├── qc**
│ │ │ ├── *Xengsort summary files*
│ │ │ ├── *STAR summary files*
│ │ │ ├── *Dataset qc plots*
│ │ │ ├── *Digital Gene Expression summary*
**│ ├── preprocessing**
│ │ │ ├── unaligned_bc_umi_tagged.bam: Unaligned bam file. Cell-barcode/UMI tagged and Adapter/PolyA trimmed using dropseq workflows
# Downstream Cell-Cell Interactions
## Intro
Using the outputs from Xenomake, we were able to identify group specific expression and differential signalling genes in our model. We provide scripts in the repository `CellInteract` to perform identical analyses that are present in the Xenomake Publication
## Cell-Cell Signalling
Cell Signalling molecules such as chemokines/cytokines are simultaneously secreted by both stromal cell types and epithelial tumor cells in the tumor microenvironment. It is often difficult to attribute signalling expression to stromal or tumor cells in standard models. However in PDX models processed with means similar to Xenomake, we are able to study the differential production of cellular signals and identify biomarkers in stromal and epithelial compartments
## Xenomake Publication Figure (Organism Specific Signalling)
![Xenomake Plot](https://github.com/istrope/Xenomake/blob/main/figures/human_mouse_split.jpg)
## Repository
Access further description as well as scripts to perform these downstream analysis:
`Link to CelllInteract Repository`: https://github.com/bernard2012/CellInteract
# QC
## Scanpy QC Plots
![QC Metrics](https://github.com/istrope/Xenomake/blob/main/figures/scanpy_qc.png)
## Xengsort Metrics
Counts number of reads partitioned to each file
| prefix | host | graft | ambiguous | both | neither |
|--------|---------|----------|-----------|---------|---------|
| test | 6652723 | 13654867 | 2603727 | 1297366 | 154218 |
```
## Running time statistics
- Running time in seconds: 225.0 sec
- Running time in minutes: 3.75 min
- Done. [2023-08-17 16:19:49]
```
## STAR Metrics
```
Started job on | Aug 16 18:57:26
Started mapping on | Aug 16 18:58:54
Finished on | Aug 16 19:04:22
Mapping speed, Million of reads per hour | 968.48
Number of input reads | 88239353
Average input read length | 108
UNIQUE READS:
Uniquely mapped reads number | 36755354
Uniquely mapped reads % | 41.65%
Average mapped length | 109.05
Number of splices: Total | 7555478
Number of splices: Annotated (sjdb) | 7421066
Number of splices: GT/AG | 7487922
Number of splices: GC/AG | 23361
Number of splices: AT/AC | 1030
Number of splices: Non-canonical | 43165
Mismatch rate per base, % | 1.75%
Deletion rate per base | 0.04%
Deletion average length | 1.90
Insertion rate per base | 0.04%
Insertion average length | 1.89
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 10085928
% of reads mapped to multiple loci | 11.43%
Number of reads mapped to too many loci | 768576
% of reads mapped to too many loci | 0.87%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 38624549
% of reads unmapped: too short | 43.77%
Number of reads unmapped: other | 2004946
% of reads unmapped: other | 2.27%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
```
## UMI Metrics
Example of top 4 spots with highest expression
| CELL_BARCODE | NUM_GENIC_READS | NUM_TRANSCRIPTS | NUM_GENES |
|------------------|-----------------|-----------------|-----------|
| CAGTTCCGCGGGTCGA | 429279 | 67931 | 9287 |
| GAAACTCGTGCGATGC | 398339 | 59154 | 8748 |
| AAGAGATGAATCGGTA | 362017 | 57447 | 8723 |
| CCAAGAAAGTGGGCGA | 367515 | 54742 | 8306 |
# Run Xenomake on Medulloblastoma Test Dataset
Follow Installation Instructions for Xenomake if not done already
## Dataset
Paired End Fastq Files Downsamples to 8.4 million reads
Taken from: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01185-4
### Download Dataset
```
wget https://zenodo.org/records/10655403/files/sub1.fq.gz
wget https://zenodo.org/records/10655403/files/sub2.fq.gz
```
## Run Xenomake Pipeline
### Initialize Project
```
xenomake init \
-r1 sub1.fq.gz \
-r2 sub2.fq.gz \
-s downsampled \
-o test_run \
--run_mode lenient \
--spatial_mode visium \
--download_genome True \
--download_index True \
--download_xengsort True
```
###
### Species Setup
```
xenomake species \
-hr GRCh38.p14.fna.gz \
-ha GRCh38.p14.gtf \
-mr GRCm39.fna.gz \
-ma GRCm39.gtf \
--human_index human_index/ \
--mouse_index mouse_index/ \
--xengsort_hash idx.hash \
--xengsort_info idx.info
```
### Run Project
```
xenomake run --cores 4
```
## Optional Downloads
### Download Dropseq
Current version of Dropseq-tools is 2.5.3 and is present at /tools/Dropseq-2.5.3/
For updated versions check:
https://github.com/broadinstitute/Drop-seq
### Download Picard
#### Version Control
The provided picard.jar file is currently compatible with Java installed in the Xenomake conda environment.
Java: v11.0.15
Picard: v2.27.5
#### Picard Github
If you choose to use your own Picard executable, ensure compatibility with Java
https://broadinstitute.github.io/picard/