Easy Copy Number !
EaCoN aims to be an all-packed in, user-friendly solution to perform relative or absolute copy-number analysis for multiple sources of data, with three different segmenters available (and corresponding three copy-number modelization methods). It consists in a series of R packages that perform such type of analysis, from raw CEL files of Affymetrix microarrays (GenomeWide snp6, OncoScan, CytoScan 750K, CytoScan HD) or from aligned reads as BAMs for WES (whole exome sequencing).
Please first install the devtools package that will allow installing packages from github :
install.packages('devtools')
WARNING : If you get a GITHUB_PAT error when using the _devtools::installgithub() function, please run the following line once per session before running _devtools::installgithub() :
Sys.unsetenv("GITHUB_PAT")
Then install ASCAT and FACETS from github :
devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")
devtools::install_github("mskcc/facets")
Then install required CRAN package(s) :
## try using http:// if https:// URLs are not supported
if(!installed.packages('BiocManager')) install.packages('BiocManager')
install.packages('sequenza')
Then install required BioConductor packages :
## try using http:// if https:// URLs are not supported
if(!installed.packages('BiocManager')) install.packages('BiocManager')
BiocManager::install(c("affxparser", "Biostrings", "aroma.light", "BSgenome", "copynumber", "GenomicRanges", "limma", "rhdf5", "sequenza"))
Then install EaCoN from github !
## Install the most recent STABLE version (@master)
devtools::install_github("gustaveroussy/EaCoN")
While the current EaCoN package is the core of the process and will straightly work for WES data, multiple other packages are needed to properly handle Affymetrix microarray : APT (affymetrix power tools), designs and corresponding annotations (genome build, Affymetrix annotation databases) ; others are required for the (re)normalization, especially pre-computed GC% or Wavetracks.
The affy.CN.norm package provides pre-computed GC% and wave-effect (re)normalization datasets for all compatible Affymetrix designs, for both NA33/NA35 (hg19) and NA36 (hg38) human genome builds. Install from remote URL :
install.packages("https://zenodo.org/record/5494853/files/affy.CN.norm.data_0.1.2.tar.gz", repos = NULL, type = "source")
First, install embedded APT tool from github :
devtools::install_github("gustaveroussy/apt.oncoscan.2.4.0")
Then install annotations from remote URL :
For the NA33 (hg19) build :
For the OncoScan design :
install.packages("https://zenodo.org/record/5494853/files/OncoScan.na33.r4_0.1.0.tar.gz", repos = NULL, type = "source")
For the OncoScan_CNV design :
install.packages("https://zenodo.org/record/5494853/files/OncoScanCNV.na33.r2_0.1.0.tar.gz", repos = NULL, type = "source")
For the NA36 (hg38) build :
For the OncoScan design :
install.packages("https://zenodo.org/record/5494853/files/OncoScan.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
For the OncoScan_CNV design :
install.packages("https://zenodo.org/record/5494853/files/OncoScanCNV.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
First, install the embedded APT tool from github :
devtools::install_github("gustaveroussy/apt.cytoscan.2.4.0")
Then install annotations from remote URL :
For the NA33 (hg19) build :
For the CytoScan 750k design :
install.packages("https://zenodo.org/record/5494853/files/CytoScan750K.Array.na33.r4_0.1.0.tar.gz", repos = NULL, type = "source")
For the CytoScan HD design :
install.packages("https://zenodo.org/record/5494853/files/CytoScanHD.Array.na33.r4_0.1.0.tar.gz", repos = NULL, type = "source")
For the NA36 (hg38) build :
For the CytoScan 750k design :
install.packages(https://zenodo.org/record/5494853/files/CytoScan750K.Array.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
For the CytoScan HD design :
install.packages("https://zenodo.org/record/5494853/files/CytoScanHD.Array.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
Lastly, install the rcnorm package to perform BAF normalization for the CytoScan family of arrays :
install.packages("https://zenodo.org/record/5494853/files/rcnorm_0.1.5.tar.gz", repos = NULL, type = "source")
First, install the embedded APT tool from github :
devtools::install_github("gustaveroussy/apt.snp6.1.20.0")
Then install annotations from remote URL (There is no other build available than NA35 (hg19)) :
install.packages("https://zenodo.org/record/5494853/files/GenomeWideSNP.6.na35.r1_0.1.0.tar.gz", repos = NULL, type = "source")
Lastly, install the rcnorm package to perform BAF normalization for SNP6 arrays (if not already installed at the CytoScan step!) :
install.packages("https://zenodo.org/record/5494853/files/rcnorm_0.1.5.tar.gz", repos = NULL, type = "source")
EaCoN requires a genome as available thanks to the BSgenome package, available at BioConductor. To check which genomes are publicly availabe at BioConductor :
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install('BSgenome')
BSgenome::available.genomes()
To check genome(s) installed in your R library :
BSgenome::installed.genomes()
For Affymetrix microarrays, you need to install these human genomes depending on which annotation package you want to use :
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
## To support NA33 / NA35 annotations (hg19)
BiocManager::install('BSgenome.Hsapiens.UCSC.hg19')
## To support NA36 annotations (hg38)
BiocManager::install('BSgenome.Hsapiens.UCSC.hg38')
For TCGA WES data, you will need the hs37d5 genome (a variation of the hg19 build used in the 1000 Genomes project)
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
If your favorite genome is not available, it is possible to build your own !
The full workflow is decomposed into a few different functions, which roughly correspond to these steps :
normalization -> segmentation +-> reporting
|
+-> copy-number estimation
EaCoN allows different ways of running the full workflow : considering the analysis of a single sample, you can either run each step independently and write, then load the intermediate results, or you can pipe all steps in a single line of code. You can also run the step-by-step approach on multiple samples in a row, even possibly at the same time using multithreading, using a batch mode.
First, under R, load EaCoN and choose a directory for writing results, for example : /home/project/EaCoN_results
require(EaCoN)
setwd("/home/project/EaCoN_results")
Let's say we have a pair of OncoScan_CNV CEL files to analyse in a /home/project/CEL/ directory (Affymetrix OncoScan experiments have 2 arrays for a single experiment, thus a pair) :
OS.Process(ATChannelCel = "/home/project/CEL/S1_OncoScan_CNV_A.CEL", GCChannelCel = "/home/project/CEL/S1_OncoScan_CNV_C.CEL", samplename = "S1_OS")
This will perform the normalization step, create a /home/project/EaCoN_results/S1_OS/ subdirectory and write 5 files in it :
This is identical to OncoScan, except that we have a single CEL file, and we just have to change the function name :
CS.Process(CEL = "/home/project/CEL/S2_CytoScanHD.CEL", samplename = "S2_CSHD")
Here again, this is identical to OncoScan, except that we have a single CEL file, and we just have to change the function name :
SNP6.Process(CEL = "/home/project/CEL/S3_GenomeWide_snp.6.CEL", samplename = "S3_SNP6")
This time it is quite different : the processing will be performed in three steps :
BINpack.Maker(bed.file = "/home/project/WES/SureSelect_v5.bed", bin.size = 50, genome.pkg = "BSgenome.Hsapiens.UCSC.hg19")
This will generate a "BINpack" (with a ".rda" extension) to be used in the next normalization steps : /home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda
PLEASE NOTE THAT THIS STEP IS SAMPLE-INDEPENDENT, THUS NEEDS TO BE PERFORMED AGAIN ONLY IF YOU CHANGE EITHER THE CAPTURE BED, THE BIN SIZE OR THE GENOME BUILD. Thus, the generated BINpack can be used for any other sample in the same conditions.
Second, the WES data will be binned using the generated BINpack. We need three files as input :
The aligned reads for the test sample (usualy in cancer, the patient tumor), in BAM format
The aligned reads for the reference sample (patient normal), in BAM format too
The BINpack itself
WES.Bin(testBAM = "/home/project/WES/S4_WES_hg19_Tumor.BAM", refBAM = "/home/project/WES/S4_WES_hg19_Tumor.BAM", BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", samplename = "S4_WES")
This will generate a /home/project/EaCoN_results/S4_WES/ subdirectory which contains :
Third, now that the data have been binned, the normalization step can be performed :
WES.Normalize.ff(BIN.RDS.file = "/home/project/EaCoN_results/S4_WES/S4_WES_hg19_b50_binned.RDS", BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda")
Now that we described the normalization process specific to each type of source data, we can segment it. The good news is that it's the very same step for each source type, one just have to pass the _processed.RDS normalized file. Here is an example with the one obtained for OncoScan data, using the ASCAT segmenter :
Segment.ff(RDS.file = "/home/me/my_project/EaCoN_results/SAMPLE1/S1_OncoScan_CNV_hg19_processed.RDS", segmenter = "ASCAT")
To perform the same using the FACETS segmenter, just change the value of the segmenter parameter !
I suppose you guessed how to do the same with SEQUENZA, right ? ;)
Then an estimation of the total and allele-specific copy-number profiles, as well as global ploidy and sample cellularity can be estimated. Here is an example using the ASCAT ASCN estimation, from a RDS generated by the Segment.ASCAT() function :
ASCN.ff(RDS.file = "/home/me/my_project/EaCoN_results/SAMPLE1/ASCAT/L2R/SAMPLE1.ASCAT.RDS")
To perform the same using the FACETS or SEQUENZA estimator, just use a RDS generated with Segment.FACETS() or Segment.SEQUENZA(), respectively (or their ".ff" equivalent).
Finally, an annotated HTML report can be rendered with :
Annotate.ff(RDS.file = "/home/project/EaCoN_results/S1/ASCAT/L2R/S1.EaCoN.ASPCF.RDS", author.name = "Me!")
All the steps described above in single sample mode can be run in batch mode, that is for multiple samples, possibly combined with multithreading to process multiple samples in parallel. It simply consists into using different functions with the same name but an added ".Batch" suffix. Those are just wrappers to the single-sample version of the functions.
The OS.Process.Batch function replaces the ATChannelCel, GCChannelCel and samplename parameters by the pairs.file parameters, which consists in a tab-separated file with made of three columns with a header, and one sample per line :
By default, the function will run all samples one by one, but multithreading can be set using the nthread parameter with a value greater than 1. Beware not setting a value higher than the current number of available threads on your machine ! Please also remember that each new thread will use its own amount of RAM...
Here is a synthetic example with 4 samples :
ATChannelCel | GCChannelCEL | SampleName |
---|---|---|
/home/project/CEL/S1_OncoScan_CNV_A.CEL | /home/project/CEL/S1_OncoScan_CNV_C.CEL | S1_OS |
/home/project/CEL/S5_OncoScan_CNV_A.CEL | /home/project/CEL/S5_OncoScan_CNV_C.CEL | S5_OS |
/home/project/CEL/S6_OncoScan_CNV_A.CEL | /home/project/CEL/S6_OncoScan_CNV_C.CEL | S6_OS |
/home/project/CEL/S7_OncoScan_CNV_A.CEL | /home/project/CEL/S7_OncoScan_CNV_C.CEL | S7_OS |
The command line (using 2 threads)
OS.Process.Batch(pairs.file = "/home/project/CEL/OS_pairs.txt", nthread = 2)
Same principle, but this time we have one column less and header changes a bit :
Here is a synthetic example with 4 samples :
CEL | SampleName |
---|---|
/home/project/CEL/S8_CytoScanHD.CEL | S8_CSHD |
/home/project/CEL/S9_CytoScanHD.CEL | S9_CSHD |
/home/project/CEL/S10_CytoScanHD.CEL | S10_CSHD |
/home/project/CEL/S11_CytoScanHD.CEL | S11_CSHD |
The command line (using 2 threads)
CS.Process.Batch(pairs.file = "/home/project/CEL/CSHD_list.txt", nthread = 2)
Identical to CytoScan 750k / HD, but the function is named SNP6.Process.Batch.
Still the same principle with an external list file, with column names :
Here is a synthetic example with 4 samples :
testBAM | refBAM | SampleName |
---|---|---|
/home/project/WES/S4_WES_hg19_Tumor.BAM | /home/project/WES/S4_WES_hg19_Normal.BAM | S4_WES |
/home/project/WES/S12_WES_hg19_Tumor.BAM | /home/project/WES/S12_WES_hg19_Normal.BAM | S12_WES |
/home/project/WES/S13_WES_hg19_Tumor.BAM | /home/project/WES/S13_WES_hg19_Normal.BAM | S13_WES |
/home/project/WES/S14_WES_hg19_Tumor.BAM | /home/project/WES/S14_WES_hg19_Normal.BAM | S14_WES |
Binning, then normalization command lines (using 2 threads)
WES.Bin.Batch(BAM.list.file = "/home/project/WES/BAM_list.txt", BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", nthread = 2)
WES.Normalize.ff.Batch(BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", nthread = 2)
Note that here we did not specify any RDS or list file to WES.Normalize.ff.Batch. This is because this fonction needs as its first argument BIN.RDS.files, a list of "_binned.RDS" files (generated at the former command line), and by default it will recursively search downwards the current working directory for any of these RDS files. You can of course design your own list of RDS files to process, if you know a bit of R.
As for the WES.Normalize.ff.Batch function, the Segment.ff.Batch function needs as its first argument RDS.files, a list of "_processed.RDS" files (generated at the raw data processing step). Likewise, it will by default recursively search downwards for any compatible RDS file.
Here is a synthetic example that will segment our CytoScan HD samples (as defined by the pattern below) using ASCAT :
Segment.ff.Batch(RDS.files = list.files(path = getwd(), pattern = ".*_processed.RDS$", full.names = TRUE, recursive = TRUE), segmenter = "ASCAT", smooth.k = 5, SER.pen = 20, nrf = 1.0, nthread = 2)
To perform the same using the FACETS segmenter, just change the value of the segmenter parameter.
I suppose you guessed how to do the same with SEQUENZA, right ? ;)
Still the same, with the ASCN.ff.Batch :
ASCN.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE), nthread = 2)
And here again with the Annotate.ff.Batch :
Annotate.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE), author.name = "Me!")
EaCoN has been implemented in such a way that one can also opt to launch the full workflow in a single command line for a single sample, using pipes from the magrittr package. However, this is not recommended as a default use : even though EaCoN is provided with recommendations that should fit most cases, users may have to deal with particular profiles requiring parameter tweaking, which is not possible in piped mode... Here is an example using ASCAT :
samplename <- "SAMPLE1_OS"
workdir <- "/home/me/my_project/EaCoN_results"
setwd(workdir)
require(EaCoN)
require(magrittr)
OS.Process(ATChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_A.CEL", GCChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_C.CEL", samplename = samplename, return.data = TRUE) %>% Segment(out.dir = paste0(workdir, "/", samplename), segmenter = "ASCAT", return.data = TRUE) %T>% Annotate(out.dir = paste0(workdir, "/", samplename, "/ASCAT/L2R")) %>% ASCN.ASCAT(out.dir = paste0(workdir, "/", samplename))
SOURCE | SER.pen | smooth.k | nrf | BAF.filter |
---|---|---|---|---|
OncoScan | 40 (default) |
NULL (default) |
0.5 (default) |
0.9 |
CytoScan HD | 20 |
5 |
1.0 |
0.75 (default) |
SNP6 | 60 |
5 |
0.25 |
0.75 (default) |
WES | 2 to 10 |
5 |
0.5 (default) to 1 |
0.75 (default) |
The FACETS segmenter cannot currently be used on SNP6 data (due to missing normalized A and B signals).
The SEQUENZA segmenter SHOULD NOT be used with SNP6 microarrays (it theoretically can, but requires huge amounts of RAM, ie more than 32 GB). This may halt / swap your computer !
For WES data, any sorted BAM should work, but we recommend using BAMs for which duplicates were marked/removed (samtools markdup, Picard MarkDuplicates, etc...), for higher quality results.