gustaveroussy / EaCoN

Easy Copy Number !
MIT License

EaCoN


DESCRIPTION

EaCoN aims to be an all-in-one, user-friendly solution for relative or absolute copy-number analysis of data from multiple sources, with three different segmenters available (each with its corresponding copy-number modeling method). It consists of a series of R packages that perform this type of analysis, starting either from raw CEL files of Affymetrix microarrays (GenomeWide SNP6, OncoScan, CytoScan 750K, CytoScan HD) or from aligned reads in BAM format for WES (whole-exome sequencing).


FEATURES


NOTES

QUICK NEWS

2021-10-18 : v0.3.6-2 (CloudyMonday2) is out !

2021-06-18 : v0.3.6-1 (SweetSummerSweat) is out !

2021-05-23 : v0.3.6 (Barolo) is out !

2020-08-17 : v0.3.5 (CloudyMonday) is out !

2018-12-10 : v0.3.4-1 (PostRoscovite) is out !

2018-10-30 : v0.3.4 (Papy60) is out !

2018-10-02 : v0.3.3-1 (LittleWomanNoCry) is out !

2018-09-12 : v0.3.3 (Trinity) is out !

2018-08-08 : v0.3.2 (PapeMamiePichine) is out !


INSTALLATION

CORE

MICROARRAY-SPECIFIC

While the current EaCoN package is the core of the process and works out of the box for WES data, several other packages are needed to properly handle Affymetrix microarrays : APT (Affymetrix Power Tools), array designs and their corresponding annotations (genome build, Affymetrix annotation databases). Others are required for the (re)normalization, especially pre-computed GC% or Wavetracks.

ALL AFFYMETRIX MICROARRAYS

ONCOSCAN FAMILY (OncoScan / OncoScan_CNV)

CYTOSCAN FAMILY (CytoScan 750k / CytoScan HD)

GENOMEWIDE SNP6

GENOMES


INPUT


USAGE

The full workflow is decomposed into a few different functions, which roughly correspond to these steps :

normalization -> segmentation +-> reporting
                              |
                              +-> copy-number estimation

EaCoN offers several ways of running the full workflow. For a single sample, you can either run each step independently, writing and then loading the intermediate results, or pipe all steps in a single line of code. The step-by-step approach can also be applied to multiple samples in a row using a batch mode, optionally processing several samples at the same time through multithreading.

Step by step mode

First, in R, load EaCoN and choose a directory for writing results, for example : /home/project/EaCoN_results

  require(EaCoN)
  setwd("/home/project/EaCoN_results")

Raw data processing

Affymetrix OncoScan / OncoScan_CNV
Affymetrix CytoScan 750k / CytoScan HD
Affymetrix GenomeWide SNP6
WES data

L2R & BAF Segmentation

Copy-number estimation

HTML reporting

Batch mode (with multithreading)

All the steps described above for single-sample mode can be run in batch mode, that is, on multiple samples, optionally combined with multithreading to process several samples in parallel. This simply means using functions with the same name plus a ".Batch" suffix. These are just wrappers around the single-sample versions of the functions.

Raw data processing

Affymetrix OncoScan / OncoScan_CNV

The OS.Process.Batch function replaces the ATChannelCel, GCChannelCel and samplename parameters with the pairs.file parameter, which points to a tab-separated file made of three columns with a header, and one sample per line (see the synthetic example below).

By default, the function processes all samples one by one, but multithreading can be enabled by setting the nthread parameter to a value greater than 1. Be careful not to set a value higher than the number of threads currently available on your machine ! Please also remember that each additional thread uses its own share of RAM.
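One cautious way to pick a value for nthread is to cap the requested number at what R can detect on the machine. The sketch below relies on parallel::detectCores() from base R. The helper name safe_nthread, the choice to leave one core free and the file name "pairs.txt" are assumptions made for illustration, not part of EaCoN.

```r
library(parallel)

# Hypothetical helper (not part of EaCoN): cap a requested thread count
# at the number of detected cores minus one, never going below 1.
safe_nthread <- function(requested, available = parallel::detectCores()) {
  # detectCores() may return NA on some platforms: fall back to one thread.
  if (is.na(available)) available <- 1L
  max(1L, min(as.integer(requested), as.integer(available) - 1L))
}

nthread <- safe_nthread(4)

# The value could then be handed to the batch function, for example :
# OS.Process.Batch(pairs.file = "pairs.txt", nthread = nthread)
```

This only guards against oversubscribing cores. Memory still has to be checked separately, since each thread holds its own data.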

Here is a synthetic example with 4 samples :

ATChannelCel GCChannelCEL SampleName
/home/project/CEL/S1_OncoScan_CNV_A.CEL /home/project/CEL/S1_OncoScan_CNV_C.CEL S1_OS
/home/project/CEL/S5_OncoScan_CNV_A.CEL /home/project/CEL/S5_OncoScan_CNV_C.CEL S5_OS
/home/project/CEL/S6_OncoScan_CNV_A.CEL /home/project/CEL/S6_OncoScan_CNV_C.CEL S6_OS
/home/project/CEL/S7_OncoScan_CNV_A.CEL /home/project/CEL/S7_OncoScan_CNV_C.CEL S7_OS

Affymetrix CytoScan 750k / CytoScan HD

Same principle, but this time the file has one column fewer and the header changes slightly.

Here is a synthetic example with 4 samples :

CEL SampleName
/home/project/CEL/S8_CytoScanHD.CEL S8_CSHD
/home/project/CEL/S9_CytoScanHD.CEL S9_CSHD
/home/project/CEL/S10_CytoScanHD.CEL S10_CSHD
/home/project/CEL/S11_CytoScanHD.CEL S11_CSHD

Affymetrix GenomeWide SNP6

Identical to CytoScan 750k / HD, but the function is named SNP6.Process.Batch.
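The two-column list file shown above for CytoScan arrays (and reused for SNP6) can be written from R rather than by hand. The following is a minimal sketch using base R only. The rule that turns a file name into a sample name and the output file name are assumptions chosen to reproduce the synthetic example.

```r
# CEL paths taken from the synthetic example; in practice they could come
# from list.files(<CEL directory>, pattern = "\\.CEL$", full.names = TRUE).
cel_files <- c("/home/project/CEL/S8_CytoScanHD.CEL",
               "/home/project/CEL/S9_CytoScanHD.CEL")

# Assumed naming rule: "S8_CytoScanHD.CEL" gives the sample name "S8_CSHD".
sample_ids <- sub("_CytoScanHD\\.CEL$", "", basename(cel_files))
cel_list <- data.frame(CEL = cel_files,
                       SampleName = paste0(sample_ids, "_CSHD"),
                       stringsAsFactors = FALSE)

# Tab-separated, with a header, without quotes or row names.
list_file <- file.path(tempdir(), "CSHD_list.txt")
write.table(cel_list, file = list_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

The name of the batch function argument that receives this file is not given here, so it should be checked in the package help before use.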

WES data

Still the same principle, with an external list file that has three named columns.

Here is a synthetic example with 4 samples :

testBAM refBAM SampleName
/home/project/WES/S4_WES_hg19_Tumor.BAM /home/project/WES/S4_WES_hg19_Normal.BAM S4_WES
/home/project/WES/S12_WES_hg19_Tumor.BAM /home/project/WES/S12_WES_hg19_Normal.BAM S12_WES
/home/project/WES/S13_WES_hg19_Tumor.BAM /home/project/WES/S13_WES_hg19_Normal.BAM S13_WES
/home/project/WES/S14_WES_hg19_Tumor.BAM /home/project/WES/S14_WES_hg19_Normal.BAM S14_WES
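Because a batch run can take a long time, it may be worth checking a list file before launching it. The sketch below is a hypothetical helper written in base R (it is not part of EaCoN). It assumes a tab-separated file with the testBAM, refBAM and SampleName columns of the example above, and reports duplicated sample names and paths that do not exist.

```r
# Hypothetical helper (not part of EaCoN): return a character vector of
# problems found in a list file; an empty vector means nothing was detected.
check_list_file <- function(path, path_columns = c("testBAM", "refBAM")) {
  tab <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

  # Stop early if an expected column is absent.
  missing_cols <- setdiff(c(path_columns, "SampleName"), colnames(tab))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }

  problems <- character(0)

  # Each sample name should appear only once.
  dup <- unique(tab$SampleName[duplicated(tab$SampleName)])
  if (length(dup) > 0) {
    problems <- c(problems, paste("duplicated sample name:", dup))
  }

  # Every listed file should exist on disk.
  for (col in path_columns) {
    absent <- tab[[col]][!file.exists(tab[[col]])]
    if (length(absent) > 0) {
      problems <- c(problems, paste("file not found:", absent))
    }
  }
  problems
}
```

A typical use would be to call check_list_file() on the list file and stop with the collected messages if the returned vector is not empty.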

Note that here we did not specify any RDS or list file to WES.Normalize.ff.Batch. This is because this function takes as its first argument BIN.RDS.files, a list of "_binned.RDS" files (generated by the previous command), and by default it recursively searches the current working directory and its subdirectories for any such RDS files. You can of course build your own list of RDS files to process, if you know a bit of R.
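Building such a custom list mostly comes down to list.files() plus a filter. The sketch below creates empty placeholder files in a temporary directory so that it can run anywhere. The layout (one sub-directory per sample, each holding a "<sample>_binned.RDS" file) is an assumption for illustration, and the placeholders are not real EaCoN outputs.

```r
# Assumed layout: one sub-directory per sample under a results directory.
work_dir <- file.path(tempdir(), "EaCoN_results_demo")
for (s in c("S4_WES", "S12_WES", "S13_WES")) {
  dir.create(file.path(work_dir, s), recursive = TRUE, showWarnings = FALSE)
  # Empty placeholder standing in for a real "_binned.RDS" file.
  file.create(file.path(work_dir, s, paste0(s, "_binned.RDS")))
}

# Recursive search, similar in spirit to the default behaviour described above.
all_binned <- list.files(path = work_dir, pattern = "_binned\\.RDS$",
                         full.names = TRUE, recursive = TRUE)

# Keep only the samples of interest, identified by their directory name.
wanted <- c("S4_WES", "S13_WES")
my_binned <- all_binned[basename(dirname(all_binned)) %in% wanted]

# The selection could then be passed explicitly, for example :
# WES.Normalize.ff.Batch(BIN.RDS.files = my_binned)
```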

L2R & BAF Segmentation

As with the WES.Normalize.ff.Batch function, the Segment.ff.Batch function takes as its first argument RDS.files, a list of "_processed.RDS" files (generated at the raw data processing step). Likewise, by default it recursively searches the working directory and its subdirectories for any compatible RDS file.

Here is a synthetic example that will segment our CytoScan HD samples using ASCAT (the pattern below selects every "_processed.RDS" file under the working directory) :

  Segment.ff.Batch(RDS.files = list.files(path = getwd(), pattern = ".*_processed.RDS$", full.names = TRUE, recursive = TRUE), segmenter = "ASCAT", smooth.k = 5, SER.pen = 20, nrf = 1.0, nthread = 2)
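When several array types share the working directory, a narrower regular expression can restrict the selection to one family. The sketch below compares two patterns with grepl() on made-up file names. It assumes that file names contain the "_CSHD" sample suffix used in the list file example, which may not match the actual naming of EaCoN outputs.

```r
# Made-up file names, for illustration only.
candidates <- c("S8_CSHD_processed.RDS",
                "S1_OS_processed.RDS",
                "S8_CSHD_processed_RDS")

# Pattern used in the call above: the unescaped dot matches any character.
all_pattern <- ".*_processed.RDS$"

# Narrower pattern: requires "_CSHD" and a literal dot before "RDS".
cshd_pattern <- "_CSHD.*_processed\\.RDS$"

matches_all  <- grepl(all_pattern, candidates)
matches_cshd <- grepl(cshd_pattern, candidates)

# The narrower pattern could replace the original one, for example :
# list.files(path = getwd(), pattern = cshd_pattern,
#            full.names = TRUE, recursive = TRUE)
```

Here the first pattern accepts all three names, including the one ending in "_RDS", while the second keeps only "S8_CSHD_processed.RDS".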

Copy-number estimation

Same approach, with ASCN.ff.Batch :

  ASCN.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE), nthread = 2)

HTML reporting

And again, with Annotate.ff.Batch :

  Annotate.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE), author.name = "Me!")

Piped

EaCoN has been implemented so that the full workflow can also be launched in a single command for a single sample, using pipes from the magrittr package. However, this is not recommended as the default : even though EaCoN comes with recommendations that should fit most cases, some profiles require parameter tweaking, which is not possible in piped mode. Here is an example using ASCAT :

  samplename <- "SAMPLE1_OS"
  workdir <- "/home/me/my_project/EaCoN_results"
  setwd(workdir)
  require(EaCoN)
  require(magrittr)

  OS.Process(ATChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_A.CEL",
             GCChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_C.CEL",
             samplename = samplename, return.data = TRUE) %>%
    Segment(out.dir = paste0(workdir, "/", samplename), segmenter = "ASCAT", return.data = TRUE) %T>%
    Annotate(out.dir = paste0(workdir, "/", samplename, "/ASCAT/L2R")) %>%
    ASCN.ASCAT(out.dir = paste0(workdir, "/", samplename))
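The pipeline above mixes two magrittr operators. %>% passes the result of its left-hand side to the next call, whereas %T>% (the tee pipe) runs the next call for its side effect and passes the left-hand value on unchanged. This is why Annotate can write its report while ASCN.ASCAT still receives the output of Segment. The toy sketch below illustrates this with stand-in functions, none of which come from EaCoN.

```r
library(magrittr)

# Stand-in for a reporting step: called for its side effect only.
report_log <- character(0)
write_report <- function(x) {
  report_log <<- c(report_log, paste("report on", length(x), "values"))
  invisible(NULL)
}

# sqrt() plays the role of Segment, write_report() the role of Annotate,
# and sum() the role of ASCN.ASCAT.
result <- c(1, 4, 9) %>% sqrt() %T>% write_report() %>% sum()

# sum() received the output of sqrt(), not the NULL returned by write_report().
```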

Conclusion on usage


GUIDELINES

Segmentation

SOURCE        SER.pen        smooth.k         nrf                  BAF.filter
OncoScan      40 (default)   NULL (default)   0.5 (default)        0.9
CytoScan HD   20             5                1.0                  0.75 (default)
SNP6          60             5                0.25                 0.75 (default)
WES           2 to 10        5                0.5 (default) to 1   0.75 (default)
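For scripted runs, the recommendations above can be kept in a small lookup structure. The sketch below transcribes the three rows that give single values. WES is left out because its SER.pen and nrf entries are ranges that have to be chosen per dataset. The helper name segment_settings is hypothetical, and whether a given EaCoN function accepts all four parameters should be checked in the package help.

```r
# Recommended settings transcribed from the table above.
guidelines <- list(
  "OncoScan"    = list(SER.pen = 40, smooth.k = NULL, nrf = 0.5,  BAF.filter = 0.9),
  "CytoScan HD" = list(SER.pen = 20, smooth.k = 5,    nrf = 1.0,  BAF.filter = 0.75),
  "SNP6"        = list(SER.pen = 60, smooth.k = 5,    nrf = 0.25, BAF.filter = 0.75)
)

# Hypothetical helper: fetch the settings for one source, failing loudly
# when the source has no single-valued recommendation.
segment_settings <- function(source) {
  if (!source %in% names(guidelines)) {
    stop("No guideline for source: ", source)
  }
  guidelines[[source]]
}

settings <- segment_settings("CytoScan HD")

# The settings could then be spliced into a call, for example :
# do.call(Segment.ff.Batch, c(list(segmenter = "ASCAT"), settings))
```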

NOTES


AUTHORS & CONTACT