kunstner / freec2gistic

Workflow: Convert CONTROL-freec output to GISTIC2
6 stars 3 forks source link

freec2gistic

Workflow: Convert CONTROL-freec output to GISTIC2

Aim

The presented workflow converts the output generated by CONTROL-freec to GISTIC2 to be able to find high confident somatic copy-number alterations.

External guidelines

I found this resource which was useful to get the correct input format for GISTIC2

https://www.biostars.org/p/426635/

Input

From CONTROL-freec the following two files are needed: PREFIX _ratio.txt and PREFEX sample.cpn

Additionally, CONTROL-freec provides a script to convert output files to BED format (freec2bed.pl; https://github.com/BoevaLab/FREEC/tree/master/scripts/freec2bed.pl)

The ratio file contains information about ratios and predicted copy number alterations for each window. The cpn file contains information about the number of reads per called window.

Furthermore, a file with chromosome name and length is needed (e.g. _refhg19.genome).

Convert the ratio file to BED

Run the freec2bed.pl script on the ratio file, e.g. freec2bed -f Lx002031.bam_ratio.txt > Lx002031.freec_segments.bed

If cohorts are analyzed, the following lines of code loop through a directory (assumption: all files are stored in the same directory)

INPUTRATIO=/some/path/to/freec/output/files

for file in ls $INPUTRATIO*bam_ratio.txt; do
    echo $file
    OUTFILE=$(echo $file | sed -e "s/bam_ratio.txt/freec_segments.bed/g")
    freec2bed.pl -f $file > $OUTFILE
done

Join BED files in R

Files are joined and the ratio is transformed applying log2(ratio)-1 (according to https://www.biostars.org/p/426635/)

library(tidyverse)
library(GenomicRanges)

# list of files with pattern matching
files <- list.files(path = "../data/lymphome_WXS/cnvs", pattern="freec_segments.bed", full.names=TRUE)

segData <- NULL
for( i in files ) {
    print(i)
    tab <- read.table(i)
    id <- gsub(pattern = "../data/lymphome_WXS/cnvs.", replacement = "", i)
    id <- gsub(pattern = ".freec_segments.bed", replacement = "", id)
    tab$Sample <- id
    tab$Dummy <- 'NA'
    tab <- tab[, c("Sample", "V1", "V2", "V3", "Dummy", "V4")]
    # convert ratio to log2(ratio)-1
    tab$V4 <- log2(tab$V4) - 1
    segData <- rbind(segData, tab)
}

Next, chromosomes X and Y are removed and columns are renamed.

segData <- segData %>% 
    dplyr::filter( V1 != 'chrX' & V1 != 'chrY') %>% 
    dplyr::rename( Chromosome = V1, Start = V2, End = V3, log2ratio = V4 )

segData$Chromosome <- gsub(pattern = "chr", replacement = "", segData$Chromosome)

Add coverage information and merge ratio and coverage

Retrieve coverage information:

files <- list.files(path = "../data/lymphome_WXS/cnvs", pattern="bam_sample.cpn", full.names=TRUE)
cpnData <- NULL
for( i in files ) {
    print(i)
    tab <- read.table(i)
    id <- gsub(pattern = "../data/lymphome_WXS/cnvs.", replacement = "", i)
    id <- gsub(pattern = ".bam_sample.cpn", replacement = "", id)
    tab$Sample <- id
    tab <- tab[, c("Sample", "V1", "V2", "V3", "V4", "V5")]
    cpnData <- rbind(cpnData, tab)
}

colnames(cpnData) <- c("Sample", "Chromosome", "Start", "End", "Number", "Range")

Combine coverage and ratios

segData_num <- NULL
for ( i in unique( cpnData$Sample ) ) {
    print(i)
    cpn.s <- cpnData %>% dplyr::filter( Sample == i ) %>% dplyr::select( -Sample ) %>% GRanges
    seg.s <- segData %>% dplyr::filter( Sample == i ) %>% dplyr::select( -Sample ) %>% dplyr::rename(Num_Probes=Dummy) %>% GRanges

    # sum features per interval
    for ( j in 1:nrow( data.frame(seg.s) ) ) {
        Num_Probes <- sum( data.frame( cpn.s[ c(data.frame( findOverlaps(seg.s[j,], cpn.s) )$subjectHits), ] )$Number )
        mcols( seg.s[j, ] )$Num_Probes <- Num_Probes
    }

    seg.s <- data.frame(seg.s)
    seg.s$Sample <- i
    segData_num <- rbind(segData_num, seg.s)

}

segData_num <- segData_num %>% 
    dplyr::select( Sample, seqnames, start, end, Num_Probes, log2ratio)

Finally, save all files

write.table(x = segData, file = "gistic.segments.txt", sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = segData_num, file = "gistic.segments.numProbes.txt", sep = "\t", row.names = F, col.names = F, quote = F)

If necessary, common CNAs can be removed as described below. Here, a panel of common mutations for hg19 was retrieved from a Broad Institute panel of normals (ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/)

common.cnvs <- GRanges( read.delim( file = "../data/CNV.hg19.bypos.111213.txt" ) )
# get index of all CNVs without a hit in common.cnv
#subsetByOverlaps( GRanges(cnvdata[4, 1:4]), common.cnvs ) 
idx <- which( countOverlaps( GRanges(segData[, -1]), common.cnvs ) == 0 )
segData <- segData[idx, ]
rownames(segData) <- NULL

Prepare input for GISTIC2

The second line in the loop before can be skipped by adjusting the R code above. Anyhow, it does not hurt.

INPUT_FREEC=gistic.segments.numProbes.txt
REF_IDX=ref_hg19.genome
WORK_DIR=gistic_input

mkdir -p $WORK_DIR

for sample in $(cut -f1 $INPUT_FREEC | uniq); do
    echo $sample
    grep "$sample\t" $INPUT_FREEC |cut -f2-6|sed 's/^/chr/' > $WORK_DIR/$sample.bed
    bedtools sort -i $WORK_DIR/$sample.bed -faidx $REF_IDX > $WORK_DIR/$sample.sorted.bed
    bedtools complement -i $WORK_DIR/$sample.sorted.bed -g $REF_IDX > $WORK_DIR/$sample.complement.bed
    awk '{print $0, "\tNA\t0"}' $WORK_DIR/$sample.complement.bed > $WORK_DIR/$sample.CN2.bed
    cat $WORK_DIR/$sample.sorted.bed $WORK_DIR/$sample.CN2.bed > $WORK_DIR/$sample.whole.bed
    bedtools sort -i $WORK_DIR/$sample.whole.bed -faidx $REF_IDX > $WORK_DIR/$sample.whole.sorted.bed
    awk -v sample=$sample '{print sample"\t" $0}' $WORK_DIR/$sample.whole.sorted.bed > $WORK_DIR/$sample.whole.sorted.seg
done

cd $WORK_DIR
find . -name "*.whole.sorted.seg" -exec cat {} \; > ../samples.seg
cd ..

The file samples.seg is the final input for GISTIC2. Basically, we are done here

Running GISTIC2

GISTIC2 uses a couple of MATLAB scripts. A docker image and install instructions are provided here

https://hub.docker.com/r/shixiangwang/gistic

Basically, just run docker pull shixiangwang/gistic to install the image.

To run the image use

sudo docker run -it shixiangwang/gistic /bin/bash

change the working directory and create the input and ouput directory

cd /opt/GISTIC/
mkdir indat
mkdir gistic_out

We have to transfer our samples.seg file to the docker image. Therefor, we have to retrieve the container id for the running image and copy the file (here, the ID is 16d57225d045 but it depends on your local environment).

docker container ls
docker cp samples.seg 16d57225d045:/opt/GISTIC/indat/

Now, we can run GISTIC2

./gistic2 -b gistic_out -seg indat/samples.seg -refgene refgenefiles/hg19.mat

and copy the result files (from outside the docker image)

docker cp 16d57225d045:/opt/GISTIC/gistic_out ./

DONE