vborjesson / TC_hunter

Locate transgenic insertion sites
MIT License

TC_hunter

TC-hunter identifies transgenic insertion sites within a host genome

TC-hunter searches for transgenic insertion sites in a host genome and returns figures and a report to support these findings.

There are two programs: TC_hunter and TC_hunter_BWA.

TC_hunter_BWA accepts raw paired-end fastq files (from one or several samples) as input and performs BWA MEM alignment before searching for transgenic insertion sites.

TC_hunter accepts one or several aligned BAM files (mapped to both host and transgenic sequence) as input. TC-hunter then identifies anchors and chimeric reads that map to both the host and the transgenic sequence.

Install TC-hunter

Clone the repository from GitHub and add it to your PATH (or give the full path to it in the config file).

$ git clone https://github.com/vborjesson/TC_hunter.git
$ export PATH="/home/yourPath/TC_hunter":$PATH

Software Dependencies

To run TC_hunter, you need several programs installed. There are three ways to install them:

  1. Install the required programs and tools using the conda yml file (preferred). This has been tested on Anaconda3, Anaconda2 and Miniconda2.

    $ conda env create --file TC_hunter/Scripts/TC_hunter.yml
    $ source activate TC_hunter_v1.0
  2. Create your own conda environment

    $ conda create -n TC_hunter R=3.5
    $ source activate TC_hunter
    $ conda install -c bioconda samtools=1.10
    $ conda install -c bioconda nextflow=19.01.0
    $ conda install -c bioconda bwa   # only if running TC_hunter_BWA
    $ conda install -c anaconda pandas
    $ conda install -c conda-forge r-circlize
    $ conda install -c r r-dplyr
    $ conda install -c r r-data.table
  3. Download manually

Software

R 3.5 or higher
python 2.7
samtools 1.10 (works on other versions as well)
nextflow 19.01.0
bwa 0.7

R packages

circlize
dplyr
data.table
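
Whichever install option you choose, a quick check that the tools are reachable can save a failed run. The helper below is a minimal sketch and not part of TC_hunter: it assumes the executables are named R, python, samtools, nextflow and bwa on your system, and it only checks that they are on your PATH, not their versions.

```python
import shutil

# Executable names are assumptions based on the dependency list above.
REQUIRED_TOOLS = ["R", "python", "samtools", "nextflow"]
OPTIONAL_TOOLS = ["bwa"]  # only needed for TC_hunter_BWA


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print("missing required tool:", tool)
    for tool in missing_tools(OPTIONAL_TOOLS):
        print("missing optional tool (needed for TC_hunter_BWA):", tool)
```

Run it inside the activated conda environment; no output means every listed tool was found.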

Run TC_hunter with test data (takes approximately 1 minute to run)

Download data

mkdir test_run 
cd test_run
pip install gdown # If you don't already have it installed
gdown "https://drive.google.com/drive/folders/1Y-iCNo71OVmf3QqJeFrukxUlQGDojSKx?usp=sharing"
cp ../TC_hunter/Test_data/* .

Then run TC_hunter:

nextflow ../TC_hunter/TC_hunter.nf -c testrun.config --workingDir <realpath_to_test_run_dir> --tc_hunter_path <realpath_to_tchunter>

You should see TC_hunter running each process, one after another:

  1. samtools_index
  2. create_links_sup
  3. create_links_soft
  4. create_karyotype
  5. create_histogram
  6. create_plots
  7. create_html

When it's done, check that you have an output_summary.html file.

Create construct.txt file (required)

To generate figures with construct information, you need to supply that information. Create a txt file with one gene per line and the fields separated by spaces. The fields are: 1) name, 2) start position and 3) end position.

e.g.

Amp 1 500   
lyz 1000 1200       
Gene3 2000 5000           
Gene4 7000 7700             
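
A malformed construct.txt is easy to catch before running the pipeline. The function below is a sketch, not TC_hunter's own parser: the name parse_construct is hypothetical, and it assumes positions are 1-based integers with start not greater than end (as in the example above). Passing construct_length is optional and adds a bounds check.

```python
def parse_construct(text, construct_length=None):
    """Parse construct.txt content into a list of (name, start, end) tuples.

    Raises ValueError on the first malformed line.
    """
    features = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()  # any run of whitespace separates fields
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 fields, got %d" % (number, len(fields)))
        name = fields[0]
        start, end = int(fields[1]), int(fields[2])  # int() raises ValueError if not numeric
        # Assumption: coordinates are 1-based and start <= end.
        if start < 1 or end < start:
            raise ValueError("line %d: invalid range %d-%d" % (number, start, end))
        if construct_length is not None and end > construct_length:
            raise ValueError("line %d: end %d exceeds construct length" % (number, end))
        features.append((name, start, end))
    return features
```

For example, `parse_construct(open("construct.txt").read(), construct_length=8000)` returns the parsed features or raises with the offending line number.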

Make Configuration file

Create a configuration file from template.

$ cp TC_hunter/template/TC_hunter.config /path/to/WorkingDir 

Add required information to config file

TC_hunter.nf

| Argument | Usage | Description |
| --- | --- | --- |
| WorkingDir | <Path/WorkingDir> | Path to your working directory (this is where the output html and figures will be written) |
| TC_hunter_path | <Path/TC_hunter> | Path to TC_hunter; just TC_hunter if it's in your $PATH |
| Construct_file | <Path/construct.txt> | Path to your construct.txt file (see Create construct.txt file above) |
| Construct_length | | The length of your construct, as a number |
| Construct_name | | The name of the construct; must match the name in the reference file, no spaces |
| bam | | Path to the directory containing your bam file or (if several samples) bam files |
| Reference | | Path to the merged reference file containing both the host and the construct sequence, e.g. created with: cat host_ref construct_ref > Jointref.fa |

e.g. example.config
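
Because Construct_name has to match the name in the reference file, it can help to read both Construct_name and Construct_length straight from the construct FASTA instead of typing them by hand. The sketch below is not part of TC_hunter; it assumes the construct reference holds exactly one record and takes the first word of the header line as the name, which is how aligners such as BWA name sequences. The function names are hypothetical.

```python
def fasta_records(text):
    """Yield (name, sequence) for each record in FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The record name is the first whitespace-delimited word after '>'.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def construct_settings(construct_fasta_text):
    """Return the config values implied by a single-record construct FASTA."""
    records = list(fasta_records(construct_fasta_text))
    if len(records) != 1:  # assumption: one construct per file
        raise ValueError("expected exactly 1 FASTA record, found %d" % len(records))
    name, sequence = records[0]
    return {"Construct_name": name, "Construct_length": len(sequence)}
```

Calling `construct_settings(open("construct_ref").read())` gives the two values to copy into the config file.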

TC_hunter_BWA.nf

| Argument | Usage | Description |
| --- | --- | --- |
| WorkingDir | <Path/WorkingDir> | Path to your working directory (this is where the output html and figures will be written) |
| TC_hunter_path | <Path/TC_hunter> | Path to TC_hunter; just TC_hunter if it's in your $PATH |
| Construct_file | <Path/construct.txt> | Path to your construct.txt file (see Create construct.txt file above) |
| Construct_length | | Length, as a number, of the construct that will be plotted |
| Construct_name | | Name of the construct; must match the name in your reference file |
| sample | | Path to the directory containing the fastq files (the file names must include 'R1' and 'R2') |
| folder | | Path to a directory containing one directory per sample. The sample names will be the same as the directory names |
| host_ref | | Path to the host reference file |
| construct_ref | | Path to the construct reference file |

e.g. example.config
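
The folder layout expected by TC_hunter_BWA (one directory per sample, fastq files marked 'R1' and 'R2') can be checked before starting a long alignment. This is a sketch under assumptions, not TC_hunter code: it assumes each sample directory holds exactly one file whose name contains "R1" and one whose name contains "R2", and find_fastq_pairs is a hypothetical name.

```python
import os


def find_fastq_pairs(folder):
    """Map each sample directory name to its (R1 path, R2 path) tuple."""
    pairs = {}
    for sample in sorted(os.listdir(folder)):
        sample_dir = os.path.join(folder, sample)
        if not os.path.isdir(sample_dir):
            continue  # only directories count as samples
        files = sorted(os.listdir(sample_dir))
        r1 = [f for f in files if "R1" in f]
        r2 = [f for f in files if "R2" in f]
        # Assumption: exactly one R1 file and one R2 file per sample.
        if len(r1) != 1 or len(r2) != 1:
            raise ValueError(
                "sample %s: found %d R1 and %d R2 files" % (sample, len(r1), len(r2))
            )
        pairs[sample] = (
            os.path.join(sample_dir, r1[0]),
            os.path.join(sample_dir, r2[0]),
        )
    return pairs
```

The keys of the returned dictionary are the sample names that should appear in the report.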

Run TC_hunter.nf

Before running, make sure you have a config file with all required information (see "Make Configuration file").

$ nextflow TC_hunter.nf -c <file.config> [-with-report <report name>]

Run TC_hunter_BWA.nf

Before running, make sure you have a config file with all required information (see "Make Configuration file").

$ nextflow TC_hunter_BWA.nf -c <file.config> [-with-report <report name>]

Run IGV separately

To get the IGV figures you need a GUI available. If you don't have one, you can run IGV separately once TC-hunter has finished. Run one .bat file for each sample.

$ igv.sh -b <sample_name.bat>

Understand your output

TC-hunter finds insertion sites based on chimeric reads and discordant read pairs.
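
To make the two kinds of evidence concrete, the sketch below labels a single SAM alignment line. It is an illustration of the general definitions, not TC_hunter's implementation: it treats every reference sequence other than the construct as host, calls a read "discordant" when its mate maps to the other side (RNEXT field), and calls it "chimeric" when its SA tag reports a supplementary alignment on the other side. The function name is hypothetical.

```python
def classify_sam_line(line, construct_name):
    """Return the set of labels ('discordant', 'chimeric') for one SAM line."""
    fields = line.rstrip("\n").split("\t")
    rname, rnext = fields[2], fields[6]  # SAM columns 3 (RNAME) and 7 (RNEXT)
    labels = set()
    if rname == "*":
        return labels  # unmapped read: no evidence either way
    on_construct = rname == construct_name

    # Discordant pair: the mate maps to the other sequence (host vs construct).
    # '=' means the mate is on the same reference, '*' means it is unknown.
    if rnext not in ("=", "*") and (rnext == construct_name) != on_construct:
        labels.add("discordant")

    # Chimeric read: an SA tag lists part of the read aligned on the other side.
    # SA format: SA:Z:rname,pos,strand,CIGAR,mapQ,NM; (one entry per alignment)
    for tag in fields[11:]:
        if tag.startswith("SA:Z:"):
            for entry in tag[5:].split(";"):
                if entry and (entry.split(",")[0] == construct_name) != on_construct:
                    labels.add("chimeric")
    return labels
```

Lines can be fed from `samtools view sample.bam`, skipping header lines that start with "@".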

TC_hunter reports each possible insertion site in an HTML file called output_summary.html. The file contains five columns:

  1. Ranking - hits are ranked by score, best first.
  2. Score - based on the number of chimeric and discordant read pairs supporting the insertion site.
  3. Breakpoint host - where in the host the insertion site is located.
  4. Breakpoint construct - where in the construct the insertion site is located.
  5. Figures - three figures: I) a circular plot (see below), II) an IGV view, III) a more zoomed-in IGV view.

For every predicted insertion site, a circular figure is created. Red links (lines) represent the discordant read pairs supporting the event. Black links represent the chimeric reads supporting the event.

Supporting material for TC_hunter paper

Supporting data