naumenko-sa / crt

transcriptome analysis scripts (historically clinical research transcriptome)
4 stars 4 forks source link

1. Installation

  1. Clone crt, cre, and bioscripts to ~/crt, ~/cre, and ~/bioscripts.
  2. Set PATH: export PATH=~/crt/scripts:~/cre/scripts:~/bioscripts/scripts in ~/.bash_profile.
  3. Install bcbio_nextgen, set PATH and PYTHONPATH:
    export PATH=/path/bcbio/anaconda/bin:$PATH
    export PYTHONPATH=/path/bcbio/anaconda/lib/python2.7

2. Run bcbio with ~/crt/config/crt.bcbio.default.yaml.

  1. name sample SX_case-N-tissue, i.e. S101_55-1-F.
  2. create project/input, project = SX, input files [sample]_1.fq.gz, [sample]_2.fq.gz.
  3. crt.prepare_bcbio_run.sh $project.
  4. (optional) add strandedness: firststrand for stranded samples.
  5. qsub ~/cre/scripts/bcbio.pbs -v project=project Don't trim reads to save all data and delete (acrhive) fastq files.

3. Small variant report

  1. Copy SX-gatk-haplotype-annotated.vcf.gz from bcbio_output/final.
  2. Annotate variants:
    qsub ~/cre/cre.vcf2cre.sh -v original_vcf=SX-gatk-haplotype-annotated.vcf.gz,project=SX
  3. Generate reports (SX dir prepared by step2 should be ./SX, i.e. in the current dir):
    qsub ~/cre/cre.sh -v family=SX,type=rnaseq - including rare intronic variants.
    qsub ~/cre/cre.sh -v family=SX - only rare coding/splicing variants.

4. Gene expression outlier analysis (RPKM)

  1. qsub ~/crt/crt.feature_counts.sh -v bam=file.bam - counts for RPKM calculation in R.
  2. Use read.feature_counts_dir() from ~/crt/crt.utils.R to create RPKM matrix.
  3. crt.GonorazkyNaumenko2018.R - functions for muscular project.

5. Isoform expression outlier analysis (TPM)

6. Splicing analysis (based on MendelianRNA-seq)

1. (For every bam file including controls). Get junctions from a bam file with samtools:

qsub ~/crt/scripts/crt.bam2junctions.pbs -v bam=file.bam,
output: file.bam.junctions.txt

2. Prepare reference database

Load GENCODE junctions:

python3 ~/crt/AddJunctionsToDatabase.py \
--addGencode \
-transcript_model=~/crt/gencode.comprehensive.splice.junctions.txt

output: SpliceJunction.db

Load junctions from control samples: qsub ~/crt/crt.load_junctions.pbs -v bam=file.bam output: SpliceJunction.db

3. (For every sample) Load junctions from a sample to SpliceJunctions.db

qsub ~/crt/crt.load_junctions.pbs -v bam=file.bam

4. Filter rare junctions: qsub ~/crt/crt.filter_junctions.sh -v bam=file.bam

5. Dependencies

1. ~/crt/genes.bed - protein coding genes. 

2. ~/crt/gencode.comprehensive.splice.junctions.txt - junctions from gencode v19.

3. [Python 3.5.2]

4. Python [CIGAR string library](https://pypi.python.org/pypi/cigar/0.1.3) by Brent Pedersen

5. sqlite3 Python library SQLite3 version >=3.11.0.
```
import sqlite3
print (sqlite3.sqlite_version_info)
```

6. Differences from Beryl Cumming's original MendelianRNA-seq

4.6 Footnotes

A junction in multiple gene regions

A single gene region can encompass partial or whole regions of other genes. Thus, the same junction can appear in 2 different gene text files in a sample folder generated by SpliceJunctionDiscovery. Whether a junction belongs to two or more genes is not always factually correct. However, for the sake of inclusion and for the fact that this rarely happens, this edge case has been accounted for in AddJunctionsToDatabase.py in two ways:

1. The mapping of a single junction to multiple genes has been done with the table GENE_REF
2. If the script encounters the same junction in a sample more than once, it will utilize the result with the highest read count for read count and normalized read count and will discard the other.

Splice site flanks and annotation

A +/- flanking region is considered when annotating the 5' and 3' positions of sample junctions to increase the number of annotated junctions. This value is specified by the -flank parameter (default 1). There is an option to not use flanking at all (-flank 0).

6. References and resources

7. List of scripts