huanglikun / BRM

Block Regression Mapping (BRM) is a statistical method for QTL mapping based on bulked segregant analysis by deep sequencing
GNU General Public License v3.0

Block Regression Mapping (BRM)

Block Regression Mapping (BRM) is a statistical method for QTL mapping based on bulked segregant analysis by deep sequencing. The core functions are written in R. For a detailed description of the method, see the original article, "BRM: A statistical method for QTL mapping based on bulked segregant analysis by deep sequencing", published in Bioinformatics.

Please cite: Huang L, Tang W, Bu S, et al. BRM: A statistical method for QTL mapping based on bulked segregant analysis by deep sequencing. Bioinformatics, 2019. https://doi.org/10.1093/bioinformatics/btz861

Introduction

BRM is a BSA-seq method for mapping QTLs or major genes. It can be applied to different populations, including recombinant inbred lines (RIL), doubled haploids (DH), haploids (H), F2, F3, and so on.

BRM finds candidate QTL (or gene) peaks in three main steps. First, it divides the genome into many small blocks of equal size and calculates, for each block, the average allele frequency (AF) in each pool, the average allele frequency in the population (AFP), and the allele frequency difference (AFD) between the two pools. Second, it determines the AFD threshold for the 5% overall significance level at every genomic position. Third, it identifies possible QTL positions (significant AFD peaks) and calculates the 95% confidence interval of each QTL. The BRM scripts write these results to two files: the first contains the results of steps one and two, and the second contains the results of step three.
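
The first step can be illustrated with a minimal Python sketch. It is not the BRM implementation (which is written in R), and it makes two labeled assumptions: marker positions are 1-based, and the block average is taken by pooling allele counts within a block, whereas BRM.R may average differently. The function name `block_afd` and the example counts are hypothetical. AFP and the regression and threshold steps are omitted.

```python
from collections import defaultdict


def block_afd(markers, block_size):
    """Group markers into equal-size blocks and return per-block (AF1, AF2, AFD).

    markers: iterable of (chrom, pos, a, b, c, d) tuples, where a/b are the
    Parent 1 / Parent 2 allele counts in pool 1 and c/d those in pool 2.
    ASSUMPTION: block averages are computed from pooled counts.
    """
    sums = defaultdict(lambda: [0, 0, 0, 0])
    for chrom, pos, a, b, c, d in markers:
        block = (pos - 1) // block_size  # ASSUMPTION: 1-based positions
        counts = sums[(chrom, block)]
        for i, value in enumerate((a, b, c, d)):
            counts[i] += value

    result = {}
    for key in sorted(sums):
        a, b, c, d = sums[key]
        if a + b == 0 or c + d == 0:
            continue  # skip blocks with no coverage in one of the pools
        af1 = a / (a + b)  # AF is defined by the allele from Parent 1
        af2 = c / (c + d)
        result[key] = (af1, af2, af1 - af2)  # AFD = AF1 - AF2
    return result


# Made-up counts, for illustration only.
example = [
    ("chr01", 50, 8, 2, 5, 5),
    ("chr01", 150, 12, 8, 5, 5),
    ("chr01", 250, 3, 7, 6, 4),
]
blocks = block_afd(example, block_size=200)
for (chrom, block), (af1, af2, afd) in blocks.items():
    print(chrom, block, round(af1, 3), round(af2, 3), round(afd, 3))
```

With a block size of 200 bp, the first two markers fall into one block and the third into the next, so the sketch returns two blocks for `chr01`.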

Getting started

Download all scripts and examples from GitHub, then try BRM on the example data:

git clone https://github.com/huanglikun/BRM.git
cd BRM
# Usage: 
# Rscript BRM.R <Block regression mapping configuration file> <Chromosome length file> <Input data in bsa format>
# Design A
Rscript BRM.R configureExample/designA/BRM_conf.txt configureExample/chr_length.tsv dataExample/designA/yeast_markers_dp10.bsa
# Design B (the experiment which has a high selected pool and a random pool)
Rscript BRM.R configureExample/designBH/BRM_conf.txt configureExample/chr_length.tsv dataExample/designBH/yeast_markers_dp10.bsa
# Design B (the experiment which has a low selected pool and a random pool)
Rscript BRM.R configureExample/designBL/BRM_conf.txt configureExample/chr_length.tsv dataExample/designBL/yeast_markers_dp10.bsa

Input data file

The input data file is a tab-separated values file in BSA format. It contains six columns:

| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 |
| --- | --- | --- | --- | --- | --- |
| Chromosome code | Marker position (bp) | a | b | c | d |

Note: a, b, c and d are the counts of the marker allele from Parent 1 in pool 1 (the high pool in Design A, or the selected high/low pool in Design B), from Parent 2 in pool 1, from Parent 1 in pool 2 (the low pool in Design A, or the random pool in Design B), and from Parent 2 in pool 2, respectively.
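
As a rough illustration of how the six columns relate to the allele frequencies used later, the following Python sketch parses one BSA line and computes the per-marker AF in each pool and their difference. It assumes the file has no header line and that the counts are integers; `BsaMarker`, `parse_bsa_line`, `allele_frequencies` and the example line are hypothetical and not part of BRM.

```python
from typing import NamedTuple


class BsaMarker(NamedTuple):
    chrom: str
    pos: int
    a: int  # Parent 1 allele count in pool 1
    b: int  # Parent 2 allele count in pool 1
    c: int  # Parent 1 allele count in pool 2
    d: int  # Parent 2 allele count in pool 2


def parse_bsa_line(line):
    """Parse one tab-separated BSA line (ASSUMPTION: six columns, no header)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    pos, a, b, c, d = (int(x) for x in fields[1:])
    return BsaMarker(fields[0], pos, a, b, c, d)


def allele_frequencies(m):
    """Return (AF1, AF2, AFD), with AF defined by the Parent 1 allele."""
    if m.a + m.b == 0 or m.c + m.d == 0:
        raise ValueError("marker has no counts in one of the pools")
    af1 = m.a / (m.a + m.b)
    af2 = m.c / (m.c + m.d)
    return af1, af2, af1 - af2


# Made-up line, for illustration only.
marker = parse_bsa_line("chr01\t1200\t12\t8\t9\t11\n")
af1, af2, afd = allele_frequencies(marker)
print(marker.chrom, marker.pos, round(af1, 3), round(af2, 3), round(afd, 3))
```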

Pools illustration

Input configuration files

Note: The chromosome codes in the data file must match the chromosome codes in the chromosome length file.

About uα/2

The uα/2 values in various populations

The uα/2 values in random-mating progeny populations

When the progeny population of a cross between two pure-line parents is generated by random mating rather than by selfing, the uα/2 value is larger. The following table shows the uα/2 values of yeast and maize in different random-mating progeny populations, where H1 (or H) is the gamete (haploid) generated by F1, and R1 (or F2) is the sporophyte generated by the combination of F1 gametes; H2 is the gamete generated by R1 (F2), and R2 is the sporophyte generated by the combination of R1 (F2) gametes; the others follow the same pattern.

Output files

Result 1 file

This file contains one line showing the theoretical threshold (assuming AF = 0.5 in the population), followed by an 11-column table. AF is defined by the allele from Parent 1.

The table columns are described below:

| Column | Heading | Description |
| --- | --- | --- |
| 1 | Chr. | Chromosome |
| 2 | Pos. | Block position (bp) |
| 3 | AF1-Observed | Observed value of block average allele frequency in pool 1 (AF1) |
| 4 | AF1-Expected | Expected value of block average allele frequency in pool 1 (AF1) |
| 5 | AF2-Observed | Observed value of block average allele frequency in pool 2 (AF2) |
| 6 | AF2-Expected | Expected value of block average allele frequency in pool 2 (AF2) |
| 7 | AFD-Observed | Observed value of allele frequency difference (AF1 - AF2) |
| 8 | AFD-Expected | Expected value of allele frequency difference (AF1 - AF2) |
| 9 | AFP-Observed | Observed value of block average allele frequency in the population (AFP) |
| 10 | AFP-Expected | Expected value of block average allele frequency in the population (AFP) |
| 11 | Sample threshold | Threshold estimated based on the expected AFP |

Result 2 file

This file lists the candidate QTLs (significant AFD peaks). It contains six columns. At present, some peaks need to be filtered out manually to obtain the final candidate QTL list.

| Column | Heading | Description |
| --- | --- | --- |
| 1 | Chr. | Chromosome code |
| 2 | Pos. | Position of block center |
| 3 | Val. | Peak value of AFD |
| 4 | Peak Dir. | Peak direction: +, upward; -, downward |
| 5 | Start | Start point of confidence interval |
| 6 | End | End point of confidence interval |

Q&A

  1. If I have the VCF file generated by Freebayes/GATK, how can I convert the VCF format into BSA format?

    A Perl script is provided for converting VCF format into BSA format.

    Quick start

      # Usage:
      # perl tools/vcf2bsa.pl <samples information file> <markers.vcf> <output file>
      perl tools/vcf2bsa.pl configureExample/vcf2bsa/vcf2bsa_conf.txt dataExample/vcf2bsa/markers.freebayes.vcf result/markers.bsa

    Example

    configureExample/vcf2bsa/vcf2bsa_conf.txt

    # Samples in 2x2 Table
    # Bulk/Pool 1: the first sample
    Table2x2.pool1 = low-pool-RG
    # Bulk/Pool 2: the second sample
    Table2x2.pool2 = random-pool-RG
    # Parent 1
    Table2x2.parent1 = P1-RG
    # Parent 2
    Table2x2.parent2 = P2-RG

    This is a key-value file that describes the relationships among the samples. The separator is "=", and spaces between the key and the value are ignored. Four parameters need to be set:

    | Key | Value type | Description |
    | --- | --- | --- |
    | Table2x2.pool1 | String | Design A: high selected pool RG tag in VCF file; Design B: selected pool RG tag in VCF file. |
    | Table2x2.pool2 | String | Design A: low selected pool RG tag in VCF file; Design B: random pool RG tag in VCF file. |
    | Table2x2.parent1 | String | High phenotype value parent sample RG tag in VCF file. |
    | Table2x2.parent2 | String | Low phenotype value parent sample RG tag in VCF file. |

    The RG tag is the “Read Group” set at the read mapping step, for example the value of the -R parameter in BWA.

    If one parent is missing, the script treats the genotype that differs from the known parent sample as the genotype of the other parent. If both parents are missing, the script treats the reference genotype as the known parent genotype.
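
The key-value layout described above is simple enough to read with a few lines of code. The Python sketch below is only an illustration of the format, not how `tools/vcf2bsa.pl` parses it. It assumes that lines starting with `#` are comments, which is inferred from the example file. The function name `parse_conf` is hypothetical.

```python
def parse_conf(text):
    """Parse "key = value" lines into a dict.

    ASSUMPTION: blank lines and lines starting with "#" are ignored,
    as the example configuration file suggests.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line: {raw!r}")
        # Spaces between key and value are ignored.
        settings[key.strip()] = value.strip()
    return settings


example = """\
# Samples in 2x2 Table
Table2x2.pool1 = low-pool-RG
Table2x2.pool2 = random-pool-RG
Table2x2.parent1 = P1-RG
Table2x2.parent2 = P2-RG
"""
conf = parse_conf(example)
for key, value in conf.items():
    print(key, "->", value)
```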

  2. Why does the output show that the data size of one chromosome is 0?

    This message appears when the chromosome codes in the data file and in the chromosome length file do not match. The chromosome codes in the two files must be exactly the same, and the comparison is case-sensitive.
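
A quick way to spot such a mismatch before running BRM is to compare the codes in the two files. The Python sketch below is a hypothetical helper, not part of BRM. It assumes that the chromosome code is the first tab-separated column in both files; this is stated for the data file above but is only an assumption for the chromosome length file. The example lines are illustrative.

```python
def first_column(lines):
    """Collect the first tab-separated field of each non-empty, non-comment line."""
    return [
        line.split("\t", 1)[0].strip()
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


def unmatched_codes(data_codes, length_codes):
    """Return (codes only in the data file, codes only in the length file).

    The comparison is exact and case-sensitive.
    """
    data_set, length_set = set(data_codes), set(length_codes)
    return sorted(data_set - length_set), sorted(length_set - data_set)


# Illustrative lines; "chrII" and "chrii" differ only in case.
data_lines = ["chrI\t100\t5\t5\t5\t5", "chrII\t200\t6\t4\t5\t5"]
length_lines = ["chrI\t230218", "chrii\t813184"]

only_in_data, only_in_length = unmatched_codes(
    first_column(data_lines), first_column(length_lines)
)
print("only in data file:", only_in_data)
print("only in length file:", only_in_length)
```

Any code reported in either list points to a chromosome whose markers would not be matched to a length entry.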

  3. What is the suitable block size?

    We recommend a block size whose physical length is approximately equivalent to a genetic distance of 0.1 cM. For example, in yeast, 1 cM corresponds to about 2.5 kb on average, so 0.1 cM is about 0.25 kb and we choose 0.2 kb as the block size. In rice, 1 cM corresponds to about 250 kb, so 0.1 cM is about 25 kb and we choose 20 kb as the block size.
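
The arithmetic behind this recommendation can be written as a one-line helper. The Python sketch below is hypothetical (the function name and parameters are not part of BRM) and returns only the physical length equivalent to 0.1 cM. Choosing a convenient value near it, as in the yeast and rice examples, remains a manual decision.

```python
def block_size_kb(kb_per_cm, target_cm=0.1):
    """Physical length (kb) equivalent to target_cm, given the average kb per cM."""
    return kb_per_cm * target_cm


yeast_kb = block_size_kb(2.5)  # yeast: about 2.5 kb per cM
rice_kb = block_size_kb(250)   # rice: about 250 kb per cM
print(f"yeast: about {yeast_kb:.2f} kb, rice: about {rice_kb:.0f} kb")
```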
