Block Regression Mapping (BRM)

Block Regression Mapping (BRM) is a statistical method for QTL mapping based on bulked segregant analysis by deep sequencing. The core function is programmed by R language. For the detailed description of the method, please see the original article "BRM: A statistical method for QTL mapping based on bulked segregant analysis by deep sequencing" published in Bioinformatics.

Please cite: Huang L, Tang W, Bu S, et al. BRM: A statistical method for QTL mapping based on bulked segregant analysis by deep sequencing. Bioinformatics, 2019. https://doi.org/10.1093/bioinformatics/btz861

Introduction

BRM is a method of BSA-seq for mapping QTLs or major genes. It can apply to different populations including recombinant inbred lines (RIL), doubled haploid (DH), haploid (H), F₂ , F₃, and so on.

BRM finds out candidate QTL (or gene) peaks in three main steps. The first step is to divide the genome into many small blocks of equal size and calculate the average allele frequency (AF) of each block in each pool, the average allele frequency in the population (AFP) of each block as well as the allele frequency difference (AFD) between two pools in each block. The second step is to figure out the AFD threshold of the 5% overall significance level at every genomic position. The third step is to identify possible QTL positions (significant AFD peaks) and calculate the 95% confidence interval of each QTL. The BRM scripts output the above results in two files. The first file contains the results of step one and step two, and the second file contains the results of step three.

Getting started

Download all scripts and examples from GitHub, then you can have a try with the example data:

git clone https://github.com/huanglikun/BRM.git
cd BRM
# Usage: 
# Rscript BRM.R <Block regression mapping configuration file> <Chromosome length file> <Input data in bsa format>
# Design A
Rscript BRM.R configureExample/designA/BRM_conf.txt configureExample/chr_length.tsv dataExample/designA/yeast_markers_dp10.bsa
# Design B (the experiment which has a high selected pool and a random pool)
Rscript BRM.R configureExample/designBH/BRM_conf.txt configureExample/chr_length.tsv dataExample/designBH/yeast_markers_dp10.bsa
# Design B (the experiment which has a low selected pool and a random pool)
Rscript BRM.R configureExample/designBL/BRM_conf.txt configureExample/chr_length.tsv dataExample/designBL/yeast_markers_dp10.bsa

Input data file

The input data file is a tab-separated values file named with bsa format. It contains six columns:

Column 1	Column 2	Column 3	Column 4	Column 5	Column 6
Chromosome code	Marker position (bp)	a	b	c	d

Note: a, b, c and d stand for the counts of marker allele of PARENT 1 in pool 1 (high pool in Design A or selected high/low pool in Design B), PARENT 2 in pool 1, PARENT 1 in pool 2 (low pool in Design A or random pool in Design B) and PARENT 2 in pool 2, respectively.

Pools illustration

Example

dataExample/designBL/yeast_markers_dp10.bsa

I   33070   6   6   6   7
I   33147   7   7   2   5
I   33152   4   5   2   5
I   33200   3   9   4   3
I   33293   7   7   7   2  
......

Input configuration files

Chromosome length file

It's a tab-separated values file with two columns:

Column 1 Column 2

Chromosome code Chromosome length (bp)
- Example
  
  configureExample/chr_length.tsv
```
I    230218
II    813184
III    316620
......
```

Column 1	Column 2
Chromosome code	Chromosome length (bp)

Note: The chromosome code in data file should corresponding to the chromosome code in chromosome length file.

Block regression mapping configuration file

It's a key-value file containing parameters required for block regression mapping (see the following table). The separator is "=". The space between key and value will be ignored.

Key	Value type	Description	Recommend value
Design	A, BH, BL	A: Design A; BH: Design B with selected high pool; BL: Design B with selected low pool	Default: A
n1	Integer	Number of individuals in pool1 (high pool)	Depending on experiment design
n2	Integer	Number of individuals in pool2 (low pool)	Depending on experiment design
t	0, 1	Population type	t = 0 for DH, RI or H; t = 1 for F₂, F₃, etc.
ua	float	u_α/2 value	See the table below, or can be calculated by cal_ua_fk.R or cal_ua_rk.R in the "tools" directory (see explanation below).
UNIT	Integer	Unit block size (bp)	Default: 1000
DEG	Integer	The degree of the polynomials used in LOESS (local polynomial regression fitting)	2
BLK	Integer or float	Block size = BLK * UNIT	e.g.: yeast, BLK = 0.2 ; rice, BLK = 20
MIN	Integer	Minimum total depth in a valid block	10
MINVALID	Integer	Minimum valid block number in a chromosome	10
Result1_File	File path	To define the result 1 output path (optional)	Default: result/result1.xls
Result2_File	File path	To define the result 2 output path (optional)	Default: result/result2.xls

Example

configureExample/designBL/BRM_conf.txt

version 0.3

Experiment

Design = BL # available: A, BH, BL. BH: Design B with HIGH selected pool. BL: Design B with LOW selected pool. n1 = 300 # pool 1 size. The high pool for design A or the selected pool for design B. n2 = 300 # pool 2 size. The low pool for design A or the random pool for design B. t = 0 # For DH or RI etc., t=0; F2 or F3 etc., t=1 ua = 4.08 # For rice, F2:3.65; F3:3.74; F4:3.78.

Block regression

UNIT = 1000 # block unit(bp) DEG = 2 # The degree of the polynomials to be used in Local Polynomial Regression Fitting. BLK = 0.2 # block size = BLK * UNIT MIN = 10 # min total depth in block MINVALID = 10 # min valid blocks in one chromosome (needed to be at least 10)

output file determination (optional)

Result1_File = result/result1.xls # including allele frequency and threshold

Result2_File = result/result2.xls # including peak information and confidence interval

Result1_File = result_random_L/result1.xls # including allele frequency and threshold Result2_File = result_random_L/result2.xls # including peak information and confidence interval

About u_α/2

The u_α/2 values in various populations

The u_α/2 values of some species in various populations

Species	n^a	Genome size^b		Ratio (kb/cM)	u_α/2^c				Ref.
Species	n^a	cM	Mb	Ratio (kb/cM)	H/DH/F₂	F₃	F₄	RIL	Ref.
Arabidopsis	5	600	119	199	3.41	3.50	3.54	3.57	[1]
Cucumber	7	1390	192	138	3.62	3.72	3.75	3.78	[2]
Maize	10	2060	2106	1023	3.72	3.81	3.85	3.87	[3]
Rapeseed	18	2520	855	339	3.78	3.87	3.90	3.93	[4]
Rice	12	1530	382	250	3.65	3.74	3.78	3.80	[5]
Tobacco	12	3270	3613	1105	3.84	3.92	3.96	3.98	[6]
Tomato	12	1470	807	549	3.65	3.73	3.77	3.80	[7]
Wheat	21	3140	14547	4633	3.83	3.92	3.95	3.98	[8]
Yeast	16	4900	12	2.5	3.93				[9]

Note: a. n, number of chromosomes in haploid. b. The genetic map length (cM) of each species was from the references listed except for that of yeast, which was calculated by us using the data from [9]. The genome size (Mb) of each species was all from NCBI. c. Corresponding to the overall significance level of 0.05.

References

[1] Garcia-Hernandez,M. et al. (2002) TAIR: a resource for integrated Arabidopsis data. Funct Integr Genomics, 2, 239–253.

[2] Zhou,Q. et al. (2015) A sequencing-based linkage map of cucumber. Molecular Plant, 8, 961–963.

[3] Civardi,L. et al. (1994) The relationship between genetic and physical distances in the cloned a1-sh2 interval of the Zea mays L. genome. Proc Natl Acad Sci U S A, 91, 8268–8272.

[4] Raman,H. et al. (2014) SNP markers-based map construction and genome-wide linkage analysis in Brassica napus. Plant Biotechnology Journal, 12, 851–860.

[5] International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature, 436, 793.

[6] Bindler,G. et al. (2011) A high density genetic map of tobacco (Nicotiana tabacum L.) obtained from large scale microsatellite marker development. Theor Appl Genet, 123, 219.

[7] Shirasawa,K. et al. (2010) SNP Discovery and linkage map construction in cultivated tomato. DNA Research, 17, 381–391.

[8] Yang,Q. et al. (2018) High-density genetic map construction and mapping of the homologous transformation sterility gene (hts) in wheat using GBS markers. BMC Plant Biology, 18, 301.

[9] Bloom,J.S. et al. (2015) Genetic interactions contribute less than additive effects to quantitative trait variation in yeast. Nat Commun, 6, 8712.

A tool for calculating u_α/2 in an F_k population

A tool is provided in tools/cal_ua_fk.R as below for calculating u_α/2 in an F_k population given the values of necessary parameters.

Quick start

# Usage
# Rscript tools/cal_ua_fk.R <population and species information file>
Rscript tools/cal_ua_fk.R configureExample/fk_rice/ua_fk_conf.txt

Example

configureExample/fk_rice/ua_fk_conf.txt

If we know the approximately total genetic distance of this species,

#
precision = 1e-13
k         = 2            # k=0(RIL) k=1(H/DH) k=2(F2) k=3(F3) k=4(F4) k>4(F5,F6...)
#
chrNum         = 12      # chromosome number
geneticLength  = 1530    # Genome size (cM)
#genomeSize    = 382     # Genome size (Mb)
#ratio         = 249.7   # Ratio (kb/cM)

Otherwise,

#
precision = 1e-13
k         = 2            # k=0(RIL) k=1(H/DH) k=2(F2) k=3(F3) k=4(F4) k>4(F5,F6...)
#
chrNum         = 12      # chromosome number
#geneticLength = 1530    # Genome size (cM)
genomeSize     = 382     # Genome size (Mb)
ratio          = 249.7   # Ratio (kb/cM)

It is a key-value file contains required parameters. The separator is "=". The space between key and value will be ignored. There are at least four parameters needed to be set:

Key	Value type	Description	Recommend value
precision	Float	Precision for numerical calculation.	Default: 1e-13
k	Integer	Generation. k=0 means RIL k=1 means H/DH k=2 means F₂ k=3 means F₃ k=4 means F₄ k>4 means F₅, F₆ ..., and is considered as RIL.	Depending on experimental design
chrNum	Integer	Chromosome number.	Depending on species
geneticLength	Integer	Approximately total genetic distance of the species, the unit is cM. Comment this by "#" to use parameters below instead.	Depending on species
genomeSize	Float	Genome size of the species, the unit is Mb. (optional)	Depending on species
ratio	Float	Physical distance/Genetic distance ratio (kb/cM). (optional)	Depending on species

Result

The u_α/2 will be displayed as ua on the screen:

Read global parameters from configureExample/fk_rice/ua_fk_conf.txt .
Parameter precision=1e-13
Parameter k=2
Parameter chrNum=12
Parameter geneticLength=1530

ua is  3.654641

The u_α/2 values in random-mating progeny populations

For the case that the progeny population of a cross between two pure-line parents is not generated by selfing but by random mating, the u_α/2 value will be larger. The following table shows the u_α/2 values of yeast and maize in different random-mating progeny populations, where H₁ (or H) is the gamete (haploid) generated by F₁, and R₁ (or F₂) is the sporophyte generated by combination of F₁ gametes; H₂ is the gamete generated by R₁ (F₂), and R₂ is the sporophyte generated by combination of R₁ (F₂) gametes; the others can be deduced likewise.

The u_α/2 values of yeast and maize in random-mating progeny populations

Species H₁(H)/R₁(F₂) H₂/R₂ H₃/R₃ H₄/R₄ H₅/R₅ H₆/R₆ H₇/R₇ H₈/R₈ H₉/R₉

Yeast 3.93 4.02 4.08 4.13 4.17 4.21 4.24 4.26 4.29

Maize 3.72 3.81 3.88 3.93 3.97 4.01 4.04 4.07 4.09

Species	H₁(H)/R₁(F₂)	H₂/R₂	H₃/R₃	H₄/R₄	H₅/R₅	H₆/R₆	H₇/R₇	H₈/R₈	H₉/R₉
Yeast	3.93	4.02	4.08	4.13	4.17	4.21	4.24	4.26	4.29
Maize	3.72	3.81	3.88	3.93	3.97	4.01	4.04	4.07	4.09

A tool for calculating u_α/2 in an R_k population

A tool is provided in tools/cal_ua_rk.R as below for calculating u_α/2 in a random-mating progeny population given the values of necessary parameters.

Quick start

# Usage
# Rscript tools/cal_ua_rk.R <population and species information file>
Rscript tools/cal_ua_rk.R configureExample/rk_yeast/ua_rk_conf.txt

Example

configureExample/rk_yeast/ua_rk_conf.txt

If we know the approximately total genetic distance of this species,

#
precision = 1e-13
k         = 6
#
chrNum         = 16      # chromosome number
geneticLength  = 4900    # Genome size (cM)
#genomeSize    = 12      # Genome size (Mb)
#ratio         = 2.5     # Ratio (kb/cM)

Otherwise,

#
precision = 1e-13
k         = 6
#
chrNum         = 16      # chromosome number
#geneticLength = 4900    # Genome size (cM)
genomeSize     = 12      # Genome size (Mb)
ratio          = 2.5     # Ratio (kb/cM)

It is a key-value file contains required parameters. The separator is "=". The space between key and value will be ignored. There are at least four parameters needed to be set:

Key	Value type	Description	Recommend value
precision	Float	Precision for numerical calculation.	Default: 1e-13
k	Integer	Generation, e.g. k=6 means R₆ for diploid or H₆ for haploid.	Depending on experimental design
chrNum	Integer	Chromosome number.	Depending on species
geneticLength	Integer	Approximately total genetic distance of the species, the unit is cM. Comment this by "#" to use parameters below instead.	Depending on species
genomeSize	Float	Genome size of the species, the unit is Mb. (optional)	Depending on species
ratio	Float	Physical distance/Genetic distance ratio (kb/cM). (optional)	Depending on species

Result

The u_α/2 will be displayed as ua on the screen:

Read global parameters from configureExample/rk_yeast/ua_rk_conf.txt .
Parameter precision=1e-13
Parameter k=6
Parameter chrNum=16
Parameter geneticLength=4900

recombination rate is  0.02545
Genetic distance is  2.5472  cM
ua is  4.207883

Output files

Result 1 file

This file contains one line to show the theoretical threshold (assuming AF = 0.5 in the population) and a 11 columns table following. The AF is defined by the allele from Parent 1.

The data table described as below:

Column	Heading	Description
1	Chr.	Chromosome
2	Pos.	Block position (bp)
3	AF1-Observed	Observed value of block average allele frequency in pool 1 (AF1)
4	AF1-Expected	Expected value of block average allele frequency in pool 1 (AF1)
5	AF2-Observed	Observed value of block average allele frequency in pool 2 (AF2)
6	AF2-Expected	Expected value of block average allele frequency in pool 2 (AF2)
7	AFD-Observed	Observed value of allele frequency differency (AF1 - AF2)
8	AFD-Expected	Expected value of allele frequency differency (AF1 - AF2)
9	AFP-Observed	Observed value of block average allele frequency in the population (AFP)
10	AFP-Expected	Expected value of Observed value of block average allele frequency in the population (AFP)
11	Sample threshold	Threshold estimated based on the expected AFP

Example

result_random_L/result1.xls

#Theoretical threshold is ± 0.1666
#Chr.   Pos.    AF1-Observed    AF1-Expected    AF2-Observed    AF2-Expected    AFD-Observed    AFD-Expected    AFP-Observed    AFP-Expected    Sample threshold
......
I   178300  NA  0.4655  0.6667  0.4993  NA  0.0409  0.6667  0.4993  0.1666
I   178500  NA  0.4658  NA  0.4993  NA  0.0404  NA  0.4993  0.1666
I   178700  0.5 0.4661  0.4583  0.4993  -0.0417 0.04    0.4583  0.4993  0.1666
I   178900  0.4545  0.4664  NA  0.4992  NA  0.0396  NA  0.4992  0.1666
I   179100  0.4444  0.4667  0.5 0.4992  0.0556  0.0392  0.5 0.4992  0.1666
I   179300  0.6 0.467   0.2727  0.4992  -0.3273 0.0387  0.2727  0.4992  0.1666
......

Result 2 file

This file shows the information of candidate QTLs (significant AFD peaks). It contains six columns. By far, some peaks are needed to filter out manually to get the final candidate QTL list.

Column	Heading	Description
1	Chr.	Chromosome code
2	Pos.	Position of block center
3	Val.	Peak value of AFD
4	Peak Dir.	Peak direction: +, upward; -,downward
5	Start	Start point of confidence interval
6	End	End point of confidence interval

Example

result_random_L/result2.xls

#Chr.   Pos.    Val.    Peak Dir.   Start   End
IV  599459  -0.1914 -   534520  680013
VIII    222625  -0.2848 -   153041  287275
XI  643900  -0.3316 -   618381  643900
XII 655812  0.2102  +   550089  733741
XIII    9900    -0.2488 -   9900    108758
XIV 377027  0.273   +   321716  450359

Q&A

If I have the VCF file generated by Freebayes/GATK, how can I convert the VCF format into BSA format?

A perl script is provided for transforming the VCF format into BSA format.

Quick start

  # Usage:
  # perl tools/vcf2bsa.pl <samples information file> <markers.vcf> <output file>
  perl tools/vcf2bsa.pl configureExample/vcf2bsa/vcf2bsa_conf.txt dataExample/vcf2bsa/markers.freebayes.vcf result/markers.bsa

Example

configureExample/vcf2bsa/vcf2bsa_conf.txt

# Samples in 2x2 Table
# Bulk/Pool 1: the first sample
Table2x2.pool1 = low-pool-RG
# Bulk/Pool 2: the second sample
Table2x2.pool2 = random-pool-RG
# Parent 1
Table2x2.parent1 = P1-RG
# Parent 2
Table2x2.parent2 = P2-RG

It's a key-value file contains samples relationship. The separator is "=". And the space between key and value will be ignored. There are four parameters needed to be set:

Key	Value type	Description
Table2x2.pool1	String	Design A: high selected pool RG tag in VCF file. Design B: selected pool RG tag in VCF file.
Table2x2.pool2	String	Design A: low selected pool RG tag in VCF file. Design B: random pool RG tag in VCF file.
Table2x2.parent1	String	High phenotype value parent sample RG tag in VCF file.
Table2x2.parent2	String	Low phenotype value parent sample RG tag in VCF file.

RG tag is the “Read Group” setting at the reads mapping step. For example, the -R parameter value in BWA.

If one parent is missing, the script will consider the genotype different from the known parent sample as the other parent genotype. If both parents are missing, the script will consider the reference genotype as the known parent genotype.

Why does the output show that the data size of one chromosome is 0?

If the chromosome code in the data file and the chromosome code in the chromosome length file are not the same, such report will be shown. The chromosome codes in the two files should be exactly the same. It is case sensitive.

back to top
What is the suitable block size?

We recommend a physical distance approximately equivalent to a genetic distance of 0.1 cM as the block size. For example, in yeast, 1 cM is approximately equivalent to 2.5 kb on average, so we choose 0.2 kb as the block size. In rice, 1 cM is about equivalent to 250 kb, so we choose 20 kb as the block size.

back to top

huanglikun / BRM

readme

Block Regression Mapping (BRM)

Content

Introduction

Getting started

Input data file

Example

Input configuration files

Chromosome length file

Example

Block regression mapping configuration file

Example

version 0.3

Experiment

Block regression

output file determination (optional)

Result1_File = result/result1.xls # including allele frequency and threshold

Result2_File = result/result2.xls # including peak information and confidence interval

About u_α/2

The u_α/2 values in various populations

The u_α/2 values of some species in various populations

A tool for calculating u_α/2 in an F_k population

Quick start

Result

The u_α/2 values in random-mating progeny populations

The u_α/2 values of yeast and maize in random-mating progeny populations

A tool for calculating u_α/2 in an R_k population

Quick start

Example

Result

Output files

Result 1 file

Example

Result 2 file

Example

Q&A

Quick start

Example

huanglikun / BRM

readme

Block Regression Mapping (BRM)

Content

Introduction

Getting started

Input data file

Example

Input configuration files

Chromosome length file

Example

Block regression mapping configuration file

Example

version 0.3

Experiment

Block regression

output file determination (optional)

Result1_File = result/result1.xls # including allele frequency and threshold

Result2_File = result/result2.xls # including peak information and confidence interval

About uα/2

The uα/2 values in various populations

The uα/2 values of some species in various populations

A tool for calculating uα/2 in an Fk population

Quick start

Result

The uα/2 values in random-mating progeny populations

The uα/2 values of yeast and maize in random-mating progeny populations

A tool for calculating uα/2 in an Rk population

Quick start

Example

Result

Output files

Result 1 file

Example

Result 2 file

Example

Q&A

Quick start

Example

About u_α/2

The u_α/2 values in various populations

The u_α/2 values of some species in various populations

A tool for calculating u_α/2 in an F_k population

The u_α/2 values in random-mating progeny populations

The u_α/2 values of yeast and maize in random-mating progeny populations

A tool for calculating u_α/2 in an R_k population