The GMASS score is a novel measure of structural similarity between two genome assemblies. It is computed from the length and number of similar genomic regions, defined as consensus segment blocks (CSBs), that the two assemblies share.
git clone https://github.com/jkimlab/GMASS.git
cd GMASS
perl setup.pl install
perl calculateGMASS.pl -p example/params.txt -o outdir
To install the package for calculating the GMASS score:
git clone https://github.com/jkimlab/GMASS.git
cd GMASS
perl setup.pl install
Users can test whether the package is installed properly with
perl calculateGMASS.pl -p example/params.txt -o outdir
Users can also uninstall the package with
perl setup.pl uninstall
calculateGMASS.pl [options] -f1 <fasta1> -f2 <fasta2> -r <resolutions> -s <dist>
-Inputs:
-f1/-f2 Uncompressed sequence files in fasta format
-r|--resolution Comma-separated resolution list
-s|--strict Alignment strictness [self|near|medium|far] (Default: near)
-Options:
-c|--core Core number (Default: 1)
-o|--outdir Path of output directory (Default: Current directory)
-h|--help Print help message
Input parameters can also be supplied to the package in a file:
calculateGMASS.pl [options] -p <params>
-f1/-f2: \<fasta>
Uncompressed sequence files of the two genome assemblies to compare. The files must be in FASTA format; see https://en.wikipedia.org/wiki/FASTA_format for details.
Providing an assembly with -f1 does not necessarily make it the reference assembly. The package calculates the N50 size of each assembly and uses it to decide automatically which assembly is the reference and which is the target, so users do not have to calculate N50 themselves.
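For readers unfamiliar with N50, the following is a minimal Python sketch of the common definition: the sequence length at which the sequences of that length or longer cover at least half of the total assembly length. The function names `sequence_lengths` and `n50` are hypothetical helpers for illustration, and the package's own N50 calculation may differ in detail.

```python
def sequence_lengths(lines):
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new sequence; store the previous one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return N50 using the common definition (an assumption here)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

To apply it to a file, open the FASTA file and pass the file object to `sequence_lengths`, then pass the result to `n50`.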
-r: \<resolutions>
Comma-separated list of resolutions. Each resolution sets the minimum CSB size. At least two resolution values are required.
-s: \<dist>
Adjusts the strictness of the alignment. There are four options: 'self', 'near', 'medium', and 'far'. The default is 'near'.
Example of usage
- self: Comparing different versions of human genome assemblies
- near: Comparing human genome assembly to chimpanzee genome assembly
- medium: Comparing human genome assembly to mouse genome assembly
- far: Comparing human genome assembly to chicken genome assembly
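When the package is driven from a script, the constraints described here (at least two resolutions, one of four strictness values) can be checked before the run starts. The sketch below assembles an argument list using only the flags documented in this README; `build_command` is a hypothetical helper, not part of the package.

```python
STRICTNESS = ("self", "near", "medium", "far")


def build_command(fasta1, fasta2, resolutions, strict="near", cores=1, outdir="."):
    """Build an argument list for calculateGMASS.pl after basic checks."""
    if len(resolutions) < 2:
        raise ValueError("at least two resolution values are required")
    if strict not in STRICTNESS:
        raise ValueError(f"strictness must be one of {STRICTNESS}")
    return [
        "perl", "calculateGMASS.pl",
        "-f1", fasta1,
        "-f2", fasta2,
        "-r", ",".join(str(r) for r in resolutions),
        "-s", strict,
        "-c", str(cores),
        "-o", outdir,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the GMASS directory.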
-p: \<params>
A file containing input parameters.
# Assemblies compared
### file format should be fasta format
$ASSEMBLY1=example/Assembly1.fa
$ASSEMBLY2=example/Assembly2.fa
# Resolution lists (Comma-separated)
$RESOLUTION=10000,20000,30000,40000,50000
# Alignment strictness
### Option: self, near(default), medium, far
$STRICT=near
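A parameter file in this layout can be sanity-checked before it is handed to the package. The following is a minimal sketch that assumes every non-comment line has the form `$KEY=value`, as in the example; `parse_params` is a hypothetical helper, and the package's own parser may accept other forms.

```python
EXAMPLE = """\
# Assemblies compared
### file format should be fasta format
$ASSEMBLY1=example/Assembly1.fa
$ASSEMBLY2=example/Assembly2.fa
# Resolution lists (Comma-separated)
$RESOLUTION=10000,20000,30000,40000,50000
# Alignment strictness
### Option: self, near(default), medium, far
$STRICT=near
"""


def parse_params(text):
    """Parse '$KEY=value' lines, skipping blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not line.startswith("$") or "=" not in line:
            raise ValueError(f"unrecognised line: {raw!r}")
        key, value = line[1:].split("=", 1)
        params[key.strip()] = value.strip()
    return params


params = parse_params(EXAMPLE)
resolutions = [int(v) for v in params["RESOLUTION"].split(",")]
if len(resolutions) < 2:
    raise ValueError("at least two resolution values are required")
```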
If the package runs successfully, the output directory looks like this.
aln.log assembly_CSB.stats.txt chainNet/ CSB/ data/ scores.txt
aln.log
Log file of alignment run
assembly_CSB.stats.txt
A file containing the features of the assemblies and CSBs at each resolution
Description
- AS_count: The number of scaffolds (chromosomes) in reference/target assembly
- AS_length: Total size of scaffolds (chromosomes) in reference/target assembly
- usedAS_count: The number of scaffolds being used for CSBs in reference/target assembly
- usedAS_length: Total size of scaffolds being used for CSBs in reference/target assembly
- CSB_count: The number of CSBs constructed between the assembly pair
- CSB_length: Total size of CSBs constructed between the assembly pair
Example
#stats              10000    20000    30000    40000    50000
AS_count(ref)       19       19       19       19       19
AS_count(tar)       17       17       17       17       17
AS_length(ref)      2880676  2880676  2880676  2880676  2880676
AS_length(tar)      2862930  2862930  2862930  2862930  2862930
CSB_count(ref)      5        5        5        4        4
CSB_count(tar)      5        5        5        4        4
CSB_length(ref)     2837645  2837645  2837645  2799734  2799734
CSB_length(tar)     2828992  2828992  2828992  2791108  2791108
usedAS_count(ref)   5        5        5        4        4
usedAS_count(tar)   5        5        5        4        4
usedAS_length(ref)  2855326  2855326  2855326  2817019  2817019
usedAS_length(tar)  2828992  2828992  2828992  2791108  2791108
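The statistics file can be loaded for downstream checks. The sketch below assumes the file holds one row per line with whitespace-separated columns and a header row starting with `#stats`; `parse_stats` is a hypothetical helper, and the sample embeds only three rows of the example. The coverage ratio at the end is an illustrative derived value, not something the package reports.

```python
SAMPLE = """\
#stats 10000 20000 30000 40000 50000
AS_length(ref) 2880676 2880676 2880676 2880676 2880676
CSB_count(ref) 5 5 5 4 4
CSB_length(ref) 2837645 2837645 2837645 2799734 2799734
"""


def parse_stats(text):
    """Return {row_name: {resolution: value}} from the stats table."""
    resolutions = []
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#stats":
            # Header row lists the resolutions.
            resolutions = [int(v) for v in fields[1:]]
            continue
        values = [int(v) for v in fields[1:]]
        table[fields[0]] = dict(zip(resolutions, values))
    return table


table = parse_stats(SAMPLE)
# Fraction of the reference assembly covered by CSBs at each resolution.
coverage = {
    res: table["CSB_length(ref)"][res] / table["AS_length(ref)"][res]
    for res in table["AS_length(ref)"]
}
```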
chainNet/
Directory containing alignment results
CSB/
Directory containing the CSBs constructed at each resolution
data/
Directory for package input data
scores.txt
A file containing the GMASS score as well as the Ci, Si, and Li scores at each resolution
Example
#Resolution  Li score           Ci score  Si score
10000        0.996889512514958  1         0.996889512514958
20000        0.996889512514958  1         0.996889512514958
30000        0.996889512514958  1         0.996889512514958
40000        0.996917865804394  1         0.996917865804394
50000        0.996917865804394  1         0.996917865804394
-- GMASS 0.996900853830732
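In this example the reported GMASS value matches the arithmetic mean of the per-resolution Si scores. The sketch below parses the example and checks that relationship. It assumes one row per line with whitespace-separated columns, and the mean relationship is inferred from these example values rather than stated in this README; `parse_scores` is a hypothetical helper.

```python
SAMPLE = """\
#Resolution Li score Ci score Si score
10000 0.996889512514958 1 0.996889512514958
20000 0.996889512514958 1 0.996889512514958
30000 0.996889512514958 1 0.996889512514958
40000 0.996917865804394 1 0.996917865804394
50000 0.996917865804394 1 0.996917865804394
-- GMASS 0.996900853830732
"""


def parse_scores(text):
    """Return (rows, gmass); rows map resolution -> (Li, Ci, Si)."""
    rows = {}
    gmass = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == "--":
            # Summary line: the last field is the GMASS score.
            gmass = float(fields[-1])
            continue
        res, li, ci, si = fields
        rows[int(res)] = (float(li), float(ci), float(si))
    return rows, gmass


rows, gmass = parse_scores(SAMPLE)
mean_si = sum(si for _, _, si in rows.values()) / len(rows)
```

Comparing `mean_si` with `gmass` to within a small tolerance gives a quick consistency check on a scores file.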
Kwon D, Lee J, Kim J. GMASS: a novel measure for genome assembly structural similarity. BMC Bioinformatics. 2019 Mar 18;20(1):147. doi: 10.1186/s12859-019-2710-z.
bioinfolabkr@gmail.com