hattori-lab / SACRA

SACRA (Split Amplified Chimeric Read Algorithm) is an algorithm for correcting chimeric long reads generated by multiple displacement amplification (MDA).
MIT License

SACRA

Split Amplified Chimeric Read Algorithm
SACRA splits chimeric reads into non-chimeric reads in long-read data from MDA-treated samples.

Dependencies

last (version 963). http://last.cbrc.jp/

seqkit (version: 0.8.0). https://bioinf.shenwei.me/seqkit/usage/

seqtk (version: 1.2-r102-dirty). https://github.com/lh3/seqtk

Workflow of SACRA

SACRA operates in five steps: 1. alignment, 2. PARs depth, 3. PC ratio calculation, 4. mPC ratio calculation, and 5. split.

STEP 1. Alignment

SACRA performs all-vs-all pairwise alignment of the input long reads with the LAST aligner to construct aligned read clusters (ARCs). For best performance, the input long reads should be relatively accurate, for example after error correction (e.g. MHAP of canu) or as PacBio HiFi reads. In the original paper, the error-corrected long reads had an average accuracy of 97%. This step is time-consuming, so we recommend using multiple threads.

STEP 2. PARs depth

Detect the partially aligned reads (PARs) and candidate chimeric positions from the alignment results of STEP 1, and obtain the depth of PARs at those positions.

STEP 3. Calculate PC ratio

Calculate the depth of continuously aligned reads (CARs) and the PARs/CARs ratio (PC ratio) at the candidate chimeric positions.
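As an illustration only, the calculation in this step could be sketched in Python as follows. This is not SACRA's own code. The function names, the 1-based coordinates, the rule that a CAR must extend at least `min_flank` bases on both sides of the position (modelled on the `ad` parameter in config.yml), and returning `None` when no CAR covers the position are all assumptions. The ratio is expressed in percent, matching the pcratio output column.

```python
def car_depth(alignments, position, min_flank=50, min_identity=75):
    """Count alignments that span `position` continuously.

    `alignments` is an iterable of (start, end, identity) tuples on the
    query read (hypothetical layout, 1-based inclusive coordinates).
    An alignment counts as a CAR here if it passes the identity filter and
    extends at least `min_flank` bases on both sides of the position.
    """
    depth = 0
    for start, end, identity in alignments:
        if identity < min_identity:
            continue
        if start <= position - min_flank and end >= position + min_flank:
            depth += 1
    return depth


def pc_ratio(par_depth, car_depth_value):
    """Return 100 * PARs / CARs, or None if no CAR covers the position.

    Returning None for a CAR depth of zero is an assumption of this sketch.
    """
    if car_depth_value == 0:
        return None
    return 100.0 * par_depth / car_depth_value


# Toy alignments around candidate position 200 (values are made up).
alignments = [(1, 500, 98.0), (150, 400, 90.0), (180, 600, 60.0), (10, 220, 99.0)]
cars = car_depth(alignments, position=200)
print(cars, pc_ratio(1, cars))
```

In this toy input, the third alignment fails the identity filter and the fourth ends too close to the position, so only two alignments count as CARs.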

STEP 4. Calculate mPC ratio

Calculate the mPC ratio based on the provided spike-in reference genome. Even without a spike-in genome, applying mPC=10 can markedly reduce the number of chimeric reads.

STEP 5. Split

Split the chimeric reads at the chimeric positions detected in STEP 3.
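The splitting operation itself can be pictured with a short Python sketch. It is a simplified illustration under stated assumptions, not the code SACRA runs: the function name `split_read`, the convention that a read is cut immediately after each 1-based position, and the `_1`, `_2`, ... suffixes for the resulting pieces are all hypothetical.

```python
def split_read(read_id, sequence, positions):
    """Cut `sequence` after each 1-based position in `positions`.

    Returns a list of (piece_id, subsequence) tuples. Positions outside the
    read, and duplicate positions, are ignored. The piece naming scheme is
    an assumption made for this example.
    """
    cuts = sorted({p for p in positions if 0 < p < len(sequence)})
    bounds = [0] + cuts + [len(sequence)]
    pieces = []
    for index, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
        pieces.append((f"{read_id}_{index}", sequence[start:end]))
    return pieces


for piece_id, piece in split_read("read1", "AAAACCCCGG", [4, 8]):
    print(piece_id, piece)
```

A read with no accepted chimeric position comes back as a single piece containing the whole sequence.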

Installation

git clone https://github.com/hattori-lab/SACRA.git
export PATH=$PATH:/path_on_your_system/SACRA/scripts/

Usage

Run the following command in the directory containing config.yml.

sh SACRA.sh [-i <input fasta file>] [-p <prefix>] [-t <max no. of cpu cores>] [-c <config.yml>]
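If SACRA is launched from a Python pipeline, the call could be wrapped as in the sketch below. It uses only the options shown in the usage line above. The helper names `build_sacra_command` and `run_sacra`, and the example file names, are hypothetical, and the sketch assumes `SACRA.sh` can be found from the working directory as in the command above.

```python
import subprocess


def build_sacra_command(fasta, prefix, threads, config="config.yml"):
    """Assemble the SACRA.sh call from the four documented options."""
    return [
        "sh", "SACRA.sh",
        "-i", str(fasta),    # input FASTA file
        "-p", str(prefix),   # prefix
        "-t", str(threads),  # maximum number of CPU cores
        "-c", str(config),   # configuration file
    ]


def run_sacra(fasta, prefix, threads, config="config.yml", workdir="."):
    """Run SACRA from `workdir`, which should contain the config file."""
    command = build_sacra_command(fasta, prefix, threads, config)
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(command, cwd=workdir, check=True)


print(" ".join(build_sacra_command("reads.fasta", "sample", 8)))
```

Passing the command as a list rather than a single string avoids shell quoting problems with unusual file names.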

Config file

The parameters of each step can be changed by editing config.yml.

---

alignment:
  R: "01"           : Specify lowercase-marking of lastdb.
  u: "NEAR"         : Specify a seeding scheme of lastdb.
  a: 0              : Gap existence cost.
  A: 10             : Insertion existence cost.
  b: 15             : Gap extension cost.
  B: 7              : Insertion extension cost.
  S: 1              : Specify how to use the substitution score matrix for reverse strands.
  f: "BlastTab+"    : Output format of LAST. SACRA accepts only BlastTab+ format.

parsdepth:
  al: 100           : Minimum alignment length.
  tl: 50            : Minimum terminal length of unaligned region of PARs.
  pd: 1             : Minimum depth of PARs.
  id: 75            : Alignment identity threshold of PARs.

pcratio:
  ad: 50            : Minimum length of alignment start/end position from candidate chimeric position.
  id: 75            : Alignment identity threshold of CARs.

mpc:
  sp: "false"       : Whether the mPC ratio is calculated based on the spike-in reference genome or not.
  rf: "lambda.fasta": PATH to the spike-in reference genome.
  R: "01"           : Specify lowercase-marking of lastdb.
  u: "NEAR"         : Specify a seeding scheme of lastdb.
  a: 8              : Gap existence cost.
  A: 16             : Insertion existence cost.
  b: 12             : Gap extension cost.
  B: 5              : Insertion extension cost.
  S: 1              : Specify how to use the substitution score matrix for reverse strands.
  f: "BlastTab+"    : Output format of LAST. SACRA accepts only BlastTab+ format.
  id: 95            : Alignment identity threshold.
  al: 50            : Minimum alignment length.
  lt: 50            : Threshold of the unaligned length for detecting chimeric reads. 

split:
  pc: 10            : Minimum PC ratio (%).
  dp: 0             : Minimum depth of PARs + CARs.
  sl: 100           : Sliding windows threshold.

Output

pcratio: The results of the PC ratio calculation. The output is a tab-delimited file containing six columns: 1. sequence id, 2. read length, 3. candidate chimeric position, 4. depth of PARs, 5. depth of CARs, 6. PC ratio (%).
non_chimera.fasta: Non-chimeric reads.
split.fasta: Split reads.
output.fasta: Final sequences combining non-chimeric and split reads.
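For downstream inspection, the six-column pcratio file can be read with a few lines of Python. The sketch below is a hedged example rather than part of SACRA. It assumes the file has no header row and that the depth columns are integers; the names `PcRatioRecord`, `read_pcratio` and `positions_to_split` are hypothetical. The filter mirrors the `pc` and `dp` parameters of the split section, and treating both thresholds as inclusive is an assumption.

```python
import csv
from typing import NamedTuple


class PcRatioRecord(NamedTuple):
    sequence_id: str
    read_length: int
    position: int      # candidate chimeric position
    par_depth: int
    car_depth: int
    pc_ratio: float    # in percent


def read_pcratio(lines):
    """Parse tab-delimited pcratio lines (assumed to have no header)."""
    records = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        records.append(PcRatioRecord(
            row[0], int(row[1]), int(row[2]),
            int(row[3]), int(row[4]), float(row[5]),
        ))
    return records


def positions_to_split(records, min_pc=10.0, min_depth=0):
    """Group positions passing both thresholds by sequence id."""
    selected = {}
    for rec in records:
        if rec.pc_ratio >= min_pc and rec.par_depth + rec.car_depth >= min_depth:
            selected.setdefault(rec.sequence_id, []).append(rec.position)
    return selected


# Made-up example rows; with a real file, pass an open file handle instead.
example = [
    "read1\t1000\t200\t3\t10\t30.0",
    "read1\t1000\t700\t1\t20\t5.0",
    "read2\t500\t100\t2\t4\t50.0",
]
print(positions_to_split(read_pcratio(example)))
```

With a file on disk, `read_pcratio(open(path, newline=""))` works the same way, because `csv.reader` accepts any iterable of lines.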

Citation

Yuya Kiguchi, Suguru Nishijima, Naveen Kumar, Masahira Hattori, Wataru Suda, Long-read metagenomics of multiple displacement amplified DNA of low-biomass human gut phageomes by SACRA pre-processing chimeric reads, DNA Research, Volume 28, Issue 6, December 2021, dsab019, https://doi.org/10.1093/dnares/dsab019

Docker Image

TBA