vtsyvina / CliqueSNV

MIT License

CliqueSNV

How to Run

Download the jar from here (latest version 2.0.3, December 2021)

How to build

Build from source only if you want to modify the program. Otherwise, use the provided jar file.

mvn clean install

This creates clique-snv.jar in the root folder.

Citation

Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res. 2021 Jul 2:gkab576. doi: 10.1093/nar/gkab576. Epub ahead of print. PMID: 34214168.

https://pubmed.ncbi.nlm.nih.gov/34214168/

Stable releases

1.4.1 - 12 February 2018

1.4.11 - 2 January 2020

1.5.3 - 1 June 2020

2.0.3 - 2 December 2021

Parameters

There are several available parameters:

Output parameters

t and tf parameters choice

These two parameters are significant because they set the trade-off between precision and recall. By default, they are set to detect moderate-frequency haplotypes (>5%). If the data is known not to be very noisy and variants with frequency >1% are of interest, then -tf should be around 0.01; -t is optional and depends on coverage.

Difference between -sp -ep and -os -oe parameters

The (-sp, -ep) range doesn't cut the output; it only restricts the region where the tool works. The (-os, -oe) range cuts the output without any effect on the program workflow. Example: suppose there are four haplotypes (with none of the sp, ep, os, oe parameters specified):

AAAAAA
ACAAAC
AACCAA
AATGAA

(-sp=2, -ep=4). Output (the second haplotype disappears because its SNPs are out of range and won't be discovered):

AAAAAA
AACCAA
AATGAA

(-sp=2, -ep=4, -os=2, -oe=5):

AAA
CCA
TGA

(-os=2, -oe=5). This only cuts the haplotypes, so you will see a duplicate here:

AAA
AAA
CCA
TGA
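
The difference between the two ranges can be mimicked with a small toy model in Python. This is only a sketch of the behaviour described above, not CliqueSNV's actual algorithm. It assumes 0-based positions with an exclusive end (inferred from the example), treats the first haplotype as the reference, and the function name `simulate` is hypothetical.

```python
def simulate(haplotypes, sp=None, ep=None, os_=None, oe=None):
    """Toy model of the (-sp, -ep) and (-os, -oe) ranges (assumed 0-based, end-exclusive)."""
    reference = haplotypes[0]  # assumption: first haplotype plays the role of the reference
    n = len(reference)
    sp = 0 if sp is None else sp
    ep = n if ep is None else ep

    # Working region: SNPs outside [sp, ep) are never seen, so those positions
    # fall back to the reference and indistinguishable haplotypes collapse.
    seen, discovered = set(), []
    for hap in haplotypes:
        masked = "".join(
            base if sp <= i < ep else reference[i] for i, base in enumerate(hap)
        )
        if masked not in seen:
            seen.add(masked)
            discovered.append(masked)

    # Output range: a plain cut applied afterwards, with no de-duplication.
    lo = 0 if os_ is None else os_
    hi = n if oe is None else oe
    return [hap[lo:hi] for hap in discovered]


haps = ["AAAAAA", "ACAAAC", "AACCAA", "AATGAA"]
print(simulate(haps, sp=2, ep=4))   # ['AAAAAA', 'AACCAA', 'AATGAA']
print(simulate(haps, os_=2, oe=5))  # ['AAA', 'AAA', 'CCA', 'TGA']
```

The working range changes which haplotypes are found, while the output range only trims what is printed, which is why the duplicate survives in the second call.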

Usage example

java -jar clique-snv.jar -m snv-pacbio

java -jar clique-snv.jar -m snv-illumina (unzip the SAM file from the 'data' folder beforehand)

java -jar clique-snv.jar -m snv-illumina -in /path/to/data/r.sam -log

java -jar clique-snv.jar -m snv-illumina-vc -in /path/to/data/r.sam -outDir vcf_out/ -t 10 -tf 0.00034 -threads 8 -log

Example datasets

There are two example datasets:

data/flu_ref.fasta contains those haplotypes as ground truth

How to run:

java -jar clique-snv.jar -m snv-pacbio -log -in data\PacBio_reads\reads.sam

java -jar clique-snv.jar -m snv-illumina -in data\Illumina_reads\reads.sam

Memory usage

In our experience, the tool consumes around 10 GB of RAM (an upper-bound estimate) per 1,000,000 input reads, though this may vary depending on a number of factors. To change the default JVM heap size limit, specify the -Xmx flag. Example with 50 GB:

java -Xmx50G -jar clique-snv.jar -m snv-illumina -in data\Illumina_reads\reads.sam
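
As a rough helper, the rule of thumb above can be turned into a small Python sketch that counts alignment records in a SAM file and suggests an -Xmx value. The function names are hypothetical, the record count only approximates the number of reads (it counts every non-header line), and the 10 GB per million figure is the upper-bound estimate quoted above, not a guarantee.

```python
import math


def count_sam_records(lines):
    """Count non-empty, non-header lines (SAM header lines start with '@')."""
    return sum(1 for line in lines if line.strip() and not line.startswith("@"))


def suggest_heap_gb(n_reads, gb_per_million=10):
    """Scale the ~10 GB per 1,000,000 reads estimate, rounding up to whole GB."""
    return max(1, math.ceil(n_reads / 1_000_000 * gb_per_million))


def build_command(sam_path, n_reads, mode="snv-illumina"):
    """Assemble the java invocation using only flags shown in this README."""
    return [
        "java", f"-Xmx{suggest_heap_gb(n_reads)}G",
        "-jar", "clique-snv.jar",
        "-m", mode,
        "-in", sam_path,
    ]


# Usage with a real file (path is a placeholder):
# with open("/path/to/data/r.sam") as handle:
#     n = count_sam_records(handle)
# print(" ".join(build_command("/path/to/data/r.sam", n)))
print(" ".join(build_command("/path/to/data/r.sam", 5_000_000)))
```

For 5,000,000 reads this suggests -Xmx50G, matching the example above.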

Output

For the default quasispecies problem, CliqueSNV outputs two files: JSON and FASTA. The JSON file contains the parameters used, the CliqueSNV version, and the haplotypes found:

{
  "version": "1.5.5",
  "settings": {
    "-m": "snv-pacbio",
    "-log": "true",
    "-in": "data\\PacBio_reads\\reads.sam",
    "-t": "10",
    "-rn": "true",
    "-tf": "0.0001"
  },
  "error": "none",
  "foundHaplotypes": 10,
  "haplotypes": [
    {
      "frequency": 0.5275842396392558,
      "name": "\u003e0_fr_0.5275842396392558",
      "snps": "[]",
      "sourceClique": "[]",
      "haplotype": "GGAAAGAATAAAAGAACTAAGGAATCTAA..."
    },
    {
      "frequency": 0.23674173677646565,
      "name": "\u003e1_fr_0.23674173677646565",
      "snps": "GT-TTATTAC[31, 265, 288, 396, 617, 747, 997, 1120, 1147, 2013]",
      "sourceClique": "GT-TTATTAC[31, 265, 288, 396, 617, 747, 997, 1120, 1147, 2013]",
      "haplotype": "GGAAAGAATAAAAGAACTAAGGAATCTAA..."
    },
 ...
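
A report like this can be read with a few lines of Python. The sketch below assumes the field names shown in the sample above and assumes that the "snps" string pairs one allele character with each listed position, which is inferred from the sample rather than documented. The embedded report is a made-up, shortened stand-in, and the function names are hypothetical.

```python
import json
import re

# Made-up, shortened stand-in for a CliqueSNV JSON report.
SAMPLE = r'''
{
  "version": "1.5.5",
  "foundHaplotypes": 2,
  "haplotypes": [
    {"frequency": 0.75, "name": "\u003e0_fr_0.75", "snps": "[]",
     "sourceClique": "[]", "haplotype": "GGAAAG"},
    {"frequency": 0.25, "name": "\u003e1_fr_0.25", "snps": "GT[1, 4]",
     "sourceClique": "GT[1, 4]", "haplotype": "GGAATG"}
  ]
}
'''


def parse_snps(field):
    """Split a string like 'GT[1, 4]' into [(1, 'G'), (4, 'T')]."""
    match = re.fullmatch(r"([^\[\]]*)\[(.*)\]", field)
    if match is None:
        raise ValueError(f"unexpected snps field: {field!r}")
    alleles, body = match.groups()
    positions = [int(p) for p in body.split(",") if p.strip()]
    if len(alleles) != len(positions):
        raise ValueError("allele and position counts differ")
    return list(zip(positions, alleles))


def summarize(report_text):
    """Return (name, frequency, snps) for each haplotype in the report."""
    report = json.loads(report_text)
    return [
        (hap["name"], hap["frequency"], parse_snps(hap["snps"]))
        for hap in report["haplotypes"]
    ]


for name, frequency, snps in summarize(SAMPLE):
    print(name, frequency, snps)
```

Note that "\u003e" is just the JSON escape for ">", so the decoded names match the FASTA headers.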

The FASTA file looks like this:

>1_fr_0.5820184401895632
CCACAGCACGCAGATTGGTGGAATAAGGATGGTAAACATCCTTAGGCAGAACCC....

>2_fr_0.24979076133465727
CCACAGCACGCAGATTGGTGGAATAAGGATGGTAAACATCCTTAGGCAGAACCC...
 ...

Each name is just an index plus the haplotype frequency.
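
Because the header carries both values, the frequencies can be recovered from the FASTA file alone. The following Python sketch assumes headers of the form `>index_fr_frequency` as in the sample above; the function names are hypothetical and the input lines in the usage example are made up.

```python
def _make_record(name, chunks):
    # Header format assumed from the sample: "<index>_fr_<frequency>".
    index, _, frequency = name.partition("_fr_")
    return {
        "index": int(index),
        "frequency": float(frequency),
        "sequence": "".join(chunks),
    }


def read_haplotypes(lines):
    """Parse FASTA lines into records, joining sequences wrapped over several lines."""
    records, name, chunks = [], None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append(_make_record(name, chunks))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append(_make_record(name, chunks))
    return records


# Usage with made-up input; pass an open file handle for a real FASTA file.
example = [">1_fr_0.75", "ACGT", "AC", "", ">2_fr_0.25", "ACGA"]
for record in read_haplotypes(example):
    print(record["index"], record["frequency"], record["sequence"])
```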

For the variant calling problem, the program produces a standard VCF file. The standard is described here

New in version 2.0.0 (sliding window approach)

CliqueSNV could perform poorly on Illumina samples with long references and significant diversity because of the many linked SNPs. To address this, version 2.0.0 introduced a new approach that builds haplotypes in windows, increasing the size of the window on each iteration. The window size is equal to the sample's fragment length (usually 250 for single-end reads and around 500 for paired-end reads). Haplotype reconstruction starts in the region of highest coverage and extends the window to the left or right based on coverage.

This approach allows CliqueSNV to work on more diverse samples or samples with longer references. We tested the new version on benchmarks with a reference length of 5000-10000, and CliqueSNV was able to reconstruct the ground truth haplotypes successfully.
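
The window growth described above can be illustrated with a short Python sketch. It is not CliqueSNV's implementation: the extension step size, the tie-breaking rule, and the function name `window_schedule` are assumptions made for illustration, and only the order in which regions would be visited is modelled, not the haplotype reconstruction itself.

```python
def window_schedule(coverage, window, step):
    """Return the successive (start, end) windows, end-exclusive.

    Starts at the window with the highest total coverage and repeatedly
    extends by `step` positions towards the side with more coverage.
    """
    n = len(coverage)
    window = min(window, n)

    # Prefix sums give the total coverage of any range in O(1).
    prefix = [0]
    for depth in coverage:
        prefix.append(prefix[-1] + depth)

    def total(lo, hi):
        return prefix[hi] - prefix[lo]

    start = max(range(n - window + 1), key=lambda s: total(s, s + window))
    lo, hi = start, start + window
    schedule = [(lo, hi)]

    while lo > 0 or hi < n:
        left_lo = max(0, lo - step)
        right_hi = min(n, hi + step)
        if lo == 0:
            hi = right_hi          # only the right side can still grow
        elif hi == n:
            lo = left_lo           # only the left side can still grow
        elif total(left_lo, lo) > total(hi, right_hi):
            lo = left_lo
        else:
            hi = right_hi          # ties go right (arbitrary choice)
        schedule.append((lo, hi))
    return schedule


# Made-up per-position coverage, for illustration only.
coverage = [1, 1, 5, 9, 9, 5, 1, 1, 1, 1]
for lo, hi in window_schedule(coverage, window=2, step=2):
    print(lo, hi)
```

In practice the initial window would be the fragment length (for example 250 or 500) rather than the tiny value used here.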

Versions:

1.1.0 - add allele frequency for variant calling

1.2.0 - new cliques merging strategy; change true frequency estimator

1.3.0

1.3.1 - parallel execution for Illumina input preprocessing

1.3.2

1.4.0

1.4.1

1.4.2

1.4.3

1.4.4

1.4.5

1.4.6

1.4.7

1.4.8

1.4.9

1.4.10

1.4.11

1.5.0

1.5.1

1.5.2

1.5.3

1.5.4

1.5.5

1.5.6

1.5.7

2.0.0

2.0.3

Any questions

For any questions, please contact: v.tsyvina@gmail.com