twolinin / longphase

GNU General Public License v3.0
98 stars 6 forks source link

LongPhase

LongPhase is an ultra-fast program for simultaneously co-phasing SNPs, small indels, large SVs, and (5mC) modifications for Nanopore and PacBio platforms. It can produce nearly chromosome-scale haplotype blocks by using Nanpore ultra-long reads without the need for additional trios, chromosome conformation, and strand-seq data. LongPhase can phase a 30x human genome in ~1 minute (see Speed).


wget https://github.com/twolinin/longphase/releases/download/v1.7.3/longphase_linux-x64.tar.xz
tar -xJf longphase_linux-x64.tar.xz

An executable file, longphase_linux-x64, can be executed directly. If you need to compile a local version, you can clone and compile using the following commands, and make sure that the environment has zlib installed. If you require setting up a virtual environment, we also provide a Dockerfile.

git clone https://github.com/twolinin/longphase.git
cd longphase
autoreconf -i
./configure
make -j 4

Usage

Phase command

SNP-only phasing

For SNP-only phasing, the input of LongPhase consists of SNPs in VCF (e.g., SNP.vcf), an indexed reference in Fasta (e.g., reference.fasta, reference.fasta.fai), and one (or multiple) indexed read-to-reference alignment in BAM (e.g., alignment1.bam, alignment1.bai, alignment2.bam, ...) (see Input Preparation). The users should specify the sequencing platform (--ont for Nanopore and --pb for PacBio). An example of SNP phasing usage is shown below.

longphase phase \
-s SNP.vcf \
-b alignment1.bam \
-b alignment2.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio Hifi

SNP and indel co-phasing

When the SNP.vcf file contains both SNP and small insertion/deletion variations (indels), use the --indels parameter for co-phasing SNPs and indels.

longphase phase \
-s SNP.vcf \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--indels \
--ont # or --pb for PacBio Hifi

SNP and SV co-phasing

When co-phasing SNPs and SVs, except for the same input (i.e., SNPs, reference, and alignments), LongPhase takes an extra input of called SVs in VCF (e.g., SV_sniffles.vcf), which should be generated by Sniffles (with --num_reads_report in sniffles1 and --output-rnames in sniffles2) or CuteSV (with --report_readid--genotype)(see Input Preparation).

longphase phase \
-s SNP.vcf \
--sv-file SV.vcf \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio Hifi

SNP and modification co-phasing

When co-phasing SNPs and modifications (5mC supported at this moment), it is necessary to first call the modified loci (in VCF) using modcall and then activate SNP-modification co-phasing via specifying the --mod-file with the modcall-generated VCF. The phased SNPs and modifications will be written into new VCFs accordingly.

longphase phase \
-s SNP.vcf \
--mod-file modcall.vcf \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio Hifi

The complete list of phase parameters

Usage:  phase [OPTION] ... READSFILE
   --help                                 display this help and exit.

require arguments:
   -s, --snp-file=NAME                    input SNP vcf file.
   -b, --bam-file=NAME                    input bam file.
   -r, --reference=NAME                   reference fasta.
   --ont, --pb                            ont: Oxford Nanopore genomic reads.
                                          pb: PacBio HiFi/CCS genomic reads.

optional arguments:
   --sv-file=NAME                         input SV vcf file.
   --mod-file=NAME                        input modified vcf file.(produce by longphase modcall)
   -t, --threads=Num                      number of thread. default:1
   -o, --out-prefix=NAME                  prefix of phasing result. default: result
   --indels                               phase small indel. default: False
   --dot                                  each contig/chromosome will generate dot file.

parse alignment arguments:
   -q, --mappingQuality=Num               filter alignment if mapping quality is lower than threshold. default:1

phasing graph arguments:
   -p, --baseQuality=[0~90]               change edge's weight to --edgeWeight if base quality is lower than the threshold. default:12
   -e, --edgeWeight=[0~1]                 if one of the bases connected by the edge has a quality lower than --baseQuality
                                          its weight is reduced from the normal 1. default:0.1
   -a, --connectAdjacent=Num              connect adjacent N SNPs. default:20
   -d, --distance=Num                     phasing two variant if distance less than threshold. default:300000
   -1, --edgeThreshold=[0~1]              give up SNP-SNP phasing pair if the number of reads of the 
                                          two combinations are similar. default:0.7
haplotag read correction arguments:
   -m, --readConfidence=[0.5~1]           The confidence of a read being assigned to any haplotype. default:0.65
   -n, --snpConfidence=[0.5~1]            The confidence of assigning two alleles of a SNP to different haplotypes. default:0.75

Output of SNP and indel phasing

When phasing SNPs alone, LongPhase outputs the results into a VCF file. The alleles of the two haplotypes are stored in the GT field (e.g., 1|0), whereas the left and right alleles of the vertical bar represent the paternal or maternal haplotypes. The last PS field (e.g., 16809) represents the identifier of the block. For instance, the following example illustrates two haplotypes of five-phased SNPs, CCCCC and GATGT, in the same block 16809. The output of phased indels is similar and in the same VCF.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  default
1       16809   .       C       G       8.4     PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:8:79:51,28:0.35443:7,0,16:16809
1       16949   .       A       C       8.4     PASS    .       GT:GQ:DP:AD:VAF:PL:PS   0|1:8:67:43,21:0.313433:7,0,25:16809
1       21580   .       C       T       13.9    PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:14:75:50,24:0.32:13,0,30:16809
1       23359   .       C       G       13.2    PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:13:52:24,18:0.346154:13,0,42:16809
1       24132   .       C       T       11.1    PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:11:63:41,17:0.269841:10,0,29:16809

Output files of SNP and SV co-phasing

When co-phasing SNPs and SVs, two VCFs (one for SNPs and one for SVs) are outputted. Similarly, the phased SVs are stored in the GT field and the block ID is in the PS field. For instance, the following example illustrates two haplotypes of five SNPs and two SVs, A\<INS>G\<noSV>TCC and G\<noSV>A\<INS>ATT, which are co-phased in the same block 382189.

An example of SNP VCF file

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  default
1       465289  .       A       G       34.2    PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:4:16:2,11:0.6875:31,0,1:382189
1       544890  .       G       A       3.1     PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:3:31:30,0:0:0,0,31:382189
1       545612  .       T       A       6.8     PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:7:37:28,8:0.216216:5,0,32:382189
1       545653  .       C       T       14      PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:14:44:30,14:0.318182:13,0,26:382189
1       561458  .       C       T       5.1     PASS    .       GT:GQ:DP:AD:VAF:PL:PS   1|0:5:17:14,0:0:3,0,17:382189

An example of SV VCF file

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  default
1       534057  7       N       AAATTCGCGCATATCACGGGTGCCGCCTCTGTGCAGCTCACGAAACGCCATACTACGGTGCTGCTCAGCAGCTACGGAATCGCTATACCTACGCGAGCTGCCTCAGCAGCCAC       .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=534128;STD_quant_start=25.369041;STD_quant_stop=29.191054;Kurtosis_quant_start=-0.239012;Kurtosis_quant_stop=-1.333733;SVTYPE=INS;RNAMES=0120d560-50f0-4298-8b03-7bd30f3cf139,030ac5d4-e616-4ce9-8ad3-243835335085,0cf1b0d9-2b4d-463d-a658-01b4b040dc63,1c9e982a-8af7-4ba0-8cc2-154a679c72e2,22e11f79-0067-4735-8b69-97d951ca702f,2ca8a6f4-be9d-4df5-80d2-dc1743f97a84,35dff960-22b6-4216-af69-8878b8860362,390d2fb4-9224-41a1-a9fe-6cb3bbe4273a,3e333422-12ca-4f16-afb8-ed7611dcbc2c,3e8ed78a-b857-4941-bbc1-52ca51e26c08,4191371c-49ea-466d-aadc-06f27cdf1050,4aaae789-54fe-4fa5-84b3-5524dc2b3796,581e0cfb-2491-44d7-a2e1-ba1516ba0f2f,59749531-9abf-4ff4-a4a1-31484ba3d32d,5c97b0a9-925e-4153-952d-0f437171d3dc,6067590e-956c-442f-bbb7-cae597d616ad,623804bb-e2fe-415d-96ae-3d06aec63e5d,672244ce-2d5d-45cf-beb2-ddeddae917e8,6b79aa23-7c9c-49dc-9b88-8419c88c7a36,6e60d235-6654-4ef8-9feb-70f12a397721,6fbb55c5-57fc-43bc-8a24-b0058778054c,8e10bf13-9674-489c-924e-182a42e08a34,aa6ba092-4221-4d54-8819-811448c34983,af2169b3-b308-4db5-9675-15ff5f68d8dd,b214fcbd-77de-4dd5-84db-6d2b7e1f3158,c140eaba-e0e7-44e7-9f16-c8c67fd4a2f2,c7835cf7-44c0-44da-b10e-b2468fc8caab,ca4aa84d-34d1-4639-8634-b6a5540129ca,caba4bde-cdc5-4344-9803-a3c158525b0c,e0747feb-60bb-40db-a144-a9b43dd13256,e6992c7d-c00e-40e7-b80b-562094a9b60f,e8bb376c-20e0-4bed-a61f-b82b5c37ef6f,f3242a61-deec-49e7-b99f-335a1ba13791,f87dfdf7-7b68-421b-b395-3769a5fa3ac1,f91a7627-7fdb-4f03-8f33-0ed1649d96fe;SUPTYPE=AL;SVLEN=43;STRANDS=+-;RE=35;REF_strand=44;AF=0.795455    GT:DR:DV:PS     0|1:9:35:382189
1       545892  8       N       ACACGCGGGCCGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCC   .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=545893;STD_quant_start=28.919840;STD_quant_stop=28.543200;Kurtosis_quant_start=-0.382251;Kurtosis_quant_stop=-0.130808;SVTYPE=INS;RNAMES=0120d560-50f0-4298-8b03-7bd30f3cf139,030ac5d4-e616-4ce9-8ad3-243835335085,0cf1b0d9-2b4d-463d-a658-01b4b040dc63,22e11f79-0067-4735-8b69-97d951ca702f,2ca8a6f4-be9d-4df5-80d2-dc1743f97a84,3977c988-9901-4e5b-9f9c-b8ebfcce8e93,3e333422-12ca-4f16-afb8-ed7611dcbc2c,4191371c-49ea-466d-aadc-06f27cdf1050,4aaae789-54fe-4fa5-84b3-5524dc2b3796,5933e1b7-1aeb-4437-a875-3befbf703420,623804bb-e2fe-415d-96ae-3d06aec63e5d,672244ce-2d5d-45cf-beb2-ddeddae917e8,6b79aa23-7c9c-49dc-9b88-8419c88c7a36,7842d9f1-9a77-4c9a-ab5b-5a644ed2d355,7ba26d64-d9b0-475f-8d5f-1fa73fc42d93,8e10bf13-9674-489c-924e-182a42e08a34,a2b1b2ef-1e28-465e-8b3f-c44e15990d8b,a45514f1-4aae-40eb-94eb-2969722a7b05,b8181546-6839-49cd-b64f-b65c96369a2b,c140eaba-e0e7-44e7-9f16-c8c67fd4a2f2,c7835cf7-44c0-44da-b10e-b2468fc8caab,ca4aa84d-34d1-4639-8634-b6a5540129ca,d56f0abe-4389-4197-a151-0eb567fb99f0,e6992c7d-c00e-40e7-b80b-562094a9b60f,e8bb376c-20e0-4bed-a61f-b82b5c37ef6f,ec325153-0c55-4ece-8f3c-c432701e6750,f3242a61-deec-49e7-b99f-335a1ba13791,f91a7627-7fdb-4f03-8f33-0ed1649d96fe;SUPTYPE=AL;SVLEN=62;STRANDS=+-;RE=28;REF_strand=51;AF=0.54902        GT:DR:DV:PS     1|0:23:28:382189

Haplotag command

This command tags (assigns) each read (in BAM) to one haplotype in the phased SNP/SV VCF. i.e., reads will be tagged as HP:i:1 or HP:i:2. In addition, the haplotype block of each read is stored in the PS tag. The phased VCF can be also generated by other programs as long as the PS or HP tags are encoded. The author can specify --log for additionally output a plain-text file containing haplotype tags of each read without parsing BAM.

longphase haplotag \
-r reference.fasta \
-s phased_snp.vcf \
--sv-file phased_sv.vcf \
-b alignment.bam \
-t 8 \
-o tagged_bam_prefix

The complete list of haplotag parameters

Usage:  haplotag [OPTION] ... READSFILE
      --help                          display this help and exit.

require arguments:
      -s, --snp-file=NAME             input SNP vcf file.
      -b, --bam-file=NAME             input bam file.
      -r, --reference=NAME            reference fasta.
optional arguments:
      --tagSupplementary              tag supplementary alignment. default:false
      --sv-file=NAME                  input phased SV vcf file.
      --mod-file=NAME                 input a modified VCF file (produced by longphase modcall and processed by longphase phase).
      -q, --qualityThreshold=Num      not tag alignment if the mapping quality less than threshold. default:1
      -p, --percentageThreshold=Num   the alignment will be tagged according to the haplotype corresponding to most alleles.
                                      if the alignment has no obvious corresponding haplotype, it will not be tagged. default:0.6
      -t, --threads=Num               number of thread. default:1
      -o, --out-prefix=NAME           prefix of phasing result. default:result
      --region=REGION                 tagging include only reads/variants overlapping those regions. default:(all regions)
                                      input format:chrom (consider entire chromosome)
                                                   chrom:start (consider region from this start to end of chromosome)
                                                   chrom:start-end
      --log                           an additional log file records the result of each read. default:false

Modcall command

Modcall is a module for calling allele-specific modifications in the latest Nanopore and PacBio basecalled reads, assuming the modifications are stored in the (aligned) BAM files using MM and ML tags. These tags are stored in the unaligned BAM outputted by the latest ONT/PacBio basecallers, which can be carried to the aligned BAM through proper preprocessing. Modcall identifies allele-specific modifications and stores them in a VCF file. An example of this command is shown below.

longphase modcall \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o modcall

An output example of a modcall-generated VCF is shown below, which can be provided to longphase for co-phasing with other variants. The phased modifications are written into a new VCF same as SNPs.

##INFO=<ID=RS,Number=.,Type=String,Description="Read Strand">
##INFO=<ID=MR,Number=.,Type=String,Description="Read Name of Modified position">
##INFO=<ID=NR,Number=.,Type=String,Description="Read Name of nonModified position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MD,Number=1,Type=Integer,Description="Modified Depth">
##FORMAT=<ID=UD,Number=1,Type=Integer,Description="Unmodified Depth">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    11200   .       C       .       .       .       RS=P;MR=eb459876-8c81-4714-a496-a90ea8be94d2,6ca3a71f-62fd-416e-8c6e-8c4a9c054e1a,0d8b7c68-d98d-4045-a572-82fedac62da5,4ee6d422-e677-41fc-9ef3-8b387c05973a,71b5dfe9-7cd1-4959-b634-b9d162468edb,6e205f04-560a-4975-932d-dbd60ead695d,ae2238f8-b622-4cb5-8f02-4c3f54ab8ca3,d7ef87d0-bffe-404d-9f10-62eceb0c5121,c0e0c526-e193-4ee2-81d6-1fbe0c970dc1;NR=8249c7c7-fd04-4a4c-985d-2bbcb2030bc4,5db6fcd3-5780-494b-bfdd-4d9d7282d012,e3b03dc0-8399-4e1e-bd0d-e5e4e3e2911a,b8dc7af8-9f78-4dac-b8cc-35e157c51621,0b2638c1-8380-48b4-b08c-a6161495ad9d,7b1e5c16-f0a6-47f6-9726-247c762a10ca,ac7ae685-082e-413e-b5a6-b4c12d49a1c2;     GT:MD:UD:DP     0/1:9:7:16
chr1    11201   .       G       .       .       .       RS=N;MR=7f505676-704b-41b7-a292-666559b40159,7b070045-225e-49ed-9fff-e7cd1bffdfd5,e7556960-d21d-4859-931c-41c6bc196141,69c925f2-ae2c-4cf1-a8c6-6b39868f0d86,7e6a4472-5f8d-4d20-a70b-ec9db403121b,5f879a79-5b28-4a58-b12a-8445c44a37aa;NR=123518e1-4dbe-4a72-9b31-3450b6b8d3da,1783e61c-546a-4414-a8ef-92297ee1dc1f,c9772456-bb97-41c8-b561-001cbcc1233f,f40a87f4-0240-4137-b3ed-b98ba5a479cb,3c6d0a5e-23a5-4a25-9b3b-ed90b19188a0,d584347d-6db5-42bf-8cd8-95ac8a2c51d1; GT:MD:UD:DP     0/1:6:6:12
chr1    11434   .       C       .       .       .       RS=P;MR=eb459876-8c81-4714-a496-a90ea8be94d2,6ca3a71f-62fd-416e-8c6e-8c4a9c054e1a,0d8b7c68-d98d-4045-a572-82fedac62da5,71b5dfe9-7cd1-4959-b634-b9d162468edb,5db6fcd3-5780-494b-bfdd-4d9d7282d012,6e205f04-560a-4975-932d-dbd60ead695d,ae2238f8-b622-4cb5-8f02-4c3f54ab8ca3,d7ef87d0-bffe-404d-9f10-62eceb0c5121;NR=8249c7c7-fd04-4a4c-985d-2bbcb2030bc4,e3b03dc0-8399-4e1e-bd0d-e5e4e3e2911a,b8dc7af8-9f78-4dac-b8cc-35e157c51621,0b2638c1-8380-48b4-b08c-a6161495ad9d,7b1e5c16-f0a6-47f6-9726-247c762a10ca,ac7ae685-082e-413e-b5a6-b4c12d49a1c2,c0e0c526-e193-4ee2-81d6-1fbe0c970dc1;  GT:MD:UD:DP     0/1:8:7:16
chr1    11435   .       G       .       .       .       RS=N;MR=7f505676-704b-41b7-a292-666559b40159,7b070045-225e-49ed-9fff-e7cd1bffdfd5,e7556960-d21d-4859-931c-41c6bc196141,7e6a4472-5f8d-4d20-a70b-ec9db403121b,5f879a79-5b28-4a58-b12a-8445c44a37aa;NR=123518e1-4dbe-4a72-9b31-3450b6b8d3da,1783e61c-546a-4414-a8ef-92297ee1dc1f,c9772456-bb97-41c8-b561-001cbcc1233f,69c925f2-ae2c-4cf1-a8c6-6b39868f0d86,f40a87f4-0240-4137-b3ed-b98ba5a479cb,d584347d-6db5-42bf-8cd8-95ac8a2c51d1;      GT:MD:UD:DP     0/1:5:6:12

The complete list of modcall parameters

Usage:  modcall [OPTION] ... READSFILE
      --help                        display this help and exit.
require arguments:
      -b, --bam-file=NAME           modified sorted bam file.
      -r, --reference=NAME          reference fasta.
optional arguments:
      -o, --out-prefix=NAME         prefix of phasing result. default: modcall_result
      -t, --threads=Num             number of thread. default:1
phasing arguments:
      -m, --modThreshold=[0~1]      value extracted from MM tag and ML tag.
                                    above the threshold means modification occurred. default: 0.8
      -u, --unModThreshold=[0~1]    value extracted from MM tag and ML tag.
                                    above the threshold means no modification occurred. default: 0.2
      -e, --heterRatio=[0~1]        modified and unmodified scales.
                                    a higher threshold means that the two quantities need to be closer. default: 0.7
      -i, --noiseRatio=[0~1]        not being judged as modified and unmodified is noise.
                                    higher threshold means lower noise needs. default: 0.2
      -a, --connectAdjacent=Num     connect adjacent N METHs. default:6
      -c, --connectConfidence=[0~1] determine the confidence of phasing two ASMs.
                                    higher threshold requires greater consistency in the reads. default: 0.9

Input Preparation

Generate reference index

Index the reference genome with samtools.

samtools faidx reference.fasta

Generate alignment and index files

Produce read-to-reference alignment via minimap2 and sort/index the bam by samtools.

# generate alignment flie with minimap2 according to the sequencing platform e.g. map-pb/map-ont/map-hifi
# Note that the MD-tag is required by sniffles (–MD).
minimap2 --MD -ax map-ont -t 10 reference.fasta reads.fastq -o alignment.sam

# sort alignment file
samtools sort -@ 10 alignment.sam -o alignment.bam

# index alignment file
samtools index -@ 10 alignment.bam

Generate single nucleotide polymorphism (SNP) file

e.g. PEPPER-Margin-DeepVariant or Clair3 pipeline.

INPUT_DIR={input data path}
OUTPUT_DIR={output data path}
BAM=alignment.bam
REF=reference.fasta
THREADS=10

sudo docker run \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
kishwars/pepper_deepvariant:r0.7 \
run_pepper_margin_deepvariant call_variant \
-b "${INPUT_DIR}/${BAM}" \
-f "${INPUT_DIR}/${REF}" \
-o "${OUTPUT_DIR}" \
-t "${THREADS}" \
--ont_r9_guppy5_sup

# --ont_r9_guppy5_sup is preset for ONT R9.4.1 Guppy 5 "Sup" basecaller
# for ONT R10.4 Q20 reads: --ont_r10_q20
# for PacBio-HiFi reads: --hifi

Generate Structural variation (SV) file

e.g. sniffles or CuteSV.

# In sniffles1 please specofic --num_reads_report -1. For sniffles2 please specify --output-rnames instead.
sniffles -t 10 --num_reads_report -1 -m alignment.bam -v SV.vcf # for sniffles1
sniffles --threads 10 --output-rnames --input alignment.bam --vcf SV.vcf # for sniffles2

# cuteSV command for PacBio CLR data:
cuteSV alignment.bam reference.fasta SV.vcf work_dir --report_readid --genotype

# additional platform-specific parameters suggested by cuteSV
# PacBio CLR data: 
--max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 200 --diff_ratio_merging_DEL 0.5
# PacBio CCS(HIFI) data: 
--max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0.5
# ONT data: 
--max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3

Carry methylation tags to BAMs

The -T parameter in samtools fastq extracts tags from the BAM file and stores them in the header of the FASTQ file. Please ensure that the BAM file includes both MM and ML tags and carried on in the following way.

samtools fastq -T '*' methylcall.raw.bam > methylcall.raw.fastq

Then, specify the -y option in minimap2 which appends tags stored in the FASTQ header into the BAM file.

minimap2 -ax map-ont -y reference.fasta methylcall.raw.fastq 

Comparison with other SNP-phasing programs

LongPhase is >30x faster than WhatsHap and Margin and produces much larger blocks when tested on HG002, HG003,and HG004. btac058f3

Speed

LongPhase can phase a human genome within 1-2 minutes. phase (-t 24) v1.6 (Time) v1.6 (Memory)
HG002 ONT R10.4.1 10x 39s 15.1G
HG002 ONT R10.4.1 20x 53s 15.6G
HG002 ONT R10.4.1 30x 68s 24.4G
HG002 ONT R10.4.1 40x 217s 26.6G
HG002 ONT R10.4.1 50x 262s 22.2G
HG002 ONT R10.4.1 60x 113s 33.4G

Citation

Jyun-Hong Lin, Liang-Chi Chen, Shu-Qi Yu and Yao-Ting Huang, LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants, Bioinformatics, 2022.


Contact

Yao-Ting Huang, ythuang at cs.ccu.edu.tw