Version history tracking
conda install aquila_stlfr
(Please ensure channels are properly setup for bioconda before installing)
Aquila_stLFR_step1 --help
Aquila_stLFR_step2 --help
Aquila_stLFR_clean --help
Aquila_step1_hybrid --help
Aquila_stLFR_assembly_based_variants_call --help
Aquila_stLFR_phasing_all_variants --help
Aquila_step0_sortbam_hybrid --help
Aquila_stLFR_fastq_preprocess --help
# You can also check the below corresponding scripts for details
#Download the reference file (hg38)
wget https://zenodo.org/record/7689958/files/source.tar.gz
#Download hg38 "Uniqness_map"
wget https://zenodo.org/record/7689958/files/Uniqness_map_hg38.tar.gz
Aquila_stLFR utilizes Python3 (+ numpy, pysam, sortedcontainers, and scipy), SAMtools, and minimap2. To be able to execute the above programs by typing their name on the command line, the program executables must be in one of the directories listed in the PATH environment variable (".bashrc").
Or you could just run "./install.sh" to check their availability and install them if not, but make sure you have installed "python3", "conda" and "wget" first.
git clone https://github.com/maiziex/Aquila_stLFR.git
cd Aquila_stLFR
chmod +x install.sh
./install.sh
After running "./install.sh", a folder "source" would be download, it includes human GRCh38 reference fasta file, or you could also just download it by yourself from the corresponding official websites.
Put the "Aquila_stLFR/bin" in the ".bashrc" file, and source the ".bashrc" file
Or just use the fullpath of "Aquila_stLFR_step1.py" and "Aquila_stLFR_step2.py"
*We provide a small chromosome (chr21) example dataset to run the whole pipeline before you try it into the large dataset.
Aquila_stLFR/bin/Aquila_stLFR_step1.py --fastq_file S12878.fastq --bam_file S12878.bam --vcf_file S12878_freebayes.vcf --sample_name S12878 --out_dir Assembly_results_S12878 --uniq_map_dir Aquila_stLFR/Uniqness_map_hg38
--fastq_file: "S12878.fastq" is the stLFR fastq file (with BX:Z:barcode at the header, you can use Aquila_stLFR/bin/Aquila_stLFR_fastq_preprocess.py to generate the input fastq file, check here for the processing details)
--bam_file: "S12878.bam" is bam file generated from bwa-mem. How to get bam file, you can also check here.
--vcf_file: "S12878_freebayes.vcf" is VCF file generated from variant caller like "FreeBayes". How to get vcf file, you can also check here.
--sample_name: "S12878" are the sample name you can define.
--uniq_map_dir: "Aquila_stLFR/Uniqness_map_hg38" is the uniqness file you can download by "./install.sh".
--out_dir: default = ./Asssembly_results. You can define your own folder, for example "Assembly_results_S12878".
--block_threshold: default = 200000 (200kb)
--block_len_use: default = 100000 (100kb)
--num_threads: default = 8. It's recommended not to change this setting unless large memory node could be used (2*memory capacity(it suggests for assembly below)), then try to use "--num_threads 12".
--chr_start --chr_end: if you only want to assembly some chromosomes or only one chromosome. For example: use "--chr_start 1 --chr_end 5" will assemble chromsomes 1,2,3,4,5. Use "--chr_start 2 --chr_end 2" will only assembly chromosome 2. (*Notes: Use 23 for "chrX")
Coverage | Memory | Time for chr1 on a single node |
---|---|---|
90X | 350GB | 1-10:20:10 |
Coverage | Memory | Time for chr21 on a single node |
---|---|---|
90X | 100GB | 06:27:28 |
Aquila_stLFR/bin/Aquila_stLFR_step2.py --out_dir Assembly_results_S12878 --num_threads 30 --reference Aquila_stLFR/source/ref.fa
--reference: "Aquila_stLFR/source/ref.fa" is the reference fasta file you can download by "./install".
--out_dir: default = ./Asssembly_results, make sure it's the same as "--out_dir" from Step1 if you want to define your own output directory name.
--num_threads: default = 30, this determines the number of files assembled simultaneously by SPAdes.
--num_threads_spades: default = 5, this is the "-t" for SPAdes.
--block_len_use: default = 100000 (100kb)
--chr_start --chr_end: if you only want to assembly some chromosomes or only one chromosome. For example: use "--chr_start 1 --chr_end 2"
Coverage | Memory | Time for chr1 on a single node | --num_threads | --num_threads_spades |
---|---|---|---|---|
90X | 100GB | 11:13:19 | 30 | 20 |
Coverage | Memory | Time for chr21 on a single node | --num_threads | --num_threads_spades |
---|---|---|---|---|
90X | 100GB | 01:38:09 | 30 | 20 |
Assembly_Results_S12878/Assembly_Contigs_files: Aquila_contig.fasta and Aquila_Contig_chr*.fasta
Assembly_results_S12878
|
|-H5_for_molecules
| └-S12878_chr*_sorted.h5 --> (Fragment files for each chromosome including barcode, variants annotation (0: ref allele; 1: alt allele), coordinates for each fragment)
|
|-HighConf_file
| └-chr*_global_track.p --> (Pickle file for saving coordinates of high-confidence boundary points)
|
|-results_phased_probmodel
| └-chr*.phased_final --> (Phased fragment files)
|
|-phase_blocks_cut_highconf
|
|-Raw_fastqs
| └-fastq_by_Chr_* --> (fastq file for each chromosome)
|
|-ref_dir
|
|-Local_Assembly_by_chunks
| └-chr*_files_cutPBHC
| |-fastq_by_*_*_hp1.fastq --> (fastq file for a small phased chunk of haplotype 1)
| |-fastq_by_*_*_hp2.fastq --> (fastq file for a small phased chunk of haplotype 2)
| |-fastq_by_*_*_hp1_spades_assembly --> (minicontigs: assembly results for the small chunk of haplotype 1)
| └-fastq_by_*_*_hp2_spades_assembly --> (minicontigs: assembly results for the small chunk of haplotype 2)
|
└-Assembly_Contigs_files
|-Aquila_cutPBHC_minicontig_chr*.fasta --> (final minicontigs for each chromosome)
|-Aquila_Contig_chr*.fasta --> (final contigs for each chromosome)
└-Aquila_contig.fasta --> (final contigs for WGS)
Aquila_stLFR outputs an overall contig file Aquila_Contig_chr*.fasta
for each chromosome, and one contig file for each haplotype: Aquila_Contig_chr*_hp1.fasta
and Aquila_Contig_chr*_hp2.fasta
. For each contig, the header, for an instance, “>36_PS39049620:39149620_hp1” includes contig number “36”, phase block start coordinate “39049620”, phase block end coordinate “39149620”, and haplotype number “1”. Within the same phase block, the haplotype number “hp1” and “hp2” are arbitrary for maternal and paternal haplotypes. For some contigs from large phase blocks, the headers are much longer and complex, for an instance, “>56432_PS176969599:181582362hp1 merge177969599:178064599_hp1-177869599:177969599_hp1”. “56” denotes contig number, “176969599” denotes the start coordinate of the final big phase block, “181582362” denotes the end coordinate of the final big phase block, and “hp1” denotes the haplotype “1”. “177969599:178064599_hp1” and “177869599:177969599_hp1” mean that this contig is concatenated from minicontigs in small chunk (start coordinate: 177969599, end coordinate: 178064599, and haplotype: 1) and small chunk (start coordinate: 177869599, end coordinate: 177969599, and haplotype: 1).
Aquila_stLFR_clean.py
. Or you can keep them for some analysis (check the above output directory tree for details).Aquila_stLFR/bin/Aquila_stLFR_clean.py --assembly_dir Assembly_results_S12878
Assemlby_results_S12878
as input directory to generate a VCF file which includes SNPs, small Indels and SVs.wget https://zenodo.org/record/7689958/files/Uniqness_map_hg19.tar.gz
Uniqness_map
for Aquila, check the details of hoffmanMappability to get the corresponding "k100.umap.bed.gz", then run Aquila/bin/Get_uniqnessmap_for_Aquila.py
to get the final "Uniqness_map" folder to run Aquila.Uniqness_map
folder to run Aquila, check How_to_get_Umap for details.Aquila_stLFR/bin/Aquila_step1_hybrid.py --bam_file_list 10x.bam,stLFR.bam --vcf_file_list S24385_10x_freebayes.vcf,S24385_stLFR_freebayes.vcf --sample_name_list S24385_10x,S24385_stLFR --out_dir Assembly_results_hybrid --uniq_map_dir Aquila_stLFR/Uniqness_map_hg38
--bam_file: "10x.bam" is bam file generated from barcode-awere aligner like "Lonranger align". "stLFR.bam" is bam file generated from "bwa-mem". Each bam file is seperately by comma (",").
--vcf_file: "S24385_10x_freebayes.vcf" and "S24385_stLFR_freebayes.vcf" are VCF files generated from variant caller like "FreeBayes". Each VCF file is seperately by comma (",").
--sample_name: S24385_10x,S24385_stLFR are the sample names you can define. Each sample name is seperately by comma (",").
--uniq_map_dir: "Aquila_stLFR/Uniqness_map_hg38" is the uniqness file you can download by "./install.sh".
--out_dir: default = ./Asssembly_results
--block_threshold: default = 200000 (200kb)
--block_len_use: default = 100000 (100kb)
--num_threads: default = 8. It's recommended not to change this setting unless large memory node could be used (2*memory capacity(it suggests for assembly below)), then try to use "--num_threads 12".
--num_threads_for_samtools_sort: default = 20. This setting is evoked for "samtools sort".
--chr_start --chr_end: if you only want to assembly some chromosomes or only one chromosome. For example: use "--chr_start 1 --chr_end 5" will assemble chromsomes 1,2,3,4,5. Use "--chr_start 2 --chr_end 2" will only assemlby chromosome 2. (*Notes: Use 23 for "chrX") To use the above option "--chr_start, --chr_end", it is recommended (not required) to run the below command first to save more time for step1.
python Aquila_stLFR/bin/Aquila_step0_sortbam_hybrid.py --bam_file_list ./S24385_Lysis_2/Longranger_align_bam/S24385_lysis_2/outs/possorted_bam.bam,./S24385_Lysis_2H/Longranger_align_bam/S24385_lysis_2H/outs/possorted_bam.bam --out_dir Assembly_results_merged --num_threads_for_samtools_sort 10 --sample_name_list S24385_lysis_2,S24385_lysis_2H
Coverage | Memory | Time for chr22 on a single node |
---|---|---|
90X(stLFR) + 90X (10x linked-reads) | 200GB | 11:39:50 |
Aquila_stLFR/bin/Aquila_stLFR_step2.py --out_dir Assembly_results_hybrid --num_threads 30 --reference Aquila_stLFR/source/ref.fa
--reference: "Aquila_stLFR/source/ref.fa" is the reference fasta file you can download by "./install".
--out_dir: default = ./Asssembly_results, make sure it's the same as "--out_dir" from step1 if you want to define your own output directory name.
--num_threads: default = 20
--block_len_use: default = 100000 (100kb)
--chr_start --chr_end: if you only want to assembly some chromosomes or only one chromosome.