🚧 _needLR3.2 is currently a beta version and actively under construction 🚧
Please contact jgust1@uw.edu with issues or suggestions.
needLR is a command line tool that uses Jasmine merging to compare a query structural variant (SV) vcf to our collection of 1000 Genomes Project (1KGP) samples sequenced by Oxford Nanopore Technologies long-read sequencing (ONT LRS). The output is a .txt file with detailed annotations about the genomic context, OMIM phenotype association, and ancestry-specific allele frequencies of each of the SVs in the query vcf.
There are 3 key concepts that drive this project:
Thus, we need a comprehensive catalog of long-read sequencing (LRS)-based SV calls from healthy individuals in order to filter for rare potentially disease-causing SVs in clinical cases.
The Miller Lab is actively using Oxford Nanopore Technologies (ONT) LRS to sequence samples from the 1000 Genomes Project (1KGP). Our recent preprint describes the first 100 samples in the cohort. This version of needLR incorporates SV calls made by Sniffles2 v2.2 for 150 1KGP samples.
Gustafson JA, Gibson SB, Damaraju N, et al. Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation. Preprint. medRxiv. 2024;2024.03.05.24303792. Published 2024 Mar 7. doi:10.1101/2024.03.05.24303792
jasmine v1.1.5 1
bedtools v2.31.1 2
bcftools v1.19 3
1 Kirsche M, Prabhu G, Sherman R, et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat Methods. 2023;20(3):408-417. doi:10.1038/s41592-022-01753-3
2 Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-842. doi:10.1093/bioinformatics/btq033
3 Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. doi:10.1093/gigascience/giab008
wget https://s3.amazonaws.com/1000g-ont/needLR/needLR_local.tar.gz
[!NOTE] The zip file is ~3.2G, the unzipped directory is ~9.5G
tar -xvzf needLR_local.tar.gz
This will create a working directory called needLR_local. Everything needed to run needLR is inside.
cd needLR_local
This is what you should see:
backend_files/
needLR_3.2.sh
needLR_output/
EXAMPLE/
Flag | Description |
---|---|
-f | A .txt file that lists the full file path(s) to the query vcf(s) (The vcfs must be gzipped (*.vcf.gz) and have an index in the same directory as the vcf) |
-g | A fasta file for a reference genome (we use the hg38 reference recommended here) - This fasta is included in the backend_files download |
-t | The number of threads to use in the Jasmine merging step |
needLR_3.2.sh -f /file/path/to/list.txt -g /file/path/to/reference/genome.fa -t 20
As needLR is running, you will see the Jasmine output building - it will look like this:
Then it will look like this:
And eventually it will look like this:
(Sometimes this step can take quite a while)
Each query vcf will spawn an output subdirectory within the /needLR_local/needLR_output
directory.
The output files will be:
Output | Description |
---|---|
{SAMPLE_ID}_RESULTS.txt | Annotated query SVs (the main output) |
preprocessed_{SAMPLE_ID}.vcf | Preprocessed query vcf that is used in the Jasmine merge |
{SAMPLE_ID}_jasmine.vcf | Raw Jasmine output from the query sample + 1KGP sample merge |
[!NOTE] needLR generates many temporary files when running, this can add up to ~2G
[!NOTE] Jasmine will automatically make a seperate
/needLR_local/output
directory (this can be ignored)
The {SAMPLE_ID}_RESULTS.txt
file can easily be opened in Excel.
Be sure that Excel is set up to delimit columns by tab (and only tab)
These are the columns which can be sorted at will:
Column Name | Column Description |
---|---|
Chr | Query SV chromosome |
Start_Pos | Query SV start coordinate (hg38) |
End_Pos | Query SV end coordinate (hg38) |
SV_Length | Query SV length |
SV_Type | Query SV type |
1KGP_support | Samples in the 1000 Genomes Project samples that share the SV with the query sample (n=150) |
Genotype | Query SV genotype |
Alt_reads | Number of reads in the query sample supporting the SV |
Ref_reads | Number of reference reads in the query sample at the SV locus |
Total_reads | Total number of reads at the SV locus (in the query sample) |
Allele_Freq_ALL | Allele frequency of SV in 1KGP samples (n=150) |
Genes | Genes that the SV intersects with (canonical hg38 coordinates, per gencode) |
OMIM | OMIM phenotypes associated with any gene the SV intersects (OMIM 8/2023) |
Exonic | If the SV intersects with a canonical exon (hg38 coordinates, per gencode) |
Centromeric | If the SV intersects with a centromere (UCSC hg38) |
Pericentromeric | If the SV intersects with a pericentromeric region (+/-5Mb on either side of UCSC-defined centromere) |
Telomeric | If the SV intersects with a telomere (5Mb of either end of a chromosome) |
STR | If the SV intersects with a Short Tandem Repeat region (vamos original motifs, n=148) |
VNTR | If the SV intersects with a Variable Number Tandem Repeat region (vamos original motifs, individuals=148) |
Segdup | If the SV intersects with a segmental duplication (Genome in a Bottle v3.3) |
Repeat | If the SV intersects with a repeat region (UCSC hg38 repeat masker) |
Gap | If the SV intersects with an hg38 gap region (UCSC hg38 mapping and sequencing: gap) |
HiConf | If the SV is fully contained within a high confidence region (Genome in a Bottle T2TQ100-V1.0_stvar) |
Pop_Count_AFR | How many 1KGP AFR ancestry samples have the SV (n=51) |
Pop_Freq_AFR | Frequency (%) of 1KGP AFR ancestry samples with SV (n=51) |
Pop_Count_AMR | How many 1KGP AMR ancestry samples have the SV (n=18) |
Pop_Freq_AMR | Frequency (%) of 1KGP AMR ancestry samples with SV (n=18) |
Pop_Count_EAS | How many 1KGP EAS ancestry samples have the SV (n=25) |
Pop_Freq_EAS | Frequency (%) of 1KGP EAS ancestry samples with SV (n=25) |
Pop_Count_EUR | How many 1KGP EUR ancestry samples have the SV (n=24) |
Pop_Freq_EUR | Frequency (%) of 1KGP EUR ancestry samples with SV (n=24) |
Pop_Count_SAS | How many 1KGP SAS ancestry samples have the SV (n=32) |
Pop_Freq_SAS | Frequency (%) of 1KGP SAS ancestry samples with SV (n=32) |
Pop_Count_ALL | How many 1KGP SAS ancestry samples have the SV (n=150) |
Pop_Freq_ALL | Frequency (%) of 1KGP SAS ancestry samples with SV (n=150) |
Allele_Count_AFR | How many 1KGP AFR ancestry alleles have the SV (n=102) |
Allele_Freq_AFR | Frequency (%) of 1KGP AFR ancestry alleles with SV (n=102) |
Allele_Count_AMR | How many 1KGP AMR ancestry alleles have the SV (n=36) |
Allele_Freq_AMR | Frequency (%) of 1KGP AMR ancestry alleles with SV (n=36) |
Allele_Count_EAS | How many 1KGP EAS ancestry alleles have the SV (n=50) |
Allele_Freq_EAS | Frequency (%) of 1KGP EAS ancestry alleles with SV (n=50) |
Allele_Count_EUR | How many 1KGP EUR ancestry alleles have the SV (n=48) |
Allele_Freq_EUR | Frequency (%) of 1KGP EUR ancestry alleles with SV (n=48) |
Allele_Count_SAS | How many 1KGP SAS ancestry alleles have the SV (n=64) |
Allele_Freq_SAS | Frequency (%) of 1KGP SAS ancestry alleles with SV (n=64) |
Allele_Count_ALL | How many 1KGP SAS ancestry alleles have the SV (n=150) |
Allele_Freq_ALL | Frequency (%) of 1KGP SAS ancestry alleles with SV (n=150) |
GT_homWT | Number of 1KGP samples that are homozygous for the reference allele at the locus (n=150) |
GT_het | Number of 1KGP samples that are heterozygous for the SV allele at the locus (n=150) |
GT_homVAR | Number of 1KGP samples that are homozygous for the SV allele at the locus (n=150) |
HWE-p | Reference allele frequency in 1KGP samples (n=150) |
HWE-q | SV allele frequency in 1KGP samples (n=150) |
HWE | Is the SV in Hardy-Weinberg equilibrium in the 1KGP samples (n=150) |
Column headers are also defined in /needLR_local/needLR_output/column_key.txt
In the downloaded zip file there is a directory EXAMPLE/
which contains an example vcf and an example input_list.txt
.
If eveything is sut up correctly, you should be able to run the following command from inside the /needLR_local
directory
./needLR_3.2.sh -f EXAMPLE/input_list.txt -g /backend_files/hg38_fasta/hg38.no_alt.fa -t 20
and find the output here: needLR_local/needLR_output/EXAMPLE_HG01511_sniffles2_needLR_3.2
The results should match the example output available here in Excel format.
Considerations/shortcomings:
All of the allele frequencies are currently based on the number of autosomes in the 1KGP sample set, rendering allele frequencies (other than 0) for SVs on chrX and chrY inaccurate.
Some SVs are called and pass all filters but have a genotype of “./.” This can lead to a SV with an allele frequency of 0, but a population frequency of >0.
We ran needLR on sniffles2 vcfs from 8 individuals with pathogenic SVs that were unsolved by short-read sequencing.
needLR reduced the number of SVs per individual from an average of 22,583 total SVs to an average of 565 unique SVs:
Of the unique SVs, an average of 144 intersect with hg38 annotated genes:
Of the unique, genic SVs, even fewer intersect with canonical exons and/or are in genes associated with OMIM phenotypes:
Here is a "quick reference" annotated needLR output for just the first 150 1KGP samples (no query sample) - This has allele frequencies and genomic context info for each of the SVs seen in our first 150 samples.
Our goal is to have 500 1KGP samples sequenced and analyzed by the begining of 2025
For annotating inheritance and potential de novo SVs when trio data is available