"Extracting region of interest from input file" causing a long delay

oyhel commented 5 years ago

I need to plot several regions using LocusZoom. For some regions plotting is very fast, only taking a few seconds. For others I have experienced waiting for 30 mins+ on "Extracting region of interest (chr2:25355105-25395105) from input file..". This region is just an example, this happens for several other regions as well. I have tried to figure out what it is doing during this long pause, but even adding DEBUG=True in the m2zfast.py does not provide any more information on what is going on here. Any help debugging this would be very helpful.

welchr commented 5 years ago

For regular tab-delimited input, most of the time is spent iterating over the file, and translating rsID to chr/pos. The SQLite lookups for each variant are expensive in very large files. We also can't assume the file is sorted, so every variant needs to be inspected. This of course wasn't a problem with GWAS sizes in 2009, but it's surely a problem today.

LocusZoom also supports the output format of EPACTS, which uses tabix-indexed files. We can take advantage of the tabix index to immediately subset the region needed for plotting, which reduces the work considerably.

There's nothing really special about the EPACTS format, and you could probably just tabix-index your existing file and tell LocusZoom which columns to look for in your file.

Relevant options:

  --epacts <string>
    EPACTS results file.

  --epacts-chr-col <string>
    Name of chrom column for EPACTS file. Defaults to #CHROM.

  --epacts-beg-col <string>
    Name of begin column for EPACTS file. Defaults to BEGIN.

  --epacts-end-col <string>
    Name of end column for EPACTS file. Defaults to END.

  --epacts-pval-col <string>
    Name of pvalue column for EPACTS file. Defaults to PVALUE.

So you could try:

# bgzip your file first
bgzip <your result file>

# Tabix index your result file
# -S 1 means skip first line in file (header)
tabix -s <column number of chromosome> -b <column number of start position> -e <column number of end position> -S 1 <your result file>.gz

# Run LocusZoom
# Note: <your result file>.tbi must exist from previous step
locuszoom --epacts <your result file>.gz --epacts-beg-col "CHROM" --epacts-beg-col "START" ...

Edit: updated to include bgzip.

oyhel commented 5 years ago

Thank you very much for the detailed reply!

I ended up converting my results to EPACTS format (I only needed columns: chrom, start, end, marker ID, and p-value), bgzipped and tabix indexed as you proposed. With this approach the extraction step finished almost instantly.

However, calculating LD using my own VCFs turned out to be the next bottleneck. This was partly because I only needed a subset of samples for calculating LD, but also because I had to plot the same region for several phenotypes and LocusZoom did not seem to cache the LD calculation when using my own VCF.

The solution I ended up with was to calculate LD using Plink upfront...

plink --bfile $bedstem --r2 inter-chr gz --ld-window-r2 0 --keep $samplelist --chr $CHR --from-bp $FROM --to-bp $TO --ld-snps $SNP --out outfile

...and then convert the output to user-supplied LD format... zcat outfile.ld.gz | tail -n+2 | awk 'BEGIN {print "snp1 snp2 dprime rsquare"} {print $6, $3, "0", $7}' > ldfile

...that I passed to Locuszoom using --ld. This feels a bit hacky, but it turned out to be quite memory efficient and easy to run in parallel which was a necessity for me.

statgen / locuszoom-standalone

"Extracting region of interest from input file" causing a long delay #8