ohsu-comp-bio / vrs_anvil_toolkit

Extract clinical variant interpretations from VCF using GA4GH VRS IDs
MIT License

Profile VRS IDs #2

Closed bwalsh closed 7 months ago

bwalsh commented 9 months ago

install vcftools

Determine the best way to install vcftools. The steps below compile it from source; is there a pre-built binary executable?

  # fetch and build from source
  git clone https://github.com/vcftools/vcftools.git
  cd vcftools/
  ./autogen.sh
  ./configure
  make
  make install
  # copy the compiled binary into the home directory and run it from there
  cd src/cpp
  /usr/bin/install -c vcftools ~
  ~/vcftools

find a large, publicly accessible VCF file

TODO - check with team (jordan, nasim, briank) for an appropriate example

split the vcf by chromosomes

seq 1 22 | xargs -P 0 -I PATH ~/vcftools --recode --vcf MY-BIG-VCF-FILE.vcf --chr PATH --out chrPATH
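
Before launching 22 vcftools processes at once, it can help to preview what xargs will actually run. The sketch below prints the commands instead of executing them. The function name `print_split_commands` and the `JOBS` variable are hypothetical additions for illustration, and `{}` stands in for the `PATH` placeholder used above (the substitution works the same way). With GNU xargs, `-P 0` means "as many processes as possible", so a fixed job count is used here to keep the load bounded.

```shell
#!/usr/bin/env bash
# Dry run: print (do not execute) one vcftools split command per autosome.
# JOBS is a hypothetical knob, not part of the original command.
JOBS=${JOBS:-4}

print_split_commands() {
    local vcf=$1
    # -I {} substitutes each chromosome number wherever {} appears,
    # including inside a longer argument such as chr{}.
    seq 1 22 | xargs -P "$JOBS" -I {} \
        echo ~/vcftools --recode --vcf "$vcf" --chr {} --out chr{}
}
```

Calling `print_split_commands MY-BIG-VCF-FILE.vcf` lists the 22 commands (in no guaranteed order, since they run in parallel); removing the `echo` would run them for real.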

normalize VRS

ls -1 *.recode.vcf |  xargs -P 0 -I PATH  python3 -m ga4gh.vrs.extras.vcf_annotation --vcf_in PATH --vcf_out PATH.annotated.vcf.gz --vrs_pickle_out PATH.vrs_objects.pkl --assembly GRCh37  --seqrepo_root_dir ~/seqrepo/2021-01-29

benchmark overall throughput, log memory utilization and cpu

See htop
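
htop gives a live view, but a number that can be written down is easier to compare across runs. As a rough sketch, overall throughput could be estimated as VCF records processed per second of wall-clock time. The helper names `count_records` and `benchmark` are hypothetical, and the input is assumed to be an uncompressed VCF.

```shell
#!/usr/bin/env bash
# Count data records (non-header lines) in an uncompressed VCF.
count_records() {
    grep -vc '^#' "$1"
}

# Run a command and report wall-clock seconds and records per second.
# Usage: benchmark VCF_FILE COMMAND [ARGS...]
benchmark() {
    local vcf=$1; shift
    local records start end elapsed
    records=$(count_records "$vcf")
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    elapsed=$(( end - start ))
    # Runs shorter than one second are rounded up to avoid dividing by zero.
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo "records=$records elapsed_s=$elapsed records_per_s=$(( records / elapsed ))"
}
```

For example, `benchmark chr1.recode.vcf python3 -m ga4gh.vrs.extras.vcf_annotation ...` would wrap a single annotation run. For peak memory, prefixing the command with `/usr/bin/time -v` (GNU) or `/usr/bin/time -l` (macOS) reports the maximum resident set size.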

quinnwai commented 9 months ago

First-Pass Design

  1. Extend the setup.sh script to download and install vcftools from a tarball (avoids the need for GitHub authorization). It could look something like this:
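cd ~
curl -LJO https://github.com/vcftools/vcftools/tarball/master

# pick the most recently downloaded tarball, unpack it, then delete it
VCF_TOOLS_TAR=$(ls -1t vcftools*.tar.gz | head -n 1)
tar -xzvf "$VCF_TOOLS_TAR"
rm "$VCF_TOOLS_TAR"

# trailing slash restricts the match to directories (skips a ~/vcftools binary)
VCF_TOOLS_DIR=$(ls -1td vcftools*/ | head -n 1)
cd "$VCF_TOOLS_DIR"

# build and install
./autogen.sh
./configure
make
sudo make install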
  2. Process the VCF by splitting it into chunks and then merging them into one annotated VCF, using the code above. As shown below, this maximizes parallelism and makes use of vcftools:
# code to load .env file here
seq 1 22 | xargs -P 0 -I PATH ~/vcftools --recode --vcf MY-BIG-VCF-FILE.vcf --chr PATH --out chrPATH
ls -1 *.recode.vcf |  xargs -P 0 -I PATH  python3 -m ga4gh.vrs.extras.vcf_annotation --vcf_in PATH --vcf_out PATH.annotated.vcf.gz --vrs_pickle_out PATH.vrs_objects.pkl --assembly GRCh37  --seqrepo_root_dir $SEQREPO_ROOT_DIR
ls -1 *.annotated.vcf.gz | xargs vcf-concat > merged_output.vcf
  3. For performance testing:
     a. Get htop and Activity Monitor started.
     b. Run the above process X times (1 to start, more as I gain confidence).
     c. In Activity Monitor, ensure the computer is maxed out across %CPU, %memory, shared memory, and # threads. It may be possible to filter by a specific process ID. Play around.
     d. In htop, track some of the heavy CPU and memory processes, looking at if CPU
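
One detail of the merge in step 2 may be worth a sketch: `ls -1` sorts names lexically, so `chr10` comes before `chr2` and the merged VCF would not be in chromosome order. A minimal way to order the files numerically first, assuming the `chrN.recode.vcf.annotated.vcf.gz` names produced by the split and annotate commands, is shown below. The function names are hypothetical, and whether downstream tools need chromosome order is an assumption.

```shell
#!/usr/bin/env bash
# List annotated per-chromosome files in numeric chromosome order.
# Assumes file names start with "chr" followed by the chromosome number.
ordered_annotated_files() {
    # -k1.4n: sort numerically, starting at the 4th character (after "chr").
    ls -1 chr*.annotated.vcf.gz | sort -k1.4n
}

# Hypothetical wrapper around the merge step: concatenate in that order.
# vcf-concat is the Perl tool shipped with vcftools, as used above.
merge_annotated() {
    ordered_annotated_files | xargs vcf-concat > "${1:-merged_output.vcf}"
}
```

Running `ordered_annotated_files` in the output directory shows the order that `merge_annotated` would pass to vcf-concat.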

Questions

@bwalsh I'll start implementing this now and will readily change tack if and when you have feedback.