Closed bwalsh closed 7 months ago
First-Pass Design
setup.sh
A script to download and install vcftools from a tarball, which avoids the need for GitHub authorization. It could look something like the following:
cd ~
curl -LJO https://github.com/vcftools/vcftools/tarball/master
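# pick the most recently downloaded tarball, unpack it, then remove it
VCF_TOOLS_TAR=$(ls -1t vcftools*.tar.gz | head -n 1)
tar -xzvf "$VCF_TOOLS_TAR"
rm "$VCF_TOOLS_TAR"
# enter the most recently created source directory
VCF_TOOLS_DIR=$(ls -1td vcftools*/ | head -n 1)
cd "$VCF_TOOLS_DIR"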
./autogen.sh
./configure
make
sudo make install
# code to load .env file here
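# split the input VCF into one file per autosome (chr1.recode.vcf ... chr22.recode.vcf), in parallel
# vcftools is called from the PATH because `sudo make install` installs it system-wide
seq 1 22 | xargs -P 0 -I {} vcftools --recode --vcf MY-BIG-VCF-FILE.vcf --chr {} --out chr{}
# annotate each per-chromosome VCF with VRS identifiers, in parallel
ls -1 *.recode.vcf | xargs -P 0 -I {} python3 -m ga4gh.vrs.extras.vcf_annotation --vcf_in {} --vcf_out {}.annotated.vcf.gz --vrs_pickle_out {}.vrs_objects.pkl --assembly GRCh37 --seqrepo_root_dir "$SEQREPO_ROOT_DIR"
# concatenate the annotated per-chromosome files into a single VCF
ls -1 *.annotated.vcf.gz | xargs vcf-concat > merged_output.vcf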
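The `# code to load .env file here` placeholder in the script could be filled in along these lines. This is only a sketch: the `load_env` function name is hypothetical, and it assumes the `.env` file is trusted and contains plain `KEY=VALUE` shell assignments (for example `SEQREPO_ROOT_DIR=...`).

```shell
# Hypothetical helper: export every variable assigned in a .env file.
# Assumes the file is trusted, since it is sourced as shell code.
load_env() {
    env_file="${1:-.env}"
    # `.` searches PATH for names without a slash, so force a relative path
    case "$env_file" in
        */*) ;;
        *) env_file="./$env_file" ;;
    esac
    if [ ! -f "$env_file" ]; then
        echo "missing env file: $env_file" >&2
        return 1
    fi
    set -a          # auto-export variables assigned while sourcing
    . "$env_file"
    set +a          # restore normal assignment behavior
}
```

Calling `load_env .env` before the annotation step would make `$SEQREPO_ROOT_DIR` available to the `python3` processes started by `xargs`.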
Questions
home/groups/EllrottLab/rc7-Zach/bmeg-etl/output-normalize/allele.annotated.vcf
@bwalsh I'll be working to implement this now and will readily switch tack if and when you have feedback.
install vcftools
Determine the best way to install vcftools. The setup.sh steps compile it from source; is there a pre-built binary executable?
find a large, publicly accessible VCF file
TODO - check with team (jordan, nasim, briank) for an appropriate example
split the vcf by chromosomes
normalize VRS
benchmark overall throughput; log memory and CPU utilization
See htop
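For the benchmarking item, htop is interactive, so a logged alternative may be useful. The following is a minimal sketch, not a settled approach: the `monitor` function and the `MONITOR_INTERVAL` variable are hypothetical names, and it samples only the top-level process, so child processes started by `xargs` are not counted.

```shell
# Hypothetical helper: run a command in the background and log its
# resident memory (KiB) and %CPU once per interval until it exits.
monitor() {
    log_file="$1"
    shift
    "$@" &
    pid=$!
    echo "epoch_seconds rss_kib pcpu" > "$log_file"
    while kill -0 "$pid" 2>/dev/null; do
        # ps prints nothing (and fails) once the process is gone
        sample=$(ps -o rss=,pcpu= -p "$pid" 2>/dev/null) || break
        if [ -n "$sample" ]; then
            echo "$(date +%s) $sample" >> "$log_file"
        fi
        sleep "${MONITOR_INTERVAL:-1}"
    done
    # propagate the exit status of the monitored command
    wait "$pid"
}
```

For example, `monitor annotate.log python3 -m ga4gh.vrs.extras.vcf_annotation ...` would record one sample per second for a single annotation run. Dividing the number of input variants by the elapsed time between the first and last samples gives a rough throughput figure.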