bambrozio opened 4 years ago
Hi, yeah, that's not the best tool for this: it keeps a buffer of -n 10000 variants, and each time a new variant is read it randomly replaces one already in the buffer. So if there are more than 10000 variants on chr22, the buffer will end up filled with chr22.
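This bias is easy to demonstrate with a small awk sketch of the replacement strategy described above (an illustration of the described behavior, not downsamplevcf's actual code): a fixed-size buffer in which every incoming record overwrites a random slot keeps almost only records from the tail of the input, so a coordinate-sorted VCF ends up dominated by the last chromosome.

```shell
# Sketch (NOT downsamplevcf's real code): a buffer of k slots where
# every new record overwrites a random slot. With 100000 sorted input
# lines and k=100, virtually all survivors come from the tail of the
# stream, mirroring the "filled with chr22" behavior.
seq 1 100000 |
awk -v k=100 '
  { buf[int(rand() * k)] = $0 }   # overwrite a random slot every time
  END { for (i = 0; i < k; i++) if (i in buf) print buf[i] }' |
sort -n | head -n 1               # smallest surviving record
```

The smallest survivor is almost always within the last one or two percent of the input, which is exactly why a sorted multi-chromosome VCF yields mostly the final chromosome.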
I agree the current doc is not clear about it.
You could try to use: http://lindenb.github.io/jvarkit/VCFShuffle.html and then pipe it into downsamplevcf.
Sorry, could you please provide a command-line example? Actually I want to downsample to 100k SNPs; do you think that is possible? I'm giving it a try on each chromosome, generating a new VCF file for each, and after that I can concat the results. Do you think this will work? It's in progress here...
with GNU tools:
gunzip -c input.vcf.gz | grep '^#' > out.vcf
gunzip -c input.vcf.gz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> out.vcf
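The random-key trick in that awk pipeline can be sanity-checked on plain numbers (a toy stand-in for the VCF body): tag each line with a random key, sort by the key, keep the first few lines, then restore the original order.

```shell
# Toy version of the pipeline above: sample 5 of 1000 numbered lines
# uniformly at random, then re-sort the survivors into input order.
seq 1 1000 |
awk 'BEGIN{srand()} {printf("%d\t%s\n", int(rand()*100000), $0)}' |
sort -t "$(printf '\t')" -k1,1n |   # order by the random key
head -n 5 |                         # keep 5 random lines
cut -f 2- |                         # drop the key
sort -n                             # back to original order
```

Every run yields 5 lines drawn uniformly from the input; on a VCF the final sort would instead be by CHROM and POS, as in the command above.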
I've given it a try on a smaller VCF (405M), and I got two significantly different outputs: using the awk pipeline I got an output VCF of 9.7M, while using the jar I got an output of 97M... Any idea why the difference?
Here are the commands I used:
Using the AWK:
$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | grep '^#' > chr10.vcf
$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> chr10.vcf
Using the JAR:
gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | java -jar downsamplevcf.jar -N 1 -n 10000 > chr10UsingJar.vcf
input:
$ ls -lah ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz
-rw-r--r-- 1 bambrozi staff 405M 18 Mar 22:47 /Users/bambrozi/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz
Outputs:
$ ls -lath chr10*
-rw-r--r-- 1 bambrozi staff 97M 19 Mar 19:26 chr10UsingJar.vcf
-rw-r--r-- 1 bambrozi staff 9.7M 19 Mar 19:15 chr10.vcf
Do you think that if I use the jar over each chromosome (instead of the VCF that consolidates all of them) I will get a reliable output? If so, I can concat the output VCFs later (after the jar sampling).
here you're taking 1000 variants: ... | head -n 1000 | cut -f 2- | ...
and here 10000: java -jar downsamplevcf.jar -N 1 -n 10000
Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output?
No. Again, you should use the awk script, or my tools vcfshuffle + vcfhead. It could be something like:
gunzip -c in.vcf.bgz | java -jar vcfshuffle.jar | java -jar vcfhead.jar -n 10000 > out.vcf
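If GNU coreutils are available, shuf gives the same shuffle-then-head effect without any jar. A minimal sketch, assuming the .bgz is gzip-compatible (bgzf files are) and that coordinate order should be restored at the end; in.vcf.bgz and out.vcf are placeholder names:

```shell
#!/bin/sh
# Sketch using GNU coreutils only (assumes `shuf` is installed):
# keep the header, sample N body records at random, restore order.
N=10000
gunzip -c in.vcf.bgz | grep '^#' > out.vcf
gunzip -c in.vcf.bgz | grep -v '^#' |
  shuf -n "$N" |
  sort -t "$(printf '\t')" -k1,1 -k2,2n >> out.vcf
```

Note that shuf -n samples without replacement, so if the input has fewer than N records you simply get them all.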
Hi, thanks for the help! Still not working though:
With the awk pipeline, I got an error from the sort command (sort: No such file or directory):
$ echo START: `date` \
> && gunzip -c ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | grep '^#' > ALL.phase3.biallelic-only-10kSNP.vcf \
> && gunzip -c ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 10000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> ALL.phase3.biallelic-only-10kSNP.vcf \
> echo START: `date`
START: Fri 20 Mar 2020 09:21:09 GMT
sort: No such file or directory
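For what it's worth, that sort: No such file or directory most likely comes from the command itself rather than the data: the trailing backslash before the final echo joins it onto the previous line, so echo START: ... becomes file operands of the last sort, and sort tries to open a file literally named echo. A minimal reproduction and the fix:

```shell
#!/bin/sh
# Reproduction: the stray line continuation makes `echo END` become
# file operands of sort, which then fails ("No such file or directory").
printf '2\n1\n' | sort \
echo END 2>/dev/null || echo 'sort failed: it tried to open a file named "echo"'

# Fix: terminate the pipeline with && (or a plain newline) before echo.
printf '2\n1\n' | sort && echo END
```

In the original command the cure is the same: end the long pipeline without a trailing backslash, or join the closing echo with && instead.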
With the vcfshuffle jar, I got:
$ gunzip -c ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 100000 > ALL.phase3.biallelic-only-100kSNP.vcf
Error: Unable to access jarfile vcfshuflle.jar
[SEVERE][VcfHead]Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
at htsjdk.variant.vcf.VCFIteratorBuilder$VCFReaderIterator.<init>(VCFIteratorBuilder.java:177)
at htsjdk.variant.vcf.VCFIteratorBuilder.open(VCFIteratorBuilder.java:97)
at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorFromInputStream(VCFUtils.java:288)
at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorStdin(VCFUtils.java:335)
at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIterator(VCFUtils.java:312)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.openVCFIterator(Launcher.java:515)
at com.github.lindenb.jvarkit.tools.misc.VcfHead.doWork(VcfHead.java:110)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:777)
at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:940)
at com.github.lindenb.jvarkit.tools.misc.VcfHead.main(VcfHead.java:156)
[INFO][Launcher]vcfhead Exited with failure (-1)
I've tried with: bgz, gz and plain vcf.
I can read the VCFs normally using, for example, hail, pyvcf, plink... Thus, I'm assuming the VCF is healthy.
I have the 1000 Genomes phase 3 VCFs concatenated into a single VCF. I can successfully read it with pyvcf, so the file is consistent and healthy. When I try to downsample it by performing:
My result VCF contains only chromosome 22.
I expected to get all the chromosomes, but with up to 10k SNPs in each of them. Am I using it wrong, or is this a bug?