Not working with VCFs that have multiple chromosomes.

bambrozio commented 4 years ago

I have the 1000 genome phase 3 VCF's concatenated in a single VCF. I can successfully read it with pyvcf, thus the file is consistent and healthy. When I try to downsample it by performing:

$ gunzip -c ../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | java -jar downsamplevcf.jar -N 1 -n 10000 > All.Phase3.10kSNPs.vcf
 ...
[INFO][DownSampleVcf]Count: 81,268,620 Elapsed: 57 minutes(90.29%) Remains: 6 minutes(9.71%) Last: 22:51,138,531
[INFO][DownSampleVcf]. Completed. N=81,271,745. That took:57 minutes

My result VCF contains only the chromosome 22.

I expected to get all the chromosomes, but with up to 10k SNPs in each of them. Am I using it wrongly, or is this a bug?

lindenb commented 4 years ago

Hi, yeah that's not the best tool for this as there is a buffer of -n 10000 variants and each time a new variant is read, its is replaced randomly in the buffer. So if there is more than 10000 variants for chr22, then it will be filled with chr22.

I agree the current doc is not clear about it.

You could try to use: http://lindenb.github.io/jvarkit/VCFShuffle.html and then pipe it into downsamplevcf.

bambrozio commented 4 years ago

Sorry, could you please provide a command line example? Actually I wanted to downsample to 100k SNPs, do you think this is possible? I'm giving a try on each SNP, generating a new VCF file for each of them. After that, I can concat the results. Do you think this will work? It's in progress here...

lindenb commented 4 years ago

with GNU tools:


gunzip -c  input.vcf.gz | grep '^#' > out.vcf

gunzip -c  input.vcf.gz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> out.vcf

bambrozio commented 4 years ago

I've given a try in a smaller VCF (405M), and I got two significant different outputs. Using the AWK, I got an output VCF of 9.7M, while using the Jar, I got an output of 97M... Any idea why of the difference?

Here's the commands I utilised:

Using the AWK:

$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | grep '^#' > chr10.vcf
$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.s
ubset.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t 
$'\t' -k1,1 -k2,2n -k4,4 >> chr10.vcf

Using the JAR:

gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | java -jar downsamplevcf.jar -N 1 -n 10000 > chr10UsingJar.vcf

input:

$ ls -lah ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz 
-rw-r--r--  1 bambrozi  staff   405M 18 Mar 22:47 /Users/bambrozi/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz

Outputs:

$ ls -lath chr10*
-rw-r--r--  1 bambrozi  staff    97M 19 Mar 19:26 chr10UsingJar.vcf
-rw-r--r--  1 bambrozi  staff   9.7M 19 Mar 19:15 chr10.vcf

bambrozio commented 4 years ago

Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output? If so, I can concat the output VCFs later (after the jar sampling)

lindenb commented 4 years ago

here you're taking 1000 variants. n | head -n 1000 | cut -f 2- | sort -T . -t

and here 10000 : -jar downsamplevcf.jar -N 1 -n 10000

Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output?

no, again, you shoud use the awk script or my tool vcfshuffle + vcfhead. It could be something like.

gunzip -c in.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 10000 > out.vcf

bambrozio commented 4 years ago

Hi, thanks for the help! Still not working though:

With the AWK, I got an error in the sort command sort: No such file or directory:

$ echo START: `date` \
> && gunzip -c  ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | grep '^#' > ALL.phase3.biallelic-only-10kSNP.vcf \
> && gunzip -c  ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 10000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> ALL.phase3.biallelic-only-10kSNP.vcf \
> echo START: `date`
START: Fri 20 Mar 2020 09:21:09 GMT
sort: No such file or directory

With the vcfshuffler.jar, I got:

$ gunzip -c ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 100000 > ALL.phase3.biallelic-only-100kSNP.vcf 
Error: Unable to access jarfile vcfshuflle.jar
[SEVERE][VcfHead]Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input V
CF file
        at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
        at htsjdk.variant.vcf.VCFIteratorBuilder$VCFReaderIterator.<init>(VCFIteratorBuilder.java:177)
        at htsjdk.variant.vcf.VCFIteratorBuilder.open(VCFIteratorBuilder.java:97)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorFromInputStream(VCFUtils.java:288)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorStdin(VCFUtils.java:335)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIterator(VCFUtils.java:312)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.openVCFIterator(Launcher.java:515)
        at com.github.lindenb.jvarkit.tools.misc.VcfHead.doWork(VcfHead.java:110)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:777)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:940)
        at com.github.lindenb.jvarkit.tools.misc.VcfHead.main(VcfHead.java:156)
[INFO][Launcher]vcfhead Exited with failure (-1)

I've tried with: bgz, gz and plain vcf.

I can read the VCF's normally using, for example, hail, pyvcf, plink... Thus, I'm assuming the VCF is healthy.

lindenb / jvarkit

Not working with VCFs that have multiple chromosomes. #155