vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0

Export genotype() and samples() in vcf output. #35

Closed BoPeng closed 7 years ago

BoPeng commented 7 years ago

I am storing many variants from different samples in a vtools project. I now need to export these variants in VCF format, with a field in the INFO column that carries the genotype information per sample. With vtools output I can use the functions genotype(,'missing=.') and samples(), which give me exactly what I want. However, the result is not in the VCF-specific variant format (in which insertions and deletions are represented with an additional reference base).

How can I use these functions with vtools export to get VCF format? Or is there any way to produce VCF format with vtools output?
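
For readers unfamiliar with the difference mentioned above, here is a minimal Python sketch of how a variant with an empty allele could be rewritten with the extra reference base that VCF expects. It is an illustration only, not vtools code: the function name `to_vcf_alleles`, the use of `-` for an empty allele, and the convention that `pos` is the 1-based position of the first affected base are all assumptions, and the preceding base has to be looked up in the reference genome by the caller.

```python
def to_vcf_alleles(pos, ref, alt, prev_base):
    """Pad an indel with the preceding reference base, VCF style.

    Assumptions (hypothetical, for illustration only):
      - an empty allele is written as '-'
      - pos is the 1-based position of the first affected base
      - prev_base is the reference base at pos - 1, supplied by the caller
    """
    if ref != '-' and alt != '-':
        # Substitutions already have both alleles; no padding is needed.
        return pos, ref, alt
    ref_seq = '' if ref == '-' else ref
    alt_seq = '' if alt == '-' else alt
    # Shift the position back by one and prepend the shared base to both alleles.
    return pos - 1, prev_base + ref_seq, prev_base + alt_seq


insertion = to_vcf_alleles(100, '-', 'AT', 'G')
deletion = to_vcf_alleles(100, 'AT', '-', 'G')
snv = to_vcf_alleles(533, 'G', 'C', 'A')
print(insertion, deletion, snv)
```

vtools export with --format vcf is the supported way to get this representation; the sketch only shows what the padding amounts to.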

BoPeng commented 7 years ago

With

vtools init test
vtools admin --load_snapshot vt_testData
vtools import CEU.vcf.gz --build hg18 --var_info DP
vtools output variant chr pos ref alt 'genotype()' 'samples()' -l5

you can do

$ vtools update variant --set 'sample_names=samples()'
$ vtools output variant chr pos ref alt sample_names -l5
1   533     G   C   NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   41342   T   A   NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   41791   G   A   NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   44449   T   C   NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   44539   C   T   NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874

Note that you cannot do samples=samples() because the name is reserved.

Now, you can export the field as usual

$ vtools export variant --format vcf --var_info sample_names | head -5
Writing:   0.0% [>                                                                                                                                     ]  in 00:00:00
1   533     .   G   C   .   PASS    NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   41342   .   T   A   .   PASS    NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   41791   .   G   A   .   PASS    NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   44449   .   T   C   .   PASS    NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874
1   44539   .   C   T   .   PASS    NA06985,NA06986,NA06994,NA07000,NA07037,NA07051,NA07346,NA07347,NA07357,NA10847,NA10851,NA11829,NA11830,NA11831,NA11832,NA11840,NA11881,NA11894,NA11918,NA11919,NA11920,NA11931,NA11992,NA11993,NA11994,NA11995,NA12003,NA12004,NA12005,NA12006,NA12043,NA12044,NA12045,NA12144,NA12154,NA12155,NA12156,NA12234,NA12249,NA12287,NA12414,NA12489,NA12716,NA12717,NA12749,NA12750,NA12751,NA12760,NA12761,NA12762,NA12763,NA12776,NA12812,NA12813,NA12814,NA12815,NA12828,NA12872,NA12873,NA12874

but the INFO field does not have the sample_names= key in front of the value. To export the field properly, you would have to define a customized vcf format as follows:

  1. Copy ~/.varianttools/fmt/vcf.fmt to myvcf.fmt
  2. Edit myvcf.fmt and add the following section
[sample_names]
index=0
type=VARCHAR(255)
fmt=lambda x: x.replace(',', '|'), InfoFormatter('SampleNames')
  3. Then you could do
    $ vtools export variant --format myvcf --var_info sample_names | head -5
    Writing:   0.0% [>                                                                                                                                     ]  in 00:00:00
    1   533     .   G   C   .   PASS    SampleNames=NA06985|NA06986|NA06994|NA07000|NA07037|NA07051|NA07346|NA07347|NA07357|NA10847|NA10851|NA11829|NA11830|NA11831|NA11832|NA11840|NA11881|NA11894|NA11918|NA11919|NA11920|NA11931|NA11992|NA11993|NA11994|NA11995|NA12003|NA12004|NA12005|NA12006|NA12043|NA12044|NA12045|NA12144|NA12154|NA12155|NA12156|NA12234|NA12249|NA12287|NA12414|NA12489|NA12716|NA12717|NA12749|NA12750|NA12751|NA12760|NA12761|NA12762|NA12763|NA12776|NA12812|NA12813|NA12814|NA12815|NA12828|NA12872|NA12873|NA12874
    1   41342   .   T   A   .   PASS    SampleNames=NA06985|NA06986|NA06994|NA07000|NA07037|NA07051|NA07346|NA07347|NA07357|NA10847|NA10851|NA11829|NA11830|NA11831|NA11832|NA11840|NA11881|NA11894|NA11918|NA11919|NA11920|NA11931|NA11992|NA11993|NA11994|NA11995|NA12003|NA12004|NA12005|NA12006|NA12043|NA12044|NA12045|NA12144|NA12154|NA12155|NA12156|NA12234|NA12249|NA12287|NA12414|NA12489|NA12716|NA12717|NA12749|NA12750|NA12751|NA12760|NA12761|NA12762|NA12763|NA12776|NA12812|NA12813|NA12814|NA12815|NA12828|NA12872|NA12873|NA12874
    1   41791   .   G   A   .   PASS    SampleNames=NA06985|NA06986|NA06994|NA07000|NA07037|NA07051|NA07346|NA07347|NA07357|NA10847|NA10851|NA11829|NA11830|NA11831|NA11832|NA11840|NA11881|NA11894|NA11918|NA11919|NA11920|NA11931|NA11992|NA11993|NA11994|NA11995|NA12003|NA12004|NA12005|NA12006|NA12043|NA12044|NA12045|NA12144|NA12154|NA12155|NA12156|NA12234|NA12249|NA12287|NA12414|NA12489|NA12716|NA12717|NA12749|NA12750|NA12751|NA12760|NA12761|NA12762|NA12763|NA12776|NA12812|NA12813|NA12814|NA12815|NA12828|NA12872|NA12873|NA12874
    1   44449   .   T   C   .   PASS    SampleNames=NA06985|NA06986|NA06994|NA07000|NA07037|NA07051|NA07346|NA07347|NA07357|NA10847|NA10851|NA11829|NA11830|NA11831|NA11832|NA11840|NA11881|NA11894|NA11918|NA11919|NA11920|NA11931|NA11992|NA11993|NA11994|NA11995|NA12003|NA12004|NA12005|NA12006|NA12043|NA12044|NA12045|NA12144|NA12154|NA12155|NA12156|NA12234|NA12249|NA12287|NA12414|NA12489|NA12716|NA12717|NA12749|NA12750|NA12751|NA12760|NA12761|NA12762|NA12763|NA12776|NA12812|NA12813|NA12814|NA12815|NA12828|NA12872|NA12873|NA12874
    1   44539   .   C   T   .   PASS    SampleNames=NA06985|NA06986|NA06994|NA07000|NA07037|NA07051|NA07346|NA07347|NA07357|NA10847|NA10851|NA11829|NA11830|NA11831|NA11832|NA11840|NA11881|NA11894|NA11918|NA11919|NA11920|NA11931|NA11992|NA11993|NA11994|NA11995|NA12003|NA12004|NA12005|NA12006|NA12043|NA12044|NA12045|NA12144|NA12154|NA12155|NA12156|NA12234|NA12249|NA12287|NA12414|NA12489|NA12716|NA12717|NA12749|NA12750|NA12751|NA12760|NA12761|NA12762|NA12763|NA12776|NA12812|NA12813|NA12814|NA12815|NA12828|NA12872|NA12873|NA12874

    Here I used a lambda function to replace each , with |, but you can remove the lambda function to keep the , separator (which is allowed in variant info).
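
To make the effect of the fmt line concrete, here is a small Python sketch of the transformation it produces on one value. The helper `format_info` is a hypothetical stand-in written for this illustration; it is not the InfoFormatter used by vtools, whose implementation is not shown in this thread. The three sample names are taken from the output above.

```python
def format_info(key, value, sep=None):
    """Hypothetical stand-in: build a KEY=VALUE entry for the INFO column.

    If sep is given, commas in the value are replaced by it first,
    mirroring the lambda x: x.replace(',', '|') step in myvcf.fmt.
    """
    if sep is not None:
        value = value.replace(',', sep)
    return f"{key}={value}"


names = "NA06985,NA06986,NA06994"

with_pipe = format_info("SampleNames", names, sep="|")
with_comma = format_info("SampleNames", names)
print(with_pipe)
print(with_comma)

# Reading the entry back: split off the key, then split on the separator.
key, _, value = with_pipe.partition("=")
samples = value.split("|")
print(key, samples)
```

Whichever separator you choose, whatever reads the exported file has to split on the same character.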