pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License
368 stars 41 forks source link

force reference output in VCF #241

Closed colindaven closed 1 year ago

colindaven commented 1 year ago

Hi, I really, really like the VCF output here.

I guess the VCF comes from VCF decompose after looking through the tutorial, so should I rather post this question there?

Anyway, I have 3 Arabidopsis public genomes and want to find variant positions (ideally SNPs) where all 3 are different. eg one has A, one has C, one has T. This is relevant for triploid experiments I'll be starting soon.

The last "genotype" columns with 1 and 0 seem to be useful for this, yet the reference is more implied here. Is there a way to force the reference genotype as an output column as well ? ie so 3 columns result, such as 0 1 0 , not just 0 1 as at present.

I am used to multisample SNP calling where you get one genotype for each BAM file, all versus the reference (eg with Freebayes).

Thanks, Colin

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  AT_C24_lq-v1    AT_Landsberg_hq-v1
AT_Col-0_hq-v1.1.chr5   15      >292>295        T       G       60      .       AC=1;AF=0.5;AN=2;AT=>292>294>295,>292>293>295;NS=2;LV=0 GT      1       0
AT_Col-0_hq-v1.1.chr5   26      >295>298        A       T       60      .       AC=2;AF=1;AN=2;AT=>295>296>298,>295>297>298;NS=2;LV=0   GT      1       1
AT_Col-0_hq-v1.1.chr5   82      >298>301        T       G       60      .       AC=1;AF=0.5;AN=2;AT=>298>299>301,>298>300>301;NS=2;LV=0 GT      1       0
AT_Col-0_hq-v1.1.chr5   83      >301>303        G       GC      60      .       AC=1;AF=0.5;AN=2;AT=>301>303,>301>302>303;NS=2;LV=0     GT      1       0
AndreaGuarracino commented 1 year ago

Hi @colindaven, if I am getting your question, you are asking to put as a sample column the sample used in the VCF itself as the reference. In a VCF file format, variants are expressed with respect to a reference, so the reference genotypes of AT_Col-0_hq-v1.1.chr5 are always 0. Do you want a column with all zeros? Maybe I misunderstood your question.

colindaven commented 1 year ago

Dear @AndreaGuarracino , thanks for your input.

I have now found SNPs where the reference is different to the two samples, which also display variation, eg using

grep -P "2\t1" *.vcf | less -S

or the reverse "1\t2".

Thanks for your assistance here, I now understand the subtle difference between a PGGB VCF and a multisample + ref traditional VCF.