samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

Conversion from standard VCF to Local-alleles VCF #2105

Open jkbonfield opened 7 months ago

jkbonfield commented 7 months ago

This is a request for improved documentation.

Doing a hunt through the documentation I find local alleles mentioned in one section only:

$ egrep -n -i "local.*allele" ../bcftools/doc/*.txt
2008:*-L, --local-alleles* 'INT'::
2012:    and alternate alleles. The *-L, --local-alleles* option allows replacement of such tags
2013:    with a localized tag (FORMAT/LPL) which only includes a subset of alternate alleles relevant

This is the -L option of bcftools merge. The usage for this claims:

    -L, --local-alleles INT           If more than INT alt alleles are encountered, drop FMT/PL and output LAA+LPL instead; 0=unlimited [0]

It's unclear how we can use this as a conversion tool for an existing file. I can see two possible routes.

  1. Take an existing many-sample VCF and split into potentially thousands of per-sample files, and then merge them back together again. This seems excessive and probably extremely time consuming. Plus unlike samtools, there's no bcftools split subcommand, so I assume this would be many passes through the data with bcftools view -S which feels like it'll end up be O(N^2) complexity in the number of samples. Or am I missing a split command under a different name? [Edit: yes - see below. I found a plugin later on]

  2. Merging the existing VCF with an empty VCF. I'm unsure of how to go about this.

I tried method 2 initially as it seemed the most viable, but immediately hit an error.

$ echo '##fileformat=VCFv4.2' > _
$ bcftools merge -L 0 /nfs/users/nfs_j/jkb/scratch/data/gnomad-g1k.10k.vcf _
Error: "--local-alleles 0" makes no sense, expected value bigger or equal than 1

Yet the usage statement claims 0 means unlimited. (And confusingly implies it's the default via [0], although it has no default as it's a required argument.)

Not deterred, I soldiered on and took it at its word and went with 1.

[Incidentally - a feature request. Please permit tools to work on uncompressed data when we're reading the entire file. It's tedious to have to compress and index everything when I know everything is in the same order already.]

I also hit problems with not being sure what a valid empty VCF file looks like. Is there a command to take the header from an existing VCF and strip off all the samples, so we get an empty one? I did it with egrep and echoes in the end.

After compressing and indexing that, my merge now produces LAA in the header, but as far as I can see it's totally absent in the output file.

So am I correct in assuming now that it's not possible by my guess number 2 above, and the only route to producing LAA is a full sample by sample split and then a remerge back together again?

The back story to all of this is I just want to be able to produce a baseline from an existing merged file that's not using LAA to see how much better it can get, before looking into other formats and other tools. Help would be appreciated, and I'm sure it's not just beneficial to me (hence why putting the issue here rather than direct email).

jkbonfield commented 7 months ago

Ah sorry, I just found bcftools plugin split, so that answers the "how to split" question.

It's still not clear if this is the only way to convert from normal to local-allele though.

pd3 commented 7 months ago

It's unclear how we can use this as a conversion tool for an existing file. I can see two possible routes.

1. Take an existing many-sample VCF and split into potentially thousands of per-sample files, and then merge them back together again.  This seems excessive and probably extremely time consuming.  Plus unlike samtools, there's no `bcftools split` subcommand, so I assume this would be many passes through the data with `bcftools view -S` which feels like it'll end up be O(N^2) complexity in the number of samples.  Or am I missing a split command under a different name?

2. Merging the existing VCF with an empty VCF.  I'm unsure of how to go about this.

Currently we have no command to convert between PL and LPL. The intention was to provide the functionality where it is actually needed and avoid producing such monster sites, which happens at the merge step.

Is the purpose of your experimentation just testing or does it come from a real world scenario? In any case, a dedicated program, or maybe an option in view or convert would be the way to go about it. Splitting would be a very heavy solution, given this is usually an issue in files with tens of thousands samples.

jkbonfield commented 7 months ago

Well it's chicken and egg.

I don't know of any publically available many-sample data files using LPL. I do know of ones using normal PL.

I agree the answer is not to start from here, but here is where I'm at! If you know of public data with LPL then feel free to point me towards it. Thanks.

I'm testing the split method anyway as this is really just an experiment to see how this works rather than something for a pipeline. One feature request is +split needs a --threads option. It's dominated by bgzf compression time.