tskit-dev / tskit

Population-scale genomics
MIT License
157 stars 73 forks source link

Chromosome data in VCF #2728

Open MleGhis opened 1 year ago

MleGhis commented 1 year ago

Hi, My SLiM simulations include multiple chromosomes (recombination rates are set at 0.5 between them, as described here ) and I'd like to create a VCF that reflects this structure. Sadly, the write_vcf method currently only accepts strings as input for the "contig_id" parameter, thus forcing every value in the CHROM column of the output VCF to be the same. I can easily create a list of the values I want in this column (since I know the end position of each chromosome), which makes it possible to parse the VCF myself, and I guess that creating a VCF for each chromosome and fusing them afterward would also be an option, but VCF files are rather heavy and many simulations are gonna run, meaning that I'd rather directly give write_vcf the informations needed to create the right file. Do you plan to give the possibility of creating VCF with multiple chromosomes in the future ? Thank you for reading, have a nice day.

benjeffery commented 1 year ago

Hi @MleGhis! Thanks for your suggestion, it is normal practice to have a single chromosome per tree sequence, so the API was initially designed to support this. But as the comment you linked suggests there may be reasons for doing this. It seems write_vcf could be adapted to take an array, but this doesn't seem ideal as the contig id should come from site metadata ideally. However, it might make sense as a solution for now as we don't have full multi-chrom support.

Would you like to prepare a PR of how this would work for the community to review? There is a development guide in the docs.

MleGhis commented 1 year ago

Thanks @benjeffery for your prompt reply. Unfortunately, as an intern, I might not have the time to do this right now, and I'll use another (unoptimised) solution for now. But I'll keep that in mind, and I might try later.

jeromekelleher commented 1 year ago

There's a few ways we could enable this I think:

  1. Add options to specify the coordinate ranges required for output in the VCF, and an option to suppress header generation. You could then write the different chunks to the same file, efficiently. I guess there might be some trouble with exact VCF conformance, if you want the contigs to be defined in the header, though.
  2. We add support for contig_id being some kind of mapping from interval -> string. I guess we should follow the conventions in RateMap to do this. Then we could put in the header in the correct form.