samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

String Encoding #18

Closed by zorgiepoo 9 years ago

zorgiepoo commented 10 years ago

VCF should either have a header option for choosing a specific string encoding (e.g., UTF-8, Latin-1, ASCII), with a default defined, or the specification should document which encoding VCF files must use.

pd3 commented 10 years ago

Not so long ago there was a discussion about this on the vcftools-spec mailing list. This proposal by Eugene Clark is likely to appear in the VCF specification:

  1. In order to address the need to represent non-ASCII characters in INFO field values, VCF files are assumed to be encoded in UTF-8 unless a "##fileencoding=NNN" header is present. To support stream-based processing of VCF files, this header must immediately follow the version header. Because US-ASCII is a subset of UTF-8, this should be fully backwards compatible.
  2. Characters reserved as structure delimiters must be encoded using %NN when appearing in content. This would apply to ALL content fields (INFO values, metadata header descriptions, variant IDs, etc.). The reserved characters are therefore: newline (\n), carriage return (\r), tab (\t), hash (#), greater than (>), less than (<), equals (=), semicolon (;), comma (,), percent sign (%).
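
As a rough illustration of the two points above, here is a minimal Python sketch. It assumes that NN means the two-digit hexadecimal code of the character, and that the "##fileencoding" header, if present, is the second line of the file. The function names (`detect_encoding`, `percent_encode`, `percent_decode`) are hypothetical and not part of any VCF library; the final wording in the specification may differ from this proposal.

```python
import re

# Reserved structure delimiters listed in point 2 of the proposal.
RESERVED = "\n\r\t#><=;,%"


def detect_encoding(header_lines, default="utf-8"):
    """Return the encoding named by a ##fileencoding header.

    Assumption: per point 1, the header must immediately follow the
    version header, so only the second line is inspected.
    """
    prefix = "##fileencoding="
    if len(header_lines) > 1 and header_lines[1].startswith(prefix):
        return header_lines[1][len(prefix):].strip()
    return default


def percent_encode(value):
    """Replace each reserved character with %NN (uppercase hex, assumed)."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in RESERVED else ch for ch in value
    )


def percent_decode(value):
    """Turn every %NN sequence back into the character it stands for."""
    return re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        value,
    )


if __name__ == "__main__":
    raw = "50% of samples; p<0.05"
    encoded = percent_encode(raw)
    print(encoded)
    print(percent_decode(encoded) == raw)
```

Because the percent sign is itself reserved, a literal "%" in content is always written as "%25", so decoding in a single pass recovers the original value. Non-ASCII characters are left untouched and rely on the file encoding from point 1.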

pd3 commented 9 years ago

Addressed by https://github.com/samtools/hts-specs/commit/a09e56feac09c831a490d97b423aac5a78960650