String Encoding

zorgiepoo commented 10 years ago

VCF should either have a header option for choosing a specific string encoding (e.g., UTF-8, Latin-1, ASCII), with a default set, or the specification should document which encoding VCF files must use.

pd3 commented 10 years ago

Not so long ago there was a discussion about this on the vcftools-spec mailing list. This proposal by Eugene Clark is likely to appear in the VCF specification:

In order to address the need to represent non-ASCII characters in INFO field values, VCF files are assumed to be encoded in UTF-8 unless a "##fileencoding=NNN" header is present. To support stream-based processing of VCF files, this header must immediately follow the version header. Because US-ASCII is a subset of UTF-8, this should be fully backwards compatible.
Characters reserved as structure delimiters must be encoded using %NN when appearing in content. This would apply to ALL content fields (INFO values, metadata header descriptions, variant IDs, etc.). The reserved characters are therefore: newline (\n), carriage return (\r), tab (\t), hash (#), greater than (>), less than (<), equals (=), semicolon (;), comma (,), percent sign (%).
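
For illustration, here is a minimal Python sketch of the two rules in the proposal as quoted above. It is not taken from any VCF library, and the text that eventually landed in the specification may differ. The function names (`detect_encoding`, `encode_value`, `decode_value`) are hypothetical, and reading `%NN` as two hexadecimal digits of the character code is an assumption.

```python
import re

# Reserved characters exactly as listed in the quoted proposal.
RESERVED = "\n\r\t#><=;,%"

# Assumption: %NN means the two-digit hexadecimal character code.
_ENCODE = {ch: f"%{ord(ch):02X}" for ch in RESERVED}
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def detect_encoding(first_lines):
    """Return the declared encoding, or UTF-8 when no header is present.

    Per the proposal, "##fileencoding=" is only honoured when it
    immediately follows the version header, i.e. on the second line.
    """
    lines = list(first_lines)[:2]
    prefix = "##fileencoding="
    if len(lines) == 2 and lines[1].startswith(prefix):
        return lines[1].rstrip("\r\n")[len(prefix):]
    return "utf-8"


def encode_value(text):
    """Percent-encode reserved delimiter characters in a content field."""
    return "".join(_ENCODE.get(ch, ch) for ch in text)


def decode_value(text):
    """Reverse encode_value; escapes for non-reserved characters are left alone."""
    def replace(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in RESERVED else match.group(0)
    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(encode_value("a=b;c"))   # a%3Db%3Bc
    print(decode_value("100%25"))  # 100%
```

Because the percent sign is itself reserved, a literal `%` in content is always written as `%25`, so a single left-to-right decoding pass round-trips any value without ambiguity.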

pd3 commented 9 years ago

Addressed by https://github.com/samtools/hts-specs/commit/a09e56feac09c831a490d97b423aac5a78960650

samtools / hts-specs

String Encoding #18