Support for multiple charsets

KashingLiu commented 3 years ago

Feature request description

In our repositories, some files are encoded in 'utf-8', but others are encoded in 'gb2312'. So when I open a 'gb2312' file in Sourcegraph, it may show garbled characters. Will you support different charsets?

/cc @sourcegraph/search-platform

camdencheek commented 3 years ago

Thanks for the request! There really isn't a good way to know the encoding of a file without it being explicitly set, but I wonder if we could respect any encodings set in the .gitattributes file.
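To illustrate why detection without an explicit declaration is only a guess, here is a minimal sketch using Go's standard `unicode/utf8` package. The helper name `looksLikeUTF8` and the sample bytes are made up for illustration and are not part of Sourcegraph. The check can rule UTF-8 out, but it cannot say which other encoding a file uses, and pure-ASCII content passes even if the file was saved as gb2312.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksLikeUTF8 is a hypothetical helper: it reports whether the bytes
// form valid UTF-8. A true result does not prove the file was written
// as UTF-8, and a false result does not identify the actual encoding.
func looksLikeUTF8(content []byte) bool {
	return utf8.Valid(content)
}

func main() {
	utf8Bytes := []byte("中文")                    // UTF-8 encoded text
	gb2312Bytes := []byte{0xD6, 0xD0, 0xCE, 0xC4} // the same text in gb2312
	asciiBytes := []byte("hello")                 // identical in both encodings

	fmt.Println(looksLikeUTF8(utf8Bytes))   // valid UTF-8
	fmt.Println(looksLikeUTF8(gb2312Bytes)) // rejected as UTF-8
	fmt.Println(looksLikeUTF8(asciiBytes))  // valid, but ambiguous
}
```

An explicit declaration, such as an attribute in .gitattributes, would remove the guesswork for the ambiguous cases.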

This repo contains a good sampling of files with different encodings.

cc @sourcegraph/search-core

keegancsmith commented 3 years ago

This requires a bit of thought, but my initial reaction is it is quite difficult for us to support just in zoekt, let alone the rest of our stack which works with text. The main issue is around conversion between "byte offsets" and "rune offsets".

To think a bit further about just zoekt, the way it separates runes vs bytes should translate across encodings (assuming it understands the encoding both at indexing time and when building the document match tree).

If someone wanted to explore implementing this, getting it to work in Zoekt is likely the best place to do a proof that this is viable.

sourcegraph / sourcegraph-public-snapshot

Support for multiple charsets #24136
