Open KashingLiu opened 3 years ago
Thanks for the request! There really isn't a good way to know the encoding of a file without it being explicitly set, but I wonder if we could respect any encodings set in the .gitattributes file.
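To make the .gitattributes idea concrete, here is a minimal sketch of pulling declared encodings out of such a file. It assumes the repository uses Git's `encoding` attribute (the display hint that GUI tools such as gitk read); whether that or another attribute would be the right signal is an open question. `encodingRule` and `parseEncodingRules` are hypothetical names, not existing Sourcegraph or Zoekt code, and the sketch does not match patterns against paths or handle quoted patterns.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// encodingRule pairs a .gitattributes path pattern with the encoding declared for it.
// Hypothetical type, for illustration only.
type encodingRule struct {
	Pattern  string
	Encoding string
}

// parseEncodingRules scans .gitattributes content and returns every
// "encoding=<name>" attribute it finds, together with the line's pattern.
// Assumption: patterns contain no whitespace (quoted patterns are not handled).
func parseEncodingRules(content string) []encodingRule {
	var rules []encodingRule
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and comments.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		// fields[0] is the pattern; the remaining fields are attributes.
		for _, attr := range fields[1:] {
			if strings.HasPrefix(attr, "encoding=") {
				rules = append(rules, encodingRule{
					Pattern:  fields[0],
					Encoding: strings.TrimPrefix(attr, "encoding="),
				})
			}
		}
	}
	return rules
}

func main() {
	attrs := "# legacy sources\n*.txt text encoding=gb2312\n*.go text\n"
	for _, r := range parseEncodingRules(attrs) {
		fmt.Printf("%s -> %s\n", r.Pattern, r.Encoding)
	}
}
```

Files with no matching rule would still need a fallback, most likely treating them as UTF-8 as happens today.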
This repo contains a good sampling of files with different encodings.
cc @sourcegraph/search-core
This requires a bit of thought, but my initial reaction is that it would be quite difficult for us to support even in Zoekt alone, let alone in the rest of our stack that works with text. The main issue is the conversion between "byte offsets" and "rune offsets".
To think a bit further about just Zoekt, the way it separates runes vs bytes should translate across encodings (assuming it understands the encoding both at indexing time and at document match tree time).
If someone wanted to explore implementing this, getting it to work in Zoekt is likely the best place to do a proof that this is viable.
Feature request description
In our repositories, some files are encoded in 'utf-8', but others are encoded in 'gb2312'. So when I open a file in Sourcegraph, it may show garbled characters. Will you support different charsets?
/cc @sourcegraph/search-platform