kuvshinovdr / srcstats

A small console program to gather and report C++ source files statistics.
MIT License

UTF #2

Open kuvshinovdr opened 3 months ago

kuvshinovdr commented 3 months ago

Detect UTF encodings and transcode the file contents into UTF-32, so that line lengths are counted at least in code points (and later in visible symbols) rather than in bytes. The following algorithm is to be used:

  1. Check for a BOM: if the file has a UTF-16LE/BE or UTF-8 BOM, try to transcode accordingly.
  2. If there are interleaved NULs, try to transcode as UTF-16LE/BE accordingly.
  3. If none of the above applies, try to transcode as UTF-8.
  4. If transcoding fails and the file has interleaved NULs, report it as a possibly binary file and do not add it to the statistics.
  5. If transcoding fails and there are no NULs, interpret the file as a native 1-byte ASCII-compatible encoding (e.g. Latin-1).

Problem: codecvt is deprecated and is to be removed in C++26, so we have no standard facility to transcode UTF.
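
One possible way around this is a small hand-written decoder. The sketch below is a minimal example under stated assumptions, not the project's implementation: `utf8ToUtf32` is a hypothetical name, and the function is assumed to be strict, returning `std::nullopt` on any malformed input (bad lead or continuation byte, truncated sequence, overlong form, surrogate, or value above U+10FFFF) so that the caller can fall back to steps 4 and 5.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: strict UTF-8 to UTF-32 transcoding.
std::optional<std::u32string> utf8ToUtf32(std::string_view in) {
    std::u32string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        auto const lead = static_cast<unsigned char>(in[i]);

        std::size_t len   = 0;  // total bytes in this sequence
        char32_t    cp    = 0;  // code point being assembled
        char32_t    minCp = 0;  // smallest value allowed for this length

        if (lead < 0x80)                { len = 1; cp = lead;        minCp = 0x0;     }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; minCp = 0x80;    }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800;   }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
        else return std::nullopt;       // stray continuation or invalid lead byte

        if (in.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            auto const b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With such a function, a line length in code points is just the number of `char32_t` elements between line breaks in the result. A Latin-1 string such as `"\xE9t\xE9"` fails to decode, which is the case step 5 is meant to catch.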