mbloch / mapshaper

Tools for editing Shapefile, GeoJSON, TopoJSON and CSV files
http://mapshaper.org

feat: encoding auto detect Chinese gb18030 #629

Closed zy6p closed 3 months ago

zy6p commented 3 months ago

I propose adding automatic GB18030 encoding detection to mapshaper to improve its handling of Chinese text in Shapefiles. GB18030 is the most comprehensive Chinese character encoding and a mandated standard, so many modern Chinese datasets use it. Detecting it automatically would let mapshaper read a wider range of Chinese geographic data correctly and make the tool easier to use for a global user base.

mbloch commented 3 months ago

Thanks very much for this contribution! I made a few changes before merging to master.

mbloch commented 3 months ago

I just published v0.6.75, which includes GB18030 detection. I noticed that the original function gave false positives on some other encodings, e.g. Windows-1256 (Arabic), so I added a requirement that a certain percentage of the characters in the source data be common Hanzi. If you find a GB18030 dataset that mapshaper fails to detect, please file a bug report!
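
For readers curious how such a check might work, here is a minimal Python sketch of the idea described above: decode the bytes as GB18030, then require that a minimum share of the non-ASCII characters be common Hanzi. This is not mapshaper's actual code (mapshaper is written in JavaScript). The function name `looks_like_gb18030`, the tiny `COMMON_HANZI` set, and the 0.3 threshold are all assumptions chosen for illustration.

```python
# Illustrative only: a very small set of Hanzi that are frequent in
# Chinese place names. A real detector would use a much larger list.
COMMON_HANZI = frozenset("市县区省镇乡村路街道山河湖海江北南东西中京上州城")


def looks_like_gb18030(data: bytes, min_common_ratio: float = 0.3) -> bool:
    """Guess whether `data` is GB18030-encoded Chinese text.

    Hypothetical heuristic: the bytes must decode as GB18030 without
    errors, and at least `min_common_ratio` of the non-ASCII characters
    must belong to COMMON_HANZI.
    """
    try:
        text = data.decode("gb18030")
    except UnicodeDecodeError:
        return False  # not valid GB18030 at all

    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return False  # pure ASCII gives no evidence either way

    common = sum(ch in COMMON_HANZI for ch in non_ascii)
    return common / len(non_ascii) >= min_common_ratio


if __name__ == "__main__":
    chinese = "北京市海淀区中关村".encode("gb18030")
    arabic = "مرحب".encode("cp1256")  # Windows-1256 bytes
    print(looks_like_gb18030(chinese))  # True
    print(looks_like_gb18030(arabic))   # False
```

The second step matters because many byte sequences from other legacy encodings also happen to be valid GB18030. In the Arabic example, the four Windows-1256 bytes decode cleanly as two Hanzi, but neither is in the common set, so the sample is rejected.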