mzelst / covid-19

https://doi.org/10.5281/zenodo.5163263
Creative Commons Zero v1.0 Universal

Fryslân wrong as FryslÃ¢n in PDF and TEX: double unicode #39

Open sanderjo opened 3 years ago

sanderjo commented 3 years ago

(Dutch is fine too, by the way)

Fryslân is wrong in the PDF and TEX: probably because it was UTF-8 encoded twice


Cause: https://github.com/mzelst/covid-19/blob/master/reports/daily_report.tex#L225

So that says "FryslÃ¢n". Inspection shows that the "â" takes up 4 bytes in total:

/git/covid-19/reports$ cat daily_report.tex | grep Frysl | head -1  | hd
00000000  46 72 79 73 6c c3 83 c2  a2 6e 20 26 20 31 37 30  |Frysl....n & 170|
00000010  32 20 26 20 32 36 31 2e  31 20 26 20 31 20 26 20  |2 & 261.1 & 1 & |
00000020  30 2e 32 20 26 20 31 20  26 20 30 2e 32 5c 5c 0a  |0.2 & 1 & 0.2\\.|
00000030

So: c3 83 c2 a2 ... that's a bit too much.

In other locations, Fryslân is spelled correctly:

~/git/covid-19$ cat data-rivm/disabled-people-per-day/rivm_daily_2020-12-23.csv | grep Frysl | head -1
"2020-12-23 10:00:00","2020-07-01","VR02","Fryslân",0,0,0,0

... and encoded correctly: a single 2-byte UTF-8 sequence, c3 a2

~/git/covid-19$ cat data-rivm/disabled-people-per-day/rivm_daily_2020-12-23.csv | grep Frysl | head -1  | hd
00000000  22 32 30 32 30 2d 31 32  2d 32 33 20 31 30 3a 30  |"2020-12-23 10:0|
00000010  30 3a 30 30 22 2c 22 32  30 32 30 2d 30 37 2d 30  |0:00","2020-07-0|
00000020  31 22 2c 22 56 52 30 32  22 2c 22 46 72 79 73 6c  |1","VR02","Frysl|
00000030  c3 a2 6e 22 2c 30 2c 30  2c 30 2c 30 0a           |..n",0,0,0,0.|
0000003d
sanderjo commented 3 years ago

I don't know which workflow generates the file in reports.

So I used this hack to correct it to "Fryslân":

sed -i 's/FryslÃ¢n/Fryslân/g' daily_report.tex
sed -i 's/FryslÃ¢n/Fryslân/g' daily_report.ttt

After running pdflatex daily_report, I had a corrected PDF.

sanderjo commented 3 years ago

FWIW: files containing the wrong spelling:

~/git/covid-19$ grep -irn FryslÃ¢n *
misc/maps/Gemeentegrenzen2021_simpelRD.geojson:331:{ "type": "Feature", "properties": { "id": 326, "statcode": "GM1900", "jrstatcode": "2021GM1900", "statnaam": "Súdwest-FryslÃ¢n", "rubriek": "gemeente" }, "geometry": { "type": "MultiPolygon", "coordinates": [ ... ] } },
misc/maps/Gemeentegrenzen2021_simpelRD.geojson:356:{ "type": "Feature", "properties": { "id": 351, "statcode": "GM1970", "jrstatcode": "2021GM1970", "statnaam": "Noardeast-FryslÃ¢n", "rubriek": "gemeente" }, "geometry": { "type": "MultiPolygon", "coordinates": [ ... ] } },
misc/scripts/coronaboetes.R:7:                                "Noardeast-FryslÃ¢n" = "Noardeast-Fryslân",
misc/scripts/coronaboetes.R:8:                                "Súdwest-FryslÃ¢n" = "Súdwest-Fryslân")
plot_scripts/daily_maps_plots.R:72:ggd.data$statnaam <- recode(ggd.data$statnaam, "GGD FryslÃ¢n" = "GGD Fryslân",
reports/daily_report.Rmd:200:dat.weekago$Province <- recode(dat.weekago$Province,"Friesland" = "FryslÃ¢n")
workflow/dashboards/cases_ggd_agegroups.R:5:dat$Municipal_health_service <- recode(dat$Municipal_health_service, "GGD FryslÃ¢n" = "GGD Fryslân",

So ... first step might be: brute-forcing the above sed command on these files to see if the problem goes away on next run ... ?
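
A more general alternative to a per-word sed could look roughly like the Python sketch below. It assumes the damage is always of the kind seen in the hexdump (UTF-8 bytes misread as single-byte Latin-1 characters and then encoded to UTF-8 again), and it only handles 2-byte sequences such as â or ú. The name `repair_double_utf8` is made up for this illustration and is not part of the repo.

```python
import re

# A Latin-1 character in the range U+00C2..U+00DF followed by one in
# U+0080..U+00BF: taken as raw bytes, such a pair is always a valid
# 2-byte UTF-8 sequence, so it is a likely sign of double encoding.
_DOUBLE_ENCODED = re.compile("[\u00c2-\u00df][\u0080-\u00bf]")


def repair_double_utf8(text: str) -> str:
    """Undo one round of 'UTF-8 read as Latin-1' for 2-byte sequences."""

    def _fix(match: re.Match) -> str:
        # Turn the two characters back into the two original bytes,
        # then decode those bytes as the UTF-8 they really were.
        return match.group(0).encode("latin-1").decode("utf-8")

    return _DOUBLE_ENCODED.sub(_fix, text)


# The broken spelling is written with escapes to keep the example unambiguous.
broken = "Frysl\u00c3\u00a2n"          # displays as FryslÃ¢n
repaired = repair_double_utf8(broken)  # displays as Fryslân
```

Correctly encoded text such as "Fryslân" is left alone, because â (U+00E2) is outside the lead-character range. Text that legitimately contains a pair like "Â" followed by "©" would be altered, so this is best run only on files known to be affected.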

sanderjo commented 3 years ago

Root cause:

â in UTF-8 is C3 A2

But if those two bytes are not interpreted as UTF-8 but incorrectly as two separate single-byte characters (strictly speaking Latin-1 rather than ASCII, since both bytes are above 0x7F), then you get

>>> chr(0xC3)
'Ã'
>>> chr(0xA2)
'¢'

So: Ã¢

And then ... if you encode those two characters into UTF-8 (each needs 2 bytes, because they are non-US-ASCII), you get ... 4 bytes:

>>> 'Ã¢'.encode()
b'\xc3\x83\xc2\xa2'

... which we saw earlier on: c3 83 c2 a2

So, somewhere, probably, a correct UTF-8 file is incorrectly read as a single-byte encoding (Latin-1 or similar), and then written out as UTF-8 again ...
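
The whole round trip can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism, assuming the misreading step behaves like a Latin-1 decode; which tool in the actual pipeline does this is not known from the above.

```python
original = "Fryslân"

# Step 1: the correct file on disk, with â stored as the two bytes c3 a2.
utf8_bytes = original.encode("utf-8")

# Step 2: something reads those bytes as Latin-1, one character per byte.
misread = utf8_bytes.decode("latin-1")

# Step 3: the misread text is written out as UTF-8 again.
double_encoded = misread.encode("utf-8")

# Reversing steps 3 and 2 recovers the original string.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

Here `double_encoded` contains the same c3 83 c2 a2 sequence as the hexdump of daily_report.tex, and `recovered` equals the original spelling.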