Open jbusecke opened 10 months ago
Here's a way to sanitize en-dashes:
"MPI‐M".encode("utf-8").replace(b"\xe2\x80\x90", b"-").decode()
The above did not work; it triggered the original error, just on a different line.
wrapper = lambda x: [fn(x)]
File "/home/jovyan/AREAS/CMIP6-PGF/reproducer.py", line 20, in _strip_attrs
new_value = att_value.encode("utf-8").replace(b"\xe2\x80\x90", b"-").decode()
UnicodeEncodeError: 'utf-8 [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|StripAttrs|DetermineSchema/StripAttrs/Strip Attrs']' codec can't encode characters in position 61-63: surrogates not allowed
Exception ignored in: <function File.close at 0x7f1dca683940>
Traceback (most recent call last):
File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/h5netcdf/core.py", line 1200, in close
File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/h5py/_hl/files.py", line 585, in close
TypeError: bad operand type for unary ~: 'NoneType'
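For anyone landing here from the same traceback, here is a minimal sketch of how a Python string can fail a strict UTF-8 encode with "surrogates not allowed". It assumes (this is not confirmed anywhere in the thread) that the attribute bytes were decoded upstream with the `surrogateescape` error handler, which turns every undecodable byte into a lone surrogate code point. The variable names are made up for illustration.

```python
# Raw bytes of "MPI‐M", where the dash is U+2010 encoded as UTF-8.
raw = b"MPI\xe2\x80\x90M"

# Assumption: something upstream decoded these bytes as ASCII with
# errors="surrogateescape", so each non-ASCII byte became a lone surrogate.
escaped = raw.decode("ascii", errors="surrogateescape")
print(len(escaped))  # 7: "MPI", three surrogates, "M"

failure = None
try:
    escaped.encode("utf-8")  # strict mode, like the failing line above
except UnicodeEncodeError as err:
    failure = (err.reason, err.start, err.end)
print(failure)  # ('surrogates not allowed', 3, 6)

# Encoding with the same handler restores the original bytes.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True
```

If that assumption holds, three consecutive offending positions (like "61-63" in the error above) would correspond to the three bytes of a single dash.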
What worked is either
.encode("utf-8", 'ignore').decode()
or
.encode("utf-8", 'replace').decode()
The latter fills in ??? for offending characters. Aside from this triggering some serious cravings for my favorite childhood audiobooks, I think that in the example above this sanitized version (with (..., 'replace')):
Sanitized datasets attributes field references:
MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217
---->
MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI???M Earth System Model version 1.2 (MPI???ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A high???resolution version of the Max Planck Institute Earth System Model MPI???ESM1.2???HR. J. Adv. Model. EarthSyst.,10,1383???1413, doi:10.1029/2017MS001217
more clearly indicates that things were replaced than this (with (..., 'ignore')):
Sanitized datasets attributes field references:
MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217
---->
MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPIM Earth System Model version 1.2 (MPIESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A highresolution version of the Max Planck Institute Earth System Model MPIESM1.2HR. J. Adv. Model. EarthSyst.,10,13831413, doi:10.1029/2017MS001217
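The difference between the two error handlers can be sketched on a short string. This assumes the offending characters are lone surrogates (one per original byte), which would explain why each dash turns into three question marks; the input below is constructed by hand and is not taken from the actual dataset.

```python
# Hand-built stand-in for the attribute value: the three UTF-8 bytes of a
# U+2010 hyphen, represented as lone surrogates (assumed, see above).
raw = b"MPI\xe2\x80\x90M"
escaped = raw.decode("ascii", errors="surrogateescape")

# "ignore" silently drops every character that cannot be encoded.
ignored = escaped.encode("utf-8", "ignore").decode()

# "replace" substitutes a "?" for every character that cannot be encoded.
replaced = escaped.encode("utf-8", "replace").decode()

print(ignored)   # MPIM
print(replaced)  # MPI???M
```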
You've got two different versions of dash that make use of surrogates: b"\xe2\x80\x90" and b"\xe2\x80\x93". To replace both, you'd have to:
s = "..."
s.encode("utf-8").replace(b"\xe2\x80\x90", b"-").replace(b"\xe2\x80\x93", b"-").decode("utf-8")
This also works for me:
s.replace("‐", "-").replace("–", "-")
(use .encode("utf-8") to verify that any multi-byte characters are gone)
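A slightly more compact variant of the same idea, as a sketch: a translation table handles both dashes in one pass, and `str.isascii()` (Python 3.7+) does the verification. This only applies if the string holds the real dash code points; the sample string and the `DASHES` name are illustrative.

```python
# Map U+2010 (hyphen) and U+2013 (en dash) to the ASCII hyphen-minus.
DASHES = str.maketrans({"\u2010": "-", "\u2013": "-"})

s = "MPI\u2010ESM1.2\u2010HR, 1383\u20131413"
clean = s.translate(DASHES)

print(clean)            # MPI-ESM1.2-HR, 1383-1413
print(clean.isascii())  # True: no multi-byte characters remain
```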
Edit: here are some more details on what those three bytes mean, in case you're interested (i.e. the encoding rules for UTF-8). If we look at the binary representation of the first byte (\xe2), we get 11100010, where the number of set bits at the beginning followed by a 0 tells us the total number of bytes (the only exception is 1 byte, which starts with just 0 to be compatible with ASCII – the 10 is the data byte prefix, see below). In other words:

| nbytes | prefix | data bits in first byte |
|---|---|---|
| 1 | 0 | 7 |
| 2 | 110 | 5 |
| 3 | 1110 | 4 |
| 4 | 11110 | 3 |
In this case the prefix is 1110, so three bytes, leaving us with 0010 as data. The other bytes always start with 10, so after removing that from \x80 (10000000) we get 000000, and from \x90 (10010000) we get 010000. That gives us 0010000000010000, or U+2010, which is the Unicode code point for "hyphen". b"\xe2\x80\x93" is then U+2013, or "en dash".
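The bit arithmetic above can be checked with a few lines of Python. This is only a sketch for the 3-byte case; `decode_three_byte` is a helper invented for this illustration, and the result is compared against Python's own UTF-8 decoder.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the code point of a 3-byte UTF-8 sequence (illustrative helper)."""
    b0, b1, b2 = seq
    assert b0 >> 4 == 0b1110, "lead byte must start with 1110"
    assert b1 >> 6 == 0b10 and b2 >> 6 == 0b10, "continuation bytes start with 10"
    # 4 data bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0b1111) << 12) | ((b1 & 0b111111) << 6) | (b2 & 0b111111)


for seq in (b"\xe2\x80\x90", b"\xe2\x80\x93"):
    cp = decode_three_byte(seq)
    # Compare against the built-in decoder as a sanity check.
    print(f"U+{cp:04X}", cp == ord(seq.decode("utf-8")))
# U+2010 True
# U+2013 True
```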
(Deleted last comment because I realized you were saying that there are two forms of the dash that use surrogates!)
I think for now I am ok with replacing all weird characters with ??? in my use-case (I hope I am understanding my solution up top correctly), but this is super helpful @keewis.
@cisaacstern I wonder if this is something for -recipes to take care of properly? Maybe I am just too tired to think about string en/decoding rn and will pick that up again tomorrow, haha.
@cisaacstern and I just debugged a class of failed CMIP6 recipes and came up with a relatively easy reproducer:
gives:
We narrowed this down to some issue with the dataset attributes, more specifically the reference attribute: There seem to be different types of dashes used here, which trip up the encoding.
We concluded this needs a custom sanitizer stage in the CMIP6 recipe, but wanted to leave this issue up if other PGF users came across it.
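As a rough sketch of what such a sanitizer could look like on a plain attributes dictionary: the function name `sanitize_attrs` and the sample values are hypothetical, it is not wired into any recipe or pipeline API, and it uses the `'replace'` handler discussed in this thread.

```python
def sanitize_attrs(attrs: dict) -> dict:
    """Return a copy of attrs whose string values can be encoded as UTF-8.

    Hypothetical helper: every character that cannot be encoded
    (e.g. a lone surrogate) is replaced with "?".
    """
    clean = {}
    for key, value in attrs.items():
        if isinstance(value, str):
            value = value.encode("utf-8", "replace").decode("utf-8")
        clean[key] = value
    return clean


# Made-up attributes: one string with three lone surrogates, one non-string.
attrs = {"references": "MPI\udce2\udc80\udc90M", "version": 1}
sanitized = sanitize_attrs(attrs)
print(sanitized)  # {'references': 'MPI???M', 'version': 1}
```

Non-string values are passed through untouched, and the input dictionary is not modified.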