jupyter / nbformat

Reference implementation of the Jupyter Notebook format
http://nbformat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Clarify the role of utf-8 #181

Open akhmerov opened 4 years ago

akhmerov commented 4 years ago

In several places nbformat seems to choose utf-8 as the default encoding (in particular when read or write get filenames as input).

Does that mean that notebook files must be utf-8 encoded? If yes, perhaps it would be worth stating it explicitly.

See https://github.com/jupyter/jupyter-sphinx/pull/125 for an example context where this matters.

MSeal commented 4 years ago

Yes, most of the toolchains around ipynb files assume utf-8 encoding. It's a good point that this isn't documented.
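For illustration, here is a minimal sketch of what that assumption looks like in practice, using only the standard library `json` module rather than nbformat itself. The helper names `write_notebook_json` and `read_notebook_json` are hypothetical, and the dict is only notebook-shaped; it is not validated against the nbformat schema.

```python
import json
import tempfile
from pathlib import Path


def write_notebook_json(path, nb):
    # Hypothetical helper (not an nbformat API): always write UTF-8,
    # regardless of the platform's locale encoding.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, ensure_ascii=False, indent=1)


def read_notebook_json(path):
    # Hypothetical helper: decode the file as UTF-8 explicitly.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# A notebook-shaped dict with non-ASCII text in a cell (illustrative only).
nb = {
    "cells": [{"cell_type": "markdown", "metadata": {}, "source": ["café ∑"]}],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.ipynb"
    write_notebook_json(path, nb)
    raw = path.read_bytes()            # bytes as stored on disk
    roundtripped = read_notebook_json(path)
```

Passing `encoding="utf-8"` explicitly matters because the built-in `open()` otherwise falls back to a locale-dependent encoding, which is one way a tool could end up reading or writing a notebook in something other than UTF-8.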

takluyver commented 4 years ago

The spec for JSON essentially says it's stored in UTF-8 unless a specific 'closed ecosystem' needs something different:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8

But it doesn't hurt to make it explicit.

mathograham commented 3 years ago

Hi, I'd like to try to work on stating explicitly that UTF-8 encoding must be used. Where should this information be stated, in the nbformat documentation? This would be my first attempt at a contribution to code, so please let me know if there is anything else I need to consider. Thanks!

akhmerov commented 3 years ago

Since it's a statement about the overall format, I'd say it belongs close to the statement that the file is JSON, so perhaps around here: https://nbformat.readthedocs.io/en/latest/format_description.html

That's my take, the maintainers might have a different opinion though.

westurner commented 3 years ago

To quote in full from the stdlib json module docs: https://docs.python.org/3/library/json.html#character-encodings :+1:

Standard Compliance and Interoperability
----------------------------------------

The JSON format is specified by :rfc:`7159` and by
`ECMA-404 <http://www.ecma-international.org/publications/standards/Ecma-404.htm>`_.
This section details this module's level of compliance with the RFC.
For simplicity, :class:`JSONEncoder` and :class:`JSONDecoder` subclasses, and
parameters other than those explicitly mentioned, are not considered.

This module does not comply with the RFC in a strict fashion, implementing some
extensions that are valid JavaScript but not valid JSON.  In particular:

- Infinite and NaN number values are accepted and output;
- Repeated names within an object are accepted, and only the value of the last
  name-value pair is used.

Since the RFC permits RFC-compliant parsers to accept input texts that are not
RFC-compliant, this module's deserializer is technically RFC-compliant under
default settings.

Character Encodings
^^^^^^^^^^^^^^^^^^^

The RFC requires that JSON be represented using either UTF-8, UTF-16, or
UTF-32, with UTF-8 being the recommended default for maximum interoperability.

As permitted, though not required, by the RFC, this module's serializer sets
*ensure_ascii=True* by default, thus escaping the output so that the resulting
strings only contain ASCII characters.

Other than the *ensure_ascii* parameter, this module is defined strictly in
terms of conversion between Python objects and
:class:`Unicode strings <str>`, and thus does not otherwise directly address
the issue of character encodings.

The RFC prohibits adding a byte order mark (BOM) to the start of a JSON text,
and this module's serializer does not add a BOM to its output.
The RFC permits, but does not require, JSON deserializers to ignore an initial
BOM in their input.  This module's deserializer raises a :exc:`ValueError`
when an initial BOM is present.

The RFC does not explicitly forbid JSON strings which contain byte sequences
that don't correspond to valid Unicode characters (e.g. unpaired UTF-16
surrogates), but it does note that they may cause interoperability problems.
By default, this module accepts and outputs (when present in the original
:class:`str`) code points for such sequences.
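The behaviour described in the quoted section can be checked with a short sketch like the one below. It uses only documented `json` calls; the sample string and variable names are arbitrary.

```python
import json

text = "naïve ✓"

# Default serializer: ensure_ascii=True, so non-ASCII characters are escaped.
escaped = json.dumps(text)

# With ensure_ascii=False the characters are emitted as-is; choosing the
# byte encoding is then up to whoever writes the string out.
literal = json.dumps(text, ensure_ascii=False)

# A str that starts with a BOM is rejected by the deserializer.
try:
    json.loads("\ufeff" + '{"a": 1}')
    bom_rejected = False
except ValueError:
    bom_rejected = True
```

With the default settings the serialized output is pure ASCII, so it decodes identically under UTF-8 and any ASCII-compatible encoding; the encoding question only becomes visible once `ensure_ascii=False` is used.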

Are you suggesting that the nbformat spec needs to say:

What use cases would this unnecessarily impede?

Is this solving for an actual current problem?

sls1005 commented 10 months ago

I suggest stating that some particular encoding (like UTF-8) is preferred or recommended, or is the default.

westurner commented 10 months ago

The JSON spec already specifies UTF-8, and Python implementations of JSON default to it.
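As a rough illustration for the standard library's implementation only (other Python JSON libraries may behave differently): since Python 3.6, `json.loads` accepts `bytes` and expects them to be UTF-8, UTF-16 or UTF-32. The sample document below is made up.

```python
import json

doc = {"title": "Übung"}
text = json.dumps(doc, ensure_ascii=False)

# Encode the same JSON text three ways and let json.loads work out
# which of the three encodings the bytes are in.
decoded = {
    enc: json.loads(text.encode(enc))
    for enc in ("utf-8", "utf-16", "utf-32")
}
```

All three inputs parse to the same object, so at the `json` level UTF-8 is the common case rather than a hard requirement.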

Nbformat has not needed to specify UTF-8.

Why does nbformat need to choose the Unicode character representation, and which existing .ipynb notebook files would be affected, and how?

sls1005 commented 7 months ago

So that people know better how to generate or decode this kind of file.

It will not affect the files themselves, only those who use or will use them.