tc39 / proposal-shadowrealm

ECMAScript Proposal, specs, and reference implementation for Realms
https://tc39.es/proposal-shadowrealm/

Look for names in WebIDL that are not valid JS identifiers #55

Closed mikesamuel closed 4 years ago

mikesamuel commented 7 years ago

Issue 54 would benefit from knowing whether the default environment globally binds any properties whose names are not valid identifiers.

Crawl WebIDL and compare against the ReservedWord production.
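
A minimal Python 3 sketch of the comparison step, assuming the attribute names have already been pulled out of the IDL. The reserved-word list mirrors the ES2015 keyword set used in the script later in this thread. `str.isidentifier()` is only a stand-in for IdentifierName (it follows XID_Start/XID_Continue and knows nothing about `$`), so the answer is approximate. The function name `is_valid_js_identifier` is hypothetical.

```python
# Approximate check: "can this name be used as a JS identifier?"
RESERVED_WORDS = frozenset("""
    break case catch class const continue debugger default delete do else
    export extends finally for function if import in instanceof new return
    super switch this throw try typeof var void while with yield
    let static enum await implements package protected interface private
    public null true false
""".split())


def is_valid_js_identifier(name):
    """Return True if name looks like a non-reserved JS identifier."""
    if not name or name in RESERVED_WORDS:
        return False
    # JS allows `$` anywhere and ZWNJ/ZWJ after the first character;
    # Python does not, so map them to `_` before asking Python.
    head = name[0].replace("$", "_")
    tail = "".join("_" if c in "$\u200c\u200d" else c for c in name[1:])
    return (head + tail).isidentifier()


names = ["className", "default", "import", "$el", "2fast"]
print([n for n in names if not is_valid_js_identifier(n)])
```

Names that fail the check are the ones worth looking up in the IDL by hand.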

mikesamuel commented 7 years ago

You can assign me since I was crawling WebIDL the other day, but I don't have perms to assign myself.

ljharb commented 7 years ago

(Linking to #54)

mikesamuel commented 7 years ago

Results

NOT IDENTIFIERS

class in TestObject @ blink//Source/bindings/tests/idls/core/TestObject.idl:0
import in HTMLLinkElement @ blink//Source/core/html/HTMLLinkElement.idl:0
default in HTMLMenuItemElement @ blink//Source/core/html/HTMLMenuItemElement.idl:0
default in HTMLTrackElement @ blink//Source/core/html/HTMLTrackElement.idl:0
default in SpeechSynthesisVoice @ blink//Source/modules/speech/SpeechSynthesisVoice.idl:0

Methodology

$ mkdir blinky
$ cd blinky
$ sudo $(which port) install py27-ply
$ sudo $(which port) install py27-regex
$ git clone https://chromium.googlesource.com/chromium/blink
$ git clone https://chromium.googlesource.com/chromium/src/tools/idl_parser

Then I adapted an existing script, shown below, that crawls the WebIDL files. I haven't stripped out the extraneous cruft.

#!python

# To get this to run I did, in the containing directory
# $ git clone https://chromium.googlesource.com/chromium/blink
# $ git clone https://chromium.googlesource.com/chromium/src/tools/idl_parser
# and I installed py27-ply via port which provides lex.

import json
import os
import re
import regex
import sys

# Hack a py path so that I don't have to assume there's a parent directory with
# an __init__.py.
module_path, module_name = os.path.split(__file__)

# blink_idl_parser depends on idl_parser
idl_parser_dir = os.path.join(module_path, 'idl_parser')
sys.path.append(idl_parser_dir)

# Parses WebIDL with syntax extensions.
blink_idl_parser_dir = os.path.join(
    module_path, 'blink', 'Source', 'bindings', 'scripts')
assert os.path.exists(blink_idl_parser_dir), blink_idl_parser_dir
sys.path.append(blink_idl_parser_dir)

# These have to follow the sys.path muckery
from blink_idl_lexer import BlinkIDLLexer  # pylint: disable=g-import-not-at-top
from blink_idl_parser import BlinkIDLParser
from idl_node import IDLSearch
import idl_parser
from idl_parser.idl_parser import ParseFile

_SPACE_AT_EOL = re.compile(r' +$', re.M)

# http://www.ecma-international.org/ecma-262/6.0/#sec-identifier-names
_KEYWORDS = set(
    [
        'break', 'do', 'in', 'typeof',
        'case', 'else', 'instanceof', 'var',
        'catch', 'export', 'new', 'void',
        'class', 'extends', 'return', 'while',
        'const', 'finally', 'super', 'with',
        'continue', 'for', 'switch', 'yield',
        'debugger', 'function', 'this',
        'default', 'if', 'throw',
        'delete', 'import', 'try',
    ] + [
        'yield', 'let', 'static',
    ] + [
        'false', 'true', 'null'
    ] + [
        'enum', 'await',
    ] + [
        'implements', 'package', 'protected',
        'interface', 'private', 'public',
    ])

_IDENTIFIER_START = '[$_[:IdStart:]]'
_IDENTIFIER_PART = '[$_[:IdContinue:]\u200C\u200D]'

_IDENTIFIER_NAME = regex.compile(
    ur'^(?:%s%s*)\Z' % (_IDENTIFIER_START, _IDENTIFIER_PART),
    regex.U)

def isValidJsIdentifier(s):
    return s not in _KEYWORDS and _IDENTIFIER_NAME.match(s) is not None

def _strip_space_at_eol(s):
  """Strip spaces from end of line.

  Useful with json.JSONEncoder which leaves
  spaces after commas even when pretty printing.

  Args:
    s: The string to strip from.

  Returns:
    The string without runs of spaces before line ends.
  """

  return _SPACE_AT_EOL.sub('', s)

def _multimap_add(m, k, v):
  """Adds a value to a collection of values in a map.

  Args:
     m: the multimap
     k: the key
     v: a single value
  """
  if k not in m:
    m[k] = []
  m[k].append(v)

def process_files(file_nodes):
  """Processes a batch of IDL files.

  Args:
    file_nodes: IDL File AST nodes

  Raises:
    Exception: on an unrecoverable failure to extract the result.

  Returns:
    A map from lower-case HTML attribute names to the JavaScript property
    names that reflect them, collected from interfaces that extend Element.
  """

  NAMES_THAT_ARE_NOT_JS_IDENTS = []

  class Processor(IDLSearch):
    """Collects info about interface inheritance and reflected attributes."""

    # We need a minimal set of names of HTML/SVG/Math-ML element attributes
    # that differ from a reflecting DOM object property.
    #
    # To do this, we need to determine
    # 1. the set of IDL interfaces that transitively inherit from interface
    #    Element.
    #    There are two AST patterns that we look for:
    #    a. The pattern
    #      (Interface NAME=MyInterface
    #        (Inherit NAME=SuperType)
    #        ...
    #      )
    #      represents IDL of the form `interface MyInterface : SuperType`
    #      and `partial interface MyInterface : SuperType`
    #    b. Top level IDL declarations like
    #      `HTMLElement implements GlobalEventHandlers;`
    #      which is represented in the AST thus
    #      (Implements NAME=HTMLElement REFERENCE=GlobalEventHandlers)
    #
    #    DERIVATION:
    #    https://www.chromium.org/blink/webidl#TOC-dependencies lists the
    #    ways in which one IDL file can depend on any other IDL files.
    #    It includes
    #    1. partial interfaces which are irrelevant to inheritance as long
    #       as a partial interface declaration that specifies inheritance
    #       is not clobbered by one that does not.
    #    2. `implements` which is handled above
    #    3. Ancestors which talks about transitive inheritance which is
    #       handled in a post processing pass.  See compute_supertypes(...)
    #    4. Used interfaces which are irrelevant to inheritance.
    #
    # 2. The IDL attributes on those Element subtypes that reflect HTML
    #    attributes or which are settable and specially handled by the browser.
    #    To identify these we look for AST nodes like
    #    (Attribute NAME=myAttribute
    #       (ExtAttribute NAME=Reflect VALUE=...)
    #       (ExtAttribute NAME=CustomElementCallbacks)
    #       (ExtAttribute NAME=CEReactions)
    #    )
    #    and associate the attribute NAME and any Reflect VALUE with the
    #    containing interface.
    #
    #    DERIVATION:
    #    [Reflect] is defined at
    #    https://html.spec.whatwg.org/multipage/infrastructure.html#reflect
    #      Some IDL attributes are defined to reflect a particular content
    #      attribute. This means that on getting, the IDL attribute returns
    #      the current value of the content attribute, and on setting, the IDL
    #      attribute changes the value of the content attribute to the given
    #      value.
    #
    #    The optional value of the [Reflect] attribute is the name of the
    #    attribute reflected, e.g.
    #      [Reflect=class] attribute DOMString className;
    #
    #    [CustomElementCallbacks] is defined at
    #    https://chromium.googlesource.com/chromium/src/+/master
    #    /third_party/WebKit/Source/bindings/IDLExtendedAttributes.md
    #    #CustomElementCallbacks_m_a
    #      This attribute is only for Custom Elements V0, and is superseded by
    #      [CEReactions] for V1.
    #
    #    [CEReactions] is defined by https://html.spec.whatwg.org/ thus:
    #      To ensure custom element reactions are triggered appropriately,
    #      we introduce the [CEReactions] IDL extended attribute. It indicates
    #      that the relevant algorithm is to be supplemented with additional
    #      steps in order to appropriately track and invoke custom element
    #      reactions.
    #    so we recognize these three annotations to identify those IDL
    #    attributes which are of particular interest to custom elements and
    #    which associate with HTML attributes.
    #
    #    An inspection of attributes listed in
    #    https://chromium.googlesource.com/chromium/src/+/master
    #    /third_party/WebKit/Source/bindings/IDLExtendedAttributes.md
    #    indicates no others that should be relevant.
    #    Of note are:
    #    1. [ReflectEmpty], ..., [ReflectOnly] which are only used in
    #       conjunction with [Reflect].
    #    2. [PutForwards] which relates to HTML attributes which should
    #       themselves be caught by the rules above and is only used
    #       on readonly IDL attributes.

    def __init__(self):
      IDLSearch.__init__(self)
      # Maps sub-types to super-types non-transitively.
      # There will be an entry 'Sub': 'Super'
      # wherever we see
      #     interface Sub : Super { ... }
      # If there is no explicit super type then there is
      # no entry, so the keyset is not the set of all
      # interfaces.
      self.inherits = {}
      # Maps interface names to (js_property_name, html_attr_name) pairs.
      self.reflected_attrs = {}
      # Current IDL attribute name.
      self.current_attribute = None
      # Current interface name.  Not set for dictionary declarations
      # since we don't care.
      self.current_interface = None

    def Enter(self, node):
      try:
        cls = node.GetClass()
        if cls == 'Interface':
          self.current_interface = node.GetProperty('NAME')
          assert self.current_interface is not None
          if self.current_interface not in self.reflected_attrs:
            # don't clobber other parts of partial interfaces.
            # https://heycam.github.io/webidl/#dfn-partial-interface
            self.reflected_attrs[self.current_interface] = []
        elif cls == 'Inherit':
          # Dictionaries can inherit so the interface can be None.
          # We don't care.
          if self.current_interface is not None:
            _multimap_add(self.inherits, self.current_interface,
                          node.GetProperty('NAME'))
        elif cls == 'Implements':  # Top level super-type.
          _multimap_add(self.inherits, node.GetProperty('NAME'),
                        node.GetProperty('REFERENCE'))
        elif cls == 'Attribute':
          self.current_attribute = node.GetProperty('NAME')
          # Check for a missing name before handing it to the regex.
          assert self.current_attribute is not None
          if not isValidJsIdentifier(self.current_attribute):
            NAMES_THAT_ARE_NOT_JS_IDENTS.append((
                self.current_interface,
                self.current_attribute,
                (node.GetProperty('FILENAME'),
                 node.GetProperty('LINENO'))))
        elif cls == 'ExtAttribute':
          # The name of an IDL attribute corresponds to a
          # property on a JavaScript object, while the
          # value corresponds to an HTML element attribute.
          # Above, "attribute" means "IDL attribute", but
          # hereafter, "attribute" means "HTML attribute".
          prop = self.current_attribute
          if prop is not None:  # None for interface annotations.
            extattr = node.GetProperty('NAME')
            html_attr = None
            is_reflected = False
            if extattr == 'Reflect':
              html_attr = node.GetProperty('VALUE') or None
              is_reflected = True
            elif extattr in ('CustomElementCallbacks', 'CEReactions'):
              html_attr = None
              is_reflected = True
            if is_reflected and (
                # We need a table entry when our simple heuristic fails.
                # Our heuristic says that the property name is the same
                # as the canonical (lower-case) HTML attribute name.
                (html_attr or prop) != prop.lower()):
              self.reflected_attrs[self.current_interface].append(
                  (prop, html_attr))
      except:
        print >>sys.stderr, '%s:%s' % (node.GetProperty('FILENAME'),
                                       node.GetProperty('LINENO'))
        raise

    def Exit(self, node):
      cls = node.GetClass()
      if cls == 'Interface':
        interface_name = self.current_interface
        assert interface_name is not None
        self.current_interface = None

        # Some IDL attributes carry both [Reflect=...] and
        # [CustomElementCallbacks], in which case we will have two
        # entries on the reflected list for the interface.
        # Collapse those.
        prop_to_attr = {}
        for prop, attr in self.reflected_attrs[interface_name]:
          if attr is not None or prop not in prop_to_attr:
            prop_to_attr[prop] = attr
        self.reflected_attrs[interface_name] = [
            (prop, attr or prop.lower())
            for prop, attr in prop_to_attr.iteritems()
            ]
      elif cls == 'Attribute':
        assert self.current_attribute is not None
        self.current_attribute = None

  p = Processor()
  for f in file_nodes:
    f.Traverse(p, ())

  def dump():
    print '\n\nNOT IDENTIFIERS'
    for (class_name, attribute, (fname, lineno)) in \
        NAMES_THAT_ARE_NOT_JS_IDENTS:
      print '%s in %s @ %s:%s' % (attribute, class_name, fname, lineno)
  dump()

  inherits = p.inherits
  reflected_attrs = p.reflected_attrs

  # Now that we've collected interface info,
  # we need to identify all Element sub-interfaces.
  super_type_sets = {}
  def compute_supertypes(interface_name):
    # Due to top-level `X implements Y;` declarations
    # dictionary names might reach interface_name, but
    # that shouldn't matter since we only care about
    # sub-types of Element and no dictionary is going
    # to be either a super-type or a sub-type of Element.
    if interface_name not in super_type_sets:
      stypes = set()
      stypes.add(interface_name)
      super_type_sets[interface_name] = stypes
      for stype in inherits.get(interface_name, ()):
        compute_supertypes(stype)
        stypes.update(super_type_sets[stype])

  for interface_name in inherits.iterkeys():
    compute_supertypes(interface_name)

  # Now, find all the element sub-types, and union their
  # mixed-case properties into a mapping from lower-case
  # property names to additional property names to check.
  lcase_to_mixed = {}

  for interface_name, stypes in super_type_sets.iteritems():
    if 'Element' in stypes:
      for stype in stypes:
        if stype in reflected_attrs:
          for prop_name, attr_name in reflected_attrs[stype]:
            old_prop_name = lcase_to_mixed.get(attr_name)
            if old_prop_name is None:
              lcase_to_mixed[attr_name] = prop_name
            elif old_prop_name != prop_name:
              raise Exception('Ambiguous %s maps to %s and %s' % (
                  attr_name, prop_name, old_prop_name))

  return lcase_to_mixed

def write_js(lcase_to_mixed, out):
  """Write JavaScript to out.

  Args:
    lcase_to_mixed: maps lowercase HTML attribute names to JS property names.
    out: output stream for JS source.
  """

  prop_to_attr = {}
  noncanon_props = []
  oddities = {}

  for a, p in lcase_to_mixed.iteritems():
    # We generate this map lazily in the JS, but
    # goog.object.transpose would have undefined behavior
    # if this assert failed, so we check it here.
    assert p not in prop_to_attr
    prop_to_attr[p] = a

    if p.lower() == a:
      assert p != a
      noncanon_props.append(p)
    else:
      oddities[a] = p

  noncanon_props.sort()
  encoder = json.JSONEncoder(sort_keys=True, ensure_ascii=True, indent=2)

  print >>out, """
/**
 * @license
 * Copyright (c) 2017 The Polymer Project Authors. All rights reserved.
 * This code may only be used under the BSD style license found at
 * http://polymer.github.io/LICENSE.txt
 * The complete set of authors may be found at
 * http://polymer.github.io/AUTHORS.txt
 * The complete set of contributors may be found at
 * http://polymer.github.io/CONTRIBUTORS.txt
 * Code distributed by Google as part of the polymer project is also
 * subject to an additional IP rights grant found at
 * http://polymer.github.io/PATENTS.txt
 */

goog.provide('security.html.namealiases');

goog.require('goog.object');
goog.require('goog.string');

/**
 * @fileoverview
 * Provides a mapping from HTML attribute to JS object property names.
 */

/**
 * Maps JavaScript object property names to HTML attribute names.
 *
 * @param {string} propName a JavaScript object property name.
 * @return {string} an HTML element attribute name.
 */
security.html.namealiases.propertyToAttr = function (propName) {
  var propToAttr = security.html.namealiases.propToAttr_;
  if (!propToAttr) {
    var attrToProp = security.html.namealiases.getAttrToProp_();
    propToAttr = security.html.namealiases.propToAttr_ =
        goog.object.transpose(attrToProp);
  }
  var attr = propToAttr[propName];
  if (goog.isString(attr)) {
    return attr;
  }
  // Arguably we could do propName.toLowerCase, but these
  // two functions should be inverses.
  return goog.string.toSelectorCase(propName);
};

/**
 * Maps HTML attribute names to JavaScript object property names.
 *
 * @param {string} attrName an HTML element attribute name.
 * @return {string} a JavaScript object property name.
 */
security.html.namealiases.attrToProperty = function (attrName) {
  var canonAttrName = String(attrName).toLowerCase();
  var attrToProp = security.html.namealiases.getAttrToProp_();
  var prop = attrToProp[canonAttrName];
  if (goog.isString(prop)) {
    return prop;
  }
  return goog.string.toCamelCase(canonAttrName);
};

/**
 * Instead of trusting a property name, we assume the worst and
 * try to map it to a property name with known special semantics.
 *
 * @param {string} name a JavaScript object property or HTML attribute name.
 * @return {?string} a JavaScript object property name if there is a special
 *   mapping that is different from that given.
 */
security.html.namealiases.specialPropertyNameWorstCase = function (name) {
  var lcname = name.toLowerCase();
  var attrToProp = security.html.namealiases.getAttrToProp_();
  var prop = attrToProp[lcname];
  if (goog.isString(prop)) {
    return prop;
  }
  return null;
};

/**
 * Returns a mapping from lower-case HTML attribute names to
 * property names that reflect those attributes.
 *
 * @return {!Object.<string, string>}
 * @private
 */
security.html.namealiases.getAttrToProp_ = function () {
  if (!security.html.namealiases.attrToProp_) {
    security.html.namealiases.attrToProp_ = goog.object.clone(
        security.html.namealiases.ODD_ATTR_TO_PROP_);
    var noncanon = security.html.namealiases.NONCANON_PROPS_;
    for (var i = 0, n = noncanon.length; i < n; ++i) {
      var name = noncanon[i];
      security.html.namealiases.attrToProp_[name.toLowerCase()] = name;
    }
  }
  return security.html.namealiases.attrToProp_;
};

/**
 * Mixed-case property names that correspond directly to an attribute
 * name ignoring case.
 *
 * @type {!Array.<string>}
 * @const
 * @private
 */
security.html.namealiases.NONCANON_PROPS_ = %(noncanon_props)s;

/**
 * Attribute name to property name mappings that are neither identity
 * nor simple lowercasing, like {@code "htmlFor"} -> {@code "for"}.
 *
 * @type {!Object.<string, string>}
 * @private
 */
security.html.namealiases.ODD_ATTR_TO_PROP_ = %(oddities)s;

/**
 * Maps lower-case HTML attribute names to property names that reflect
 * those attributes.
 *
 * <p>
 * This is initialized to a partial value that is then lazily fleshed out
 * based on ODD_ATTR_TO_PROP_ and NONCANON_PROPS_.
 * </p>
 *
 * @type {?Object.<string, string>}
 * @private
 */
security.html.namealiases.attrToProp_ = null;

/**
 * Maps property names to lower-case HTML attribute names
 * that are reflected by those properties.
 *
 * Lazily generated from attrToProp_.
 *
 * @type {?Object.<string, string>}
 * @private
 */
security.html.namealiases.propToAttr_ = null;
""".strip() % {
    'module_name': module_name,
    'noncanon_props': _strip_space_at_eol(encoder.encode(noncanon_props)),
    'oddities': _strip_space_at_eol(encoder.encode(oddities)),
    }

if __name__ == '__main__':
  parser = BlinkIDLParser(BlinkIDLLexer())
  element_to_reflected_attribute = process_files([
      ParseFile(parser, idl_file)
      for idl_file in sys.argv[1:]
  ])
  # write_js(element_to_reflected_attribute, sys.stdout)

Finally, I ran it thus:

$ find blink/ -name \*.idl | grep -v InspectorInstrumentation | xargs python2.7 crawl_idl.py

The InspectorInstrumentation.idl file seems to contain some cpp directives that break the IDL parser, so I skipped it. I think it's an input to a code generator for properly formed IDL files.
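
Instead of excluding that one file by name, files could be screened for preprocessor directives before parsing. This is a rough Python 3 sketch. The list of directives is a guess at what "cpp directives" covers, since the actual contents of InspectorInstrumentation.idl are not shown here, and `has_cpp_directives` is a hypothetical helper.

```python
import re

# Lines such as `#if ENABLE(X)` or `#include "foo.h"`.
# The directive list is an assumption, not taken from the Blink sources.
_CPP_DIRECTIVE = re.compile(
    r"^\s*#\s*(?:if|ifdef|ifndef|elif|else|endif|include|define|undef|pragma)\b",
    re.M)


def has_cpp_directives(idl_text):
    """Return True if the text contains something that looks like a cpp directive."""
    return _CPP_DIRECTIVE.search(idl_text) is not None


clean = "interface Foo {\n  attribute boolean bar;\n};\n"
dirty = "#if ENABLE(INSPECTOR)\ninterface Foo {};\n#endif\n"
print(has_cpp_directives(clean), has_cpp_directives(dirty))
```

A crawl could read each `.idl` file, skip those for which the helper returns True, and log the skipped paths so that nothing is silently dropped.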

mikesamuel commented 7 years ago

I'm ignoring class in TestObject since that's probably not reachable from Window.

import is readonly in HTMLLinkElement.idl:L50

   // HTML Imports
   // https://w3c.github.io/webcomponents/spec/imports/#interface-import
   readonly attribute Document? import;

default is writable per HTMLMenuItemElement

    [CEReactions, Reflect] attribute boolean default;
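
The readonly/writable distinction above can also be read off mechanically. This is a loose Python 3 sketch that scans IDL text with a regular expression rather than the Blink parser, so it will be confused by unusual formatting or by the word "attribute" inside comments. The names `flag_reserved_attributes` and `NOT_IDENTIFIERS` are hypothetical.

```python
import re

# Matches declarations such as `readonly attribute Document? import;` or
# `[CEReactions, Reflect] attribute boolean default;`. Deliberately loose.
_ATTRIBUTE = re.compile(
    r"(?P<readonly>\breadonly\s+)?\battribute\s+"
    r"(?P<type>[^;]+?)\s+(?P<name>\w+)\s*;")

# The non-identifier names reported in the results earlier in this thread.
NOT_IDENTIFIERS = {"class", "import", "default"}


def flag_reserved_attributes(idl_text):
    """Yield (name, is_readonly) for attributes named like a reserved word."""
    for m in _ATTRIBUTE.finditer(idl_text):
        if m.group("name") in NOT_IDENTIFIERS:
            yield m.group("name"), m.group("readonly") is not None


sample = """
   // HTML Imports
   readonly attribute Document? import;
   [CEReactions, Reflect] attribute boolean default;
   attribute DOMString className;
"""
print(list(flag_reserved_attributes(sample)))
```

Writable attributes are the more interesting case for issue 54, since they can be assigned to as well as read.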

caridy commented 4 years ago

related to the shim, closing in favor of the other repo.