libwww-perl / HTML-Tagset

HTML::Tagset, a Perl module for helping parse HTML

Various minor changes needed #5

Open PhilterPaper opened 6 months ago

PhilterPaper commented 6 months ago

Here are some HTML v4 changes needed, and some minor other things.

  1. add 'plaintext' to %isOptionalEndTag (admittedly very rarely used)
  2. add 'svg' => ['xmlns', 'xmlns:xlink', 'xmlns:svg', 'xlink:href'], to %linkElements ('svg' is a legit v4 tag, although I'm not sure how much detail you want to get into on its child tags)
  3. add 'reversed' to list of 'ol' attributes in %boolean_attr (requires turning the list into { 'name' => 1} format)
  4. don't duplicate any tag entry in the lists: build up lists from other lists, if possible, adding only tags which don't appear in one of the sublists
  5. in %isPhraseMarkup, add 'svg', 'bdi', 'data', 'iframe', 'picture', 'object', 'param', 'plaintext', 'xmp', 'listing', 'ilayer'
  6. MANIFEST file should NOT include MANIFEST itself
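
To make item 3 concrete, here is a minimal sketch of the `{ 'name' => 1 }` format and of a lookup that tolerates both value styles. The table below is a small local excerpt written for illustration, not the module's actual data, and `is_boolean_attr` is a hypothetical helper name, not part of HTML::Tagset.

```perl
use strict;
use warnings;

# Local, illustrative excerpt in the style of %boolean_attr (assumed entries).
my %boolean_attr = (
    'hr' => 'noshade',                            # one attribute: plain string
    'ol' => { 'compact' => 1, 'reversed' => 1 },  # several: { 'name' => 1 } hash
);

# Hypothetical helper: returns 1 if $attr is a boolean attribute of $tag,
# whichever of the two value formats the entry uses, else 0.
sub is_boolean_attr {
    my ($tag, $attr) = @_;
    my $entry = $boolean_attr{$tag};
    return 0 unless defined $entry;
    if (ref($entry) eq 'HASH') {
        return $entry->{$attr} ? 1 : 0;
    }
    return $entry eq $attr ? 1 : 0;
}
```

Callers that already branch on `ref` would need no change; only code that assumes the 'ol' value is a plain string would be affected.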

Per 74627, we need to work out exactly what is needed for the various lists. See also PhilterPaper/HTML-Tagset, in which I have commented out all the HTML 5 tags, leaving only HTML 4. See #2 for discussion of putting them back in, in some way.

PhilterPaper commented 6 months ago

In my copy, PhilterPaper/HTML-Tagset, I also rearranged the code (per item 4) to define each tag in only one place and to build the higher-level lists from sub-lists. For example (per #9), there is a %isHeadOrBodyElement that becomes part of %isHeadElement, along with a list of head-only tags.
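
A rough sketch of that composition, using deliberately short, illustrative tag lists (the real tables are longer, and `@head_only` and `@duplicates` are names made up for this example):

```perl
use strict;
use warnings;

# Illustrative sub-lists only; each tag is spelled out exactly once.
my %isHeadOrBodyElement = map {; $_ => 1 } qw(script style object);
my @head_only           = qw(title base link meta);

# The higher-level list is composed from the sub-list plus head-only tags.
my %isHeadElement = (%isHeadOrBodyElement, map {; $_ => 1 } @head_only);

# Sanity check for item 4: tags listed in both places would be duplicates.
my @duplicates = grep { $isHeadOrBodyElement{$_} } @head_only;
```

An empty `@duplicates` means no tag was defined twice; a check like this could run in the test suite to keep the sub-lists disjoint.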

There may be some missing tags that fall into sub-lists (e.g., 'area' under 'map') that may need to be dealt with, and we need to make sure that all lists end up containing the right tags, regardless of how they're built up.

My version also contains a full list of attributes for each tag, in case that's useful. I also propose to list all children of a given tag (mandatory children and allowed children). Such a list needs to be searchable from the bottom-up, to see if a given tag is allowed to be under (child of) another one. For example, <tr> can be the direct child of <table> or it can be under <tbody> (and possibly other tags).
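
One way the bottom-up search could work is to keep the parent-to-children table as the single source and invert it once. The sketch below assumes a hypothetical `%allowed_children` table with a few illustrative table-related entries; neither it nor `can_be_child_of` exists in HTML::Tagset today.

```perl
use strict;
use warnings;

# Hypothetical top-down table: parent tag => allowed direct children.
# Entries are illustrative, not a vetted list.
my %allowed_children = (
    'table' => [qw(caption colgroup thead tfoot tbody tr)],
    'thead' => [qw(tr)],
    'tfoot' => [qw(tr)],
    'tbody' => [qw(tr)],
    'tr'    => [qw(th td)],
);

# Invert once so a lookup can start from the child: child => { parent => 1 }.
my %allowed_parents;
for my $parent (keys %allowed_children) {
    $allowed_parents{$_}{$parent} = 1 for @{ $allowed_children{$parent} };
}

# Returns 1 if $child may appear directly under $parent, else 0.
sub can_be_child_of {
    my ($child, $parent) = @_;
    return 0 unless exists $allowed_parents{$child};
    return $allowed_parents{$child}{$parent} ? 1 : 0;
}
```

With this data, `can_be_child_of('tr', 'table')` and `can_be_child_of('tr', 'tbody')` are both true, while `can_be_child_of('td', 'table')` is false.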

Different online lists of tags and attributes don't seem to fully agree with each other, so even v4 is still something of a moving target. Discussion is needed.