spdx / license-list-XML

This is the repository for the master files that comprise the SPDX License List
Other
352 stars 286 forks source link

Malformed HTML in multiple license files #1680

Open ghost opened 2 years ago

ghost commented 2 years ago

A similar problem raised in issue #1673 exists with CC-BY-2.5.html in that the nesting of <ul> within <ul> is invalid (the nested <ul> must be inside an <li> element):

source

I suspected that these types of errors are systemic, given that the same issue was encountered twice. Consider updating your processes to run the HTML through the W3C's validator service to check the HTML files so that all these types of errors can be corrected at one time, rather than waiting on individual issues being raised on a per-license basis.

From Linux, this can be accomplished as follows:

cd /tmp
git clone https://github.com/spdx/license-list-data/
cd license*/html
for i in *.html; do
  echo "--------------------------";
  echo "$i";
  echo "--------------------------";
  curl -H "Content-Type: text/html; charset=utf-8" --data-binary @$i "https://validator.w3.org/nu/?out=gnu" | grep -v "1.1-2.*";
  echo "";
done > results.txt

See attached:

results.txt.gz

Further filter the file using something like:

$ grep -v slash results.txt | grep -v 1.1-3 | grep -B2 "error:" | grep html$
AAL.html
Aladdin.html
APL-1.0.html
APSL-1.0.html
APSL-1.1.html
APSL-1.2.html
Artistic-2.0.html
Bitstream-Vera.html
BSD-3-Clause-Clear.html
BSD-Protection.html
Caldera.html
CC-BY-2.5-AU.html
CC-BY-2.5.html
CC-BY-NC-SA-1.0.html
CC-BY-SA-2.0.html
CDL-1.0.html
CECILL-1.1.html
LGPL-2.0.html
LGPL-2.0+.html
LGPL-2.0-only.html
LGPL-2.0-or-later.html
libpng-2.0.html
Motosoto.html
NBPL-1.0.html
OLDAP-1.1.html
OLDAP-1.2.html
OLDAP-1.3.html
OLDAP-1.4.html
OPUBL-1.0.html

That gives a good starting place for addressing most of the non-validating HTML files.

goneall commented 2 years ago

Thanks @DaveJarvis for the analysis.

In looking at the XML for the first few, the <list> elements contain other <list> elements which generates the erroneous HTML.

This is explicitly allowed in the schema file (see https://github.com/spdx/license-list-XML/blob/a8f83ee3dfa94b01274e16f8a1b73f92dea79275/schema/ListedLicense.xsd#L319).

We can update the schema to disallow lists within lists which will prevent future submittals of license XML's with this error.

Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed - which is quite an effort.

I'll create a draft PR with the updated schema, but I won't have time to fix all the licenses. If there are any volunteers willing to help out, you can create a PR against the PR with the schema update. Once all the XML's are fixed up, we can merge the schema update and the fixed licenses.

ghost commented 2 years ago

Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed - which is quite an effort.

If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help.

Perhaps @michaelhkay (https://github.com/michaelhkay) may have some recommendations?

goneall commented 2 years ago

If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help.

That sounds like an excellent idea. I'm not much of an XML expert, so help with the transform would be great. I fact I could use some help getting the XSD to work - see the comment in PR #1681.

goneall commented 2 years ago

With the help of @zvr I was able to create a schema update that catches this error.

There are 45 licenses that will need to be fixed for them to pass the schema validation. This should also fix the malformed HTML next time the license list is published.

Anyone able to help could create a pull request against the branch listschema which can then be merged into PR #1681 before merging into main.

I'll leave the PR in draft mode until we have all the licenses fixed and ready to go.

jlovejoy commented 2 years ago

so, will this mean that we can't have nested lists?

goneall commented 2 years ago

so, will this mean that we can't have nested lists?

@jlovejoy - no, you can still have nested lists. You just need to nest them as follows:

<list>
   <listItem>
      <list> ...

What we are disallowing is the following:

<list>
  <list>

which generates illegal HTML.

ghost commented 2 years ago

so, will this mean that we can't have nested lists?

Lists can be nested. The following fails W3C validation due to improper nesting (an ol or ul cannot be a direct child of an ol or ul element):

<ol>
  <li>1</li>
  <ol>
    <li>a</li>
    <ol>
      <li>i</li>
    </ol>
  </ol>
</ol>

Properly nested to pass W3C validation:

<ol>
  <li>1
  <ol>
    <li>a
    <ol>
      <li>i</li> <!-- close the 'i' list item -->
    </ol>
    </li> <!-- close the 'a' list item -->
  </ol>
  </li> <!-- close the '1' list item -->
</ol>

Most browsers will probably render them fine in practice, but strictly speaking the former does not conform to the standard.

swinslow commented 1 year ago

Is this solved now with #1681? With the schema change there and the licenses where I added <item> tags, I think that fixes most of the licenses that @DaveJarvis listed, but maybe not all of them.

I didn't change the following ones from the list above:

goneall commented 1 year ago

@DaveJarvis Would you mind running the script again with the latest license-list-data? I merged #1681 which should resolve most of the errors, but from the list above, it looks like there still may be other errors - perhaps from a different cause.

ghost commented 1 year ago

Full log: results.txt.gz

$ grep -v slash results.txt | grep -v 1.1-3 | grep -B2 "error:" | grep html$
APL-1.0.html
Bitstream-Vera.html
Caldera.html
CDL-1.0.html
CECILL-1.1.html
libpng-2.0.html
OPUBL-1.0.html

As well as:

$ grep -B2 "error:" results.txt 
:1015.52-1015.57: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:1017.59-1017.64: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:1142.16-1142.19: error: No “p” element in scope but a “p” end tag seen.
--
App-s2p.html
--------------------------
:1.1-3.9: error: Non-space characters found without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.9: error: Element “head” is missing a required instance of child element “title”.
--
Bitstream-Charter.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
Bitstream-Vera.html
--------------------------
:1.1-1.3: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-1.3: error: Element “head” is missing a required instance of child element “title”.
--
Caldera.html
--------------------------
:41.7-41.10: error: No “p” element in scope but a “p” end tag seen.
--
CDL-1.0.html
--------------------------
:39.6-39.9: error: No “p” element in scope but a “p” end tag seen.
:65.6-65.9: error: No “p” element in scope but a “p” end tag seen.
--
--------------------------
:8.9-8.14: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:29.10-29.13: error: No “p” element in scope but a “p” end tag seen.
:57.9-57.12: error: No “p” element in scope but a “p” end tag seen.
--
checkmk.html
--------------------------
:1.1-3.44: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.44: error: Element “head” is missing a required instance of child element “title”.
--
CNRI-Jython.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
diffmark.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
etalab-2.0.html
--------------------------
:1.1-3.41: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.41: error: Element “head” is missing a required instance of child element “title”.
--
libpng-2.0.html
--------------------------
:56.7-56.10: error: No “p” element in scope but a “p” end tag seen.
--
OPUBL-1.0.html
--------------------------
:63.6-63.9: error: No “p” element in scope but a “p” end tag seen.
:81.6-81.9: error: No “p” element in scope but a “p” end tag seen.
--
Ruby.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
SchemeReport.html
--------------------------
:1.1-3.9: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.9: error: Element “head” is missing a required instance of child element “title”.
swinslow commented 1 year ago

Thanks @DaveJarvis! Looking briefly through the errors:

ghost commented 1 year ago

should this be changed in these XML files

Making them valid XML allows for XSLT preprocessing.

goneall commented 1 year ago

Thanks @DaveJarvis and @swinslow for the additional analysis.

I see from the main license-list-data README that the html versions are just meant to be simple HTML snippets, not fully valid HTML files, so should these be disregarded?

Yes - these are snippets to be included inside another "well formed" HTML page, so I think we can ignore these.

Would this be valid, or should this be changed in these XML files to close the starting

tag before the starts?

I found this Stack Overflow article which suggests the paragraph tags should be close before the list elements.

We should probably change the XML and update the XML schema to make future occurrences an error. I added issue #1758 to update the schema. Since it is a bit involved, I won't be tackling this issue within the next week or so, but I'll try to get back to it after the holidays.

swinslow commented 1 year ago

Thank you both.

Even though #1758 is now added to track the specific "lists in paragraphs" issue, for the time being I'll leave this one open as well. After #1758 / #1762 are resolved, @DaveJarvis perhaps you can then re-run your checks to see if any issues are still showing up. Thanks!

ghost commented 1 year ago

Ping me when the main branch is updated.

jlovejoy commented 1 year ago

where are we at with this?

swinslow commented 1 year ago

This is related to #1758 / #1762. I need to refresh my recollection on where these are at following the PR that @goneall and I worked on in Oct. / Dec. in #1681. That said, the related issues are moved to 3.21 so I'm moving this as well.

jlovejoy commented 1 year ago

@goneall @swinslow - is this still an open issue?

ghost commented 1 year ago

FYI, in two days I will be locked out of my account once GitHub begins enforcing MFA. Sorry I won't be able to help any further.

jlovejoy commented 6 months ago

@swinslow @goneall - I'm moving this to "later release" but wuold be good to get to closure!