Open ghost opened 2 years ago
Thanks @DaveJarvis for the analysis.
In looking at the XML for the first few, the <list>
elements contain other <list>
elements which generates the erroneous HTML.
This is explicitly allowed in the schema file (see https://github.com/spdx/license-list-XML/blob/a8f83ee3dfa94b01274e16f8a1b73f92dea79275/schema/ListedLicense.xsd#L319).
We can update the schema to disallow lists within lists which will prevent future submittals of license XML's with this error.
Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed - which is quite an effort.
I'll create a draft PR with the updated schema, but I won't have time to fix all the licenses. If there are any volunteers willing to help out, you can create a PR against the PR with the schema update. Once all the XML's are fixed up, we can merge the schema update and the fixed licenses.
Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed - which is quite an effort.
If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help.
Perhaps @michaelhkay (https://github.com/michaelhkay) may have some recommendations?
If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help.
That sounds like an excellent idea. I'm not much of an XML expert, so help with the transform would be great. I fact I could use some help getting the XSD to work - see the comment in PR #1681.
With the help of @zvr I was able to create a schema update that catches this error.
There are 45 licenses that will need to be fixed for them to pass the schema validation. This should also fix the malformed HTML next time the license list is published.
Anyone able to help could create a pull request against the branch listschema
which can then be merged into PR #1681 before merging into main.
I'll leave the PR in draft mode until we have all the licenses fixed and ready to go.
so, will this mean that we can't have nested lists?
so, will this mean that we can't have nested lists?
@jlovejoy - no, you can still have nested lists. You just need to nest them as follows:
<list>
<listItem>
<list> ...
What we are disallowing is the following:
<list>
<list>
which generates illegal HTML.
so, will this mean that we can't have nested lists?
Lists can be nested. The following fails W3C validation due to improper nesting (an ol
or ul
cannot be a direct child of an ol
or ul
element):
<ol>
<li>1</li>
<ol>
<li>a</li>
<ol>
<li>i</li>
</ol>
</ol>
</ol>
Properly nested to pass W3C validation:
<ol>
<li>1
<ol>
<li>a
<ol>
<li>i</li> <!-- close the 'i' list item -->
</ol>
</li> <!-- close the 'a' list item -->
</ol>
</li> <!-- close the '1' list item -->
</ol>
Most browsers will probably render them fine in practice, but strictly speaking the former does not conform to the standard.
Is this solved now with #1681? With the schema change there and the licenses where I added <item>
tags, I think that fixes most of the licenses that @DaveJarvis listed, but maybe not all of them.
I didn't change the following ones from the list above:
@DaveJarvis Would you mind running the script again with the latest license-list-data? I merged #1681 which should resolve most of the errors, but from the list above, it looks like there still may be other errors - perhaps from a different cause.
Full log: results.txt.gz
$ grep -v slash results.txt | grep -v 1.1-3 | grep -B2 "error:" | grep html$
APL-1.0.html
Bitstream-Vera.html
Caldera.html
CDL-1.0.html
CECILL-1.1.html
libpng-2.0.html
OPUBL-1.0.html
As well as:
$ grep -B2 "error:" results.txt
:1015.52-1015.57: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:1017.59-1017.64: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:1142.16-1142.19: error: No “p” element in scope but a “p” end tag seen.
--
App-s2p.html
--------------------------
:1.1-3.9: error: Non-space characters found without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.9: error: Element “head” is missing a required instance of child element “title”.
--
Bitstream-Charter.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
Bitstream-Vera.html
--------------------------
:1.1-1.3: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-1.3: error: Element “head” is missing a required instance of child element “title”.
--
Caldera.html
--------------------------
:41.7-41.10: error: No “p” element in scope but a “p” end tag seen.
--
CDL-1.0.html
--------------------------
:39.6-39.9: error: No “p” element in scope but a “p” end tag seen.
:65.6-65.9: error: No “p” element in scope but a “p” end tag seen.
--
--------------------------
:8.9-8.14: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:29.10-29.13: error: No “p” element in scope but a “p” end tag seen.
:57.9-57.12: error: No “p” element in scope but a “p” end tag seen.
--
checkmk.html
--------------------------
:1.1-3.44: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.44: error: Element “head” is missing a required instance of child element “title”.
--
CNRI-Jython.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
diffmark.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
etalab-2.0.html
--------------------------
:1.1-3.41: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.41: error: Element “head” is missing a required instance of child element “title”.
--
libpng-2.0.html
--------------------------
:56.7-56.10: error: No “p” element in scope but a “p” end tag seen.
--
OPUBL-1.0.html
--------------------------
:63.6-63.9: error: No “p” element in scope but a “p” end tag seen.
:81.6-81.9: error: No “p” element in scope but a “p” end tag seen.
--
Ruby.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
SchemeReport.html
--------------------------
:1.1-3.9: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.9: error: Element “head” is missing a required instance of child element “title”.
Thanks @DaveJarvis! Looking briefly through the errors:
<!DOCTYPE html>
and <head>
elements at the start. @goneall I see from the main license-list-data README that the html
versions are just meant to be simple HTML snippets, not fully valid HTML files, so should these be disregarded?<list>
element sitting in the middle of <p></p>
tags. Would this be valid, or should this be changed in these XML files to close the starting <p>
tag before the <list>
starts?should this be changed in these XML files
Making them valid XML allows for XSLT preprocessing.
Thanks @DaveJarvis and @swinslow for the additional analysis.
I see from the main license-list-data README that the html versions are just meant to be simple HTML snippets, not fully valid HTML files, so should these be disregarded?
Yes - these are snippets to be included inside another "well formed" HTML page, so I think we can ignore these.
Would this be valid, or should this be changed in these XML files to close the starting
tag before the
starts?
I found this Stack Overflow article which suggests the paragraph tags should be close before the list elements.
We should probably change the XML and update the XML schema to make future occurrences an error. I added issue #1758 to update the schema. Since it is a bit involved, I won't be tackling this issue within the next week or so, but I'll try to get back to it after the holidays.
Thank you both.
Even though #1758 is now added to track the specific "lists in paragraphs" issue, for the time being I'll leave this one open as well. After #1758 / #1762 are resolved, @DaveJarvis perhaps you can then re-run your checks to see if any issues are still showing up. Thanks!
Ping me when the main branch is updated.
where are we at with this?
This is related to #1758 / #1762. I need to refresh my recollection on where these are at following the PR that @goneall and I worked on in Oct. / Dec. in #1681. That said, the related issues are moved to 3.21 so I'm moving this as well.
@goneall @swinslow - is this still an open issue?
FYI, in two days I will be locked out of my account once GitHub begins enforcing MFA. Sorry I won't be able to help any further.
@swinslow @goneall - I'm moving this to "later release" but wuold be good to get to closure!
A similar problem raised in issue #1673 exists with CC-BY-2.5.html in that the nesting of
<ul>
within<ul>
is invalid (the nested<ul>
must be inside an<li>
element):source
I suspected that these types of errors are systemic, given that the same issue was encountered twice. Consider updating your processes to run the HTML through the W3C's validator service to check the HTML files so that all these types of errors can be corrected at one time, rather than waiting on individual issues being raised on a per-license basis.
From Linux, this can be accomplished as follows:
See attached:
results.txt.gz
Further filter the file using something like:
That gives a good starting place for addressing most of the non-validating HTML files.