Open jamin-aws-ospo opened 1 year ago
Are these due to some kind of encoding issue? These are copied from the test license text which are typically copy/pasted from the original license.
@jamin-aws-ospo Can you check of the origin test licenses have the same issues? That would help isolate if this is a source problem or a tooling problem.
I'd be happy to check them, but I'm not sure what source(s) you consider to be the original.
For instance, what is the original (or official source of the) APL-1.0?
I'd be happy to check them, but I'm not sure what source(s) you consider to be the original.
I would just look at the files in this test file directory and see if you're seeing the same issue: https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator
I'm not sure I would actually call these file "original", but they are the source files used to populate the spdx/license-list-data/text/
directory. The text in the test directory was submitted when the license XML files were submitted. They came from a lot of different sources. It would be up to the submitter to decide what the original (or official) source is. There are cases where there is no determinate original text version of the license.
If you don't see the same issue in the test directory, I would suspect an issue in the tool that generated the license list data.
The issue is present with each of the files listed in the referenced test file directory.
Thanks @jamin-aws-ospo for the analysis.
It looks like this is an issue with the source of the text files.
If we want to clean these up, we can just do a pull request on the source files.
To prevent future issues like this, we can add a check to the CI for this repo to check for unprintable characters.
Just as an FYI - the SPDX license matcher utility handles many of the unprintable characters by normalizing them to a space.
I believe the text files should be cleaned/purged of non-printing characters. I also believe that (like myself) others will use the text files as direct inputs/reference for other license based tooling, rather than using the existing license matching utility.
I found this issue because of differences between what different languages consider to be whitespace. Seems that not all languages consider non-breaking space to be part of the POSIX space regular expression character class.
I believe the text files should be cleaned/purged of non-printing characters. I also believe that (like myself) others will use the text files as direct inputs/reference for other license based tooling, rather than using the existing license matching utility.
I agree - I just checked the advice for adding the text files and it does mention removing some of the special characters. So changing the existing .txt files to match this advice seems entirely appropriate.
@swinslow - any concerns with this?
I found this issue because of differences between what different languages consider to be whitespace. Seems that not all languages consider non-breaking space to be part of the POSIX space regular expression character class.
@jamin-aws-ospo Makes sense. I had to apply custom coding to handle this (and several other non-printable characters) in the SPDX license matching tools used by the SPDX online tools.
I had to make those changes anyway since the text being compared to (as apposed to the license text) also contained these characters.
I essentially normalized the to sides of the text being compared before applying any regular expressions or token based matching algorithms.
It definitely slows down the match, but it solves the problem.
I essentially normalized the to sides of the text being compared before applying any regular expressions or token based matching algorithms.
I'm doing a similar normalization here, and comparing (what should be) the same normalization in two different languages. Which is how I found this issue.
I'm not sure I understand what the issue is. While in some cases the non-breaking space is probably an artifact of some copy-pasting, in other cases it's completely corret. For example, in the cases where the text is in French and non-breaking space is next to guillemets. Since we don't "ASCII-fy" the text, and we leave curly quotes, different dashes, etc., I don't see why these spaces pose an issue.
I'd be more interested to learn which environments do not consider this character as a space, since this is definitely a bug.
@swinslow - any concerns with this?
@goneall No problem from my side if someone wants to submit a PR to replace non-breaking spaces with regular space characters.
To @zvr's point, downstream tools are going to have to take into account the fact that license texts are messy -- even the rare few that in fact have a license-steward-maintained "canonical" plain text version. So while we can clean up the plain-text versions used for the test text files for license-list-XML, anyone using them should be aware of what they are and what they aren't, and will likely have to handle special characters appropriately when encountering them in practice.
I'd be more interested to learn which environments do not consider this character as a space, since this is definitely a bug.
Ruby's regex \s+
does not match it (which is how I found it), while [[:space:]]+
does.
To be fair, this is documented: copying from https://ruby-doc.org/3.2.1/Regexp.html
/\s/
- character class: /[ \t\r\n\f\v]/
@zvr What do you think about updating text files to normalize the spaces?
If we do normalize the spaces, we should also update the CI to check for these characters in the test files.
Created a pull request that removes these characters: #1901
As I wrote above
Since we don't "ASCII-fy" the text, and we leave curly quotes, different dashes, etc., I don't see why these spaces pose an issue.
Why treat these as special?
My pitch from the sidelines would be because they are invisible to even a technical user until they discover the bugs that these invisible characters lead to. They provide no value, just cost.
Fully agreed that valuable characters, for example non-breaking spaces next to guillemets, should be kept. If any of those are being removed in this PR, please call them out.
I've noticed that non-breaking spaces (and other non-printable characters) are present in several of the text versions of licenses.
non-breaking space (
M-BM-
as output bycat -A
, or hex 'A0'):Another, unknown non-printing sequence (
M-bM-^@M-
as output bycat -A
) is present in: