non-breaking spaces present in several text versions of licenses

jamin-aws-ospo commented 1 year ago

I've noticed that non-breaking spaces (and other non-printable characters) are present in several of the text versions of licenses.

non-breaking space (M-BM- as output by cat -A, or hex 'A0'):

$ grep -lP '\xa0' spdx/license-list-data/text/*
spdx/license-list-data/text/APL-1.0.txt
spdx/license-list-data/text/GD.txt
spdx/license-list-data/text/IBM-pibs.txt
spdx/license-list-data/text/LAL-1.2.txt
spdx/license-list-data/text/LiLiQ-P-1.1.txt
spdx/license-list-data/text/LiLiQ-R-1.1.txt
spdx/license-list-data/text/LiLiQ-Rplus-1.1.txt
spdx/license-list-data/text/NCGL-UK-2.0.txt
spdx/license-list-data/text/NPL-1.1.txt
spdx/license-list-data/text/OCCT-PL.txt
spdx/license-list-data/text/OGL-UK-1.0.txt
spdx/license-list-data/text/SHL-2.0.txt
spdx/license-list-data/text/SHL-2.1.txt
spdx/license-list-data/text/SISSL.txt
spdx/license-list-data/text/W3C-19980720.txt

Another, unknown non-printing sequence (M-bM-^@M- as output by cat -A) is present in:

spdx/license-list-data/text/APL-1.0.txt
spdx/license-list-data/text/OGL-UK-1.0.txt
spdx/license-list-data/text/OSET-PL-2.1.txt

goneall commented 1 year ago

Are these due to some kind of encoding issue? These are copied from the test license text which are typically copy/pasted from the original license.

@jamin-aws-ospo Can you check of the origin test licenses have the same issues? That would help isolate if this is a source problem or a tooling problem.

jamin-aws-ospo commented 1 year ago

I'd be happy to check them, but I'm not sure what source(s) you consider to be the original.

For instance, what is the original (or official source of the) APL-1.0?

goneall commented 1 year ago

I'd be happy to check them, but I'm not sure what source(s) you consider to be the original.

I would just look at the files in this test file directory and see if you're seeing the same issue: https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator I'm not sure I would actually call these file "original", but they are the source files used to populate the spdx/license-list-data/text/ directory. The text in the test directory was submitted when the license XML files were submitted. They came from a lot of different sources. It would be up to the submitter to decide what the original (or official) source is. There are cases where there is no determinate original text version of the license.

If you don't see the same issue in the test directory, I would suspect an issue in the tool that generated the license list data.

jamin-aws-ospo commented 1 year ago

The issue is present with each of the files listed in the referenced test file directory.

goneall commented 1 year ago

Thanks @jamin-aws-ospo for the analysis.

It looks like this is an issue with the source of the text files.

If we want to clean these up, we can just do a pull request on the source files.

To prevent future issues like this, we can add a check to the CI for this repo to check for unprintable characters.

Just as an FYI - the SPDX license matcher utility handles many of the unprintable characters by normalizing them to a space.

jamin-aws-ospo commented 1 year ago

I believe the text files should be cleaned/purged of non-printing characters. I also believe that (like myself) others will use the text files as direct inputs/reference for other license based tooling, rather than using the existing license matching utility.

jamin-aws-ospo commented 1 year ago

I found this issue because of differences between what different languages consider to be whitespace. Seems that not all languages consider non-breaking space to be part of the POSIX space regular expression character class.

goneall commented 1 year ago

I believe the text files should be cleaned/purged of non-printing characters. I also believe that (like myself) others will use the text files as direct inputs/reference for other license based tooling, rather than using the existing license matching utility.

I agree - I just checked the advice for adding the text files and it does mention removing some of the special characters. So changing the existing .txt files to match this advice seems entirely appropriate.

@swinslow - any concerns with this?

goneall commented 1 year ago

I found this issue because of differences between what different languages consider to be whitespace. Seems that not all languages consider non-breaking space to be part of the POSIX space regular expression character class.

@jamin-aws-ospo Makes sense. I had to apply custom coding to handle this (and several other non-printable characters) in the SPDX license matching tools used by the SPDX online tools.

I had to make those changes anyway since the text being compared to (as apposed to the license text) also contained these characters.

I essentially normalized the to sides of the text being compared before applying any regular expressions or token based matching algorithms.

It definitely slows down the match, but it solves the problem.

jamin-aws-ospo commented 1 year ago

I essentially normalized the to sides of the text being compared before applying any regular expressions or token based matching algorithms.

I'm doing a similar normalization here, and comparing (what should be) the same normalization in two different languages. Which is how I found this issue.

zvr commented 1 year ago

I'm not sure I understand what the issue is. While in some cases the non-breaking space is probably an artifact of some copy-pasting, in other cases it's completely corret. For example, in the cases where the text is in French and non-breaking space is next to guillemets. Since we don't "ASCII-fy" the text, and we leave curly quotes, different dashes, etc., I don't see why these spaces pose an issue.

I'd be more interested to learn which environments do not consider this character as a space, since this is definitely a bug.

swinslow commented 1 year ago

@swinslow - any concerns with this?

@goneall No problem from my side if someone wants to submit a PR to replace non-breaking spaces with regular space characters.

To @zvr's point, downstream tools are going to have to take into account the fact that license texts are messy -- even the rare few that in fact have a license-steward-maintained "canonical" plain text version. So while we can clean up the plain-text versions used for the test text files for license-list-XML, anyone using them should be aware of what they are and what they aren't, and will likely have to handle special characters appropriately when encountering them in practice.

jamin-aws-ospo commented 1 year ago

I'd be more interested to learn which environments do not consider this character as a space, since this is definitely a bug.

Ruby's regex \s+ does not match it (which is how I found it), while [[:space:]]+ does.

zvr commented 1 year ago

To be fair, this is documented: copying from https://ruby-doc.org/3.2.1/Regexp.html

/\s/ - character class: /[ \t\r\n\f\v]/

goneall commented 1 year ago

@zvr What do you think about updating text files to normalize the spaces?

If we do normalize the spaces, we should also update the CI to check for these characters in the test files.

jamin-aws-ospo commented 1 year ago

Created a pull request that removes these characters: #1901

zvr commented 1 year ago

As I wrote above

Since we don't "ASCII-fy" the text, and we leave curly quotes, different dashes, etc., I don't see why these spaces pose an issue.

Why treat these as special?

hyandell commented 1 year ago

My pitch from the sidelines would be because they are invisible to even a technical user until they discover the bugs that these invisible characters lead to. They provide no value, just cost.

Fully agreed that valuable characters, for example non-breaking spaces next to guillemets, should be kept. If any of those are being removed in this PR, please call them out.

spdx / license-list-XML

non-breaking spaces present in several text versions of licenses #1893