spdx / tools-python

A Python library to parse, validate and create SPDX documents.
http://spdx.org
Apache License 2.0

Question regarding license_expression_parser behavior #806

Open billie-alsup opened 1 month ago

billie-alsup commented 1 month ago

src/spdx_tools/spdx/parser/jsonlikedict/license_expression_parser.py uses Licensing().parse(expr) directly, rather than get_spdx_licensing().parse(expr) as used in parser/tagvalue/parser.py. The difference results in a different LicenseSymbol for GPL-2.0, e.g.

>>> from license_expression import Licensing
>>> Licensing().parse('GPL-2.0')
LicenseSymbol('GPL-2.0', is_exception=False)
>>> from license_expression import get_spdx_licensing
>>> get_spdx_licensing().parse('GPL-2.0')
LicenseSymbol('GPL-2.0-only', aliases=('GPL-2.0', 'GPL 2.0', 'LicenseRef-GPL-2.0'), is_exception=False)
>>> 

As you can see, GPL-2.0-only is the official name, and GPL-2.0 is an alias. However, when parsing directly with Licensing(), we get a GPL-2.0 node rather than a GPL-2.0-only node. This causes a problem later in validation, when GPL-2.0 comes back as an invalid symbol, e.g.

2024-06-18 16:28:31,476:WARNING:root: Unrecognized license reference: GPL-2.0. license_expression must only use IDs from the license list or extracted licensing info, but is: GPL-2.0
2024-06-18 16:28:31,476:WARNING:root: ValidationContext(spdx_id=None, parent_id='SPDXRef-base-files-Package-base-files', element_type=<SpdxElementType.LICENSE_EXPRESSION: 1>, full_element=LicenseSymbol('GPL-2.0', is_exception=False))

I'm wondering if this is expected behavior (and you do not wish to allow aliases), or if this is a bug. Should I filter my JSON file in advance to switch to GPL-2.0-only? Certainly GPL-2.0 should not be listed in the extracted_licensing_info section (as that would require changing it to LicenseRef-GPL-2.0 or similar), right?
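
If pre-filtering the JSON turns out to be the way to go, a minimal sketch of that workaround might look like the following. It uses only the standard library and is not part of tools-python. The alias table is hand-maintained and limited to the aliases that come up in this thread, the helper names (`normalize_expression`, `normalize_document`) are hypothetical, and the set of JSON keys treated as license fields is an assumption about SPDX 2.x JSON output that may need adjusting for a given document.

```python
import json
import re

# Alias -> canonical SPDX ID, copied from the get_spdx_licensing() output
# shown in this thread. Hand-maintained and deliberately incomplete.
ALIASES = {
    "GPL-2.0": "GPL-2.0-only",
    "GPL-3.0": "GPL-3.0-only",
    "LGPL-2.1+": "LGPL-2.1-or-later",
}

# Assumption: these SPDX 2.x JSON keys carry license expressions.
EXPRESSION_KEYS = {"licenseConcluded", "licenseDeclared"}
LIST_KEYS = {"licenseInfoFromFiles", "licenseInfoInFiles"}

# Split on whitespace and parentheses, keeping the separators so the
# expression can be reassembled unchanged apart from the replaced IDs.
_TOKEN = re.compile(r"(\s+|[()])")


def normalize_expression(expr):
    """Replace known aliases in a license expression (exact match only)."""
    return "".join(ALIASES.get(tok, tok) for tok in _TOKEN.split(expr))


def normalize_document(node):
    """Walk a parsed JSON document and rewrite license fields in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in EXPRESSION_KEYS and isinstance(value, str):
                node[key] = normalize_expression(value)
            elif key in LIST_KEYS and isinstance(value, list):
                node[key] = [
                    normalize_expression(v) if isinstance(v, str) else v
                    for v in value
                ]
            else:
                normalize_document(value)
    elif isinstance(node, list):
        for item in node:
            normalize_document(item)
    return node


if __name__ == "__main__":
    sample = json.loads(
        '{"packages": [{"name": "base-files", "licenseDeclared": "GPL-2.0"}]}'
    )
    print(json.dumps(normalize_document(sample)))
```

Matching is exact and case-sensitive, so identifiers such as GPL-2.0-only or NOASSERTION pass through untouched. This only papers over the difference on the input side; it does not change how the parser or validator treats aliases.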

billie-alsup commented 1 month ago

This is just one example. Validating SPDX JSON files generated by the OpenEmbedded project yields numerous errors where an alias was used for a license instead of the "official" license identifier. In addition to GPL-2.0, there are GPL-3.0 and LGPL-2.1+. Again, this results in incorrect validation errors.

>>> from license_expression import get_spdx_licensing
>>> get_spdx_licensing().parse('LGPL-2.1+')
LicenseSymbol('LGPL-2.1-or-later', aliases=('LGPL-2.1+',), is_exception=False)
>>> get_spdx_licensing().parse('GPL-3.0')
LicenseSymbol('GPL-3.0-only', aliases=('GPL-3.0', 'LicenseRef-gpl-3.0'), is_exception=False)
>>> 
>>> from license_expression import Licensing
>>> Licensing().parse('LGPL-2.1+')
LicenseSymbol('LGPL-2.1+', is_exception=False)
>>> Licensing().parse('GPL-3.0')
LicenseSymbol('GPL-3.0', is_exception=False)
>>> 

2024-06-20 11:33:51,315:WARNING:root: Unrecognized license reference: LGPL-2.1+. license_expression must only use IDs from the license list or extracted licensing info, but is: LGPL-2.1+

jspeed-meyers commented 3 weeks ago

I'm wondering if this is expected behavior (and you do not wish to allow aliases), or if this is a bug.

I'm curious too. My own naive thought is that it would be nice if the validation step accepted aliases :) That seems reasonable and consistent with the purpose of having aliases. But perhaps that creates bugs or corner cases that I am not aware of?