spdx / LicenseListPublisher

Tool that generates license data found in the license-list-data repository from the license-list-XML source
Apache License 2.0

Alt match cannot be used inside copyrightText #162

Open ppisar opened 1 year ago

ppisar commented 1 year ago

When adding Latex2e-translated-notice in https://github.com/spdx/license-list-XML/pull/1932, I wanted an @copyright{} string in a copyrightText block to be an alternative to the Unicode © character. My motivation was to match both the Texinfo source and the rendered text.

It turned out that licenseListPublisher-2.2.8.jar was unable to handle it. It seems that the curly brackets are separated from the adjacent "copyright" word before the alternation match is performed.

Here is a minimal reproducer:

$ cat src/test.xml
<?xml version="1.0" encoding="UTF-8"?>
<SPDXLicenseCollection xmlns="http://www.spdx.org/license">
   <license isOsiApproved="false" licenseId="test"
   name="test alt match with curly brackets" listVersionAdded="0">
     <text>
       <copyrightText>
         <p>
           before <alt name="symbol" match="@copyright\{\}">@copyright{}</alt> after
           <!--before @copyright{} after-->
         </p>
       </copyrightText>
     </text>
   </license>
</SPDXLicenseCollection>

$ cat test/simpleTestForGenerator/test.txt 
before @copyright{} after

$ ./test-one-license test
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
Difference found comparing to test file: Unable to find the text ' after  ";match=".{0,5000}">>
' following a variable rule 'symbol' starting at line #1 column #20 ""

If the alternation is placed outside the copyrightText block, it works.

Possibly related issues: #87, #100.

goneall commented 1 year ago

Thanks @ppisar for reporting this along with the details on the publisher behaviour.

I'll take a look at a solution after the OSSNA Summit (quite busy with SPDX 3.0 stuff).

goneall commented 1 year ago

It looks like this is an issue with the <copyrightText> tag generating an alt/var tag around the alt/var tag in the XML.

Here's the template file generated:

<<var;name="copyright";original="before <<var;name="symbol";original="@copyright{}";match="@copyright\{\}">> after  ";match=".{0,5000}">>

@ppisar - can you try omitting the <copyrightText> tag and see if it works?
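To make the nesting in the generated template easier to see, here is a minimal Python sketch that scans a template string for a `<<` rule opening while another rule is still open. The function name `find_nested_rules` is hypothetical and this is not the publisher's own parser; it only assumes that rules are delimited by `<<` and `>>` as in the template above.

```python
def find_nested_rules(template: str) -> list[int]:
    """Return the offset of every '<<' that opens inside an unclosed rule."""
    nested = []
    depth = 0
    i = 0
    while i < len(template) - 1:
        pair = template[i:i + 2]
        if pair == "<<":
            if depth > 0:
                nested.append(i)  # a rule opened inside another rule
            depth += 1
            i += 2
        elif pair == ">>":
            depth = max(depth - 1, 0)
            i += 2
        else:
            i += 1
    return nested


# The template generated for the reproducer in this issue.
TEMPLATE = (
    '<<var;name="copyright";original="before '
    '<<var;name="symbol";original="@copyright{}";match="@copyright\\{\\}">>'
    ' after  ";match=".{0,5000}">>'
)

for offset in find_nested_rules(TEMPLATE):
    print(f"nested rule at offset {offset}: {TEMPLATE[offset:offset + 20]}")
```

Run against the template above, the scan reports the inner "symbol" rule, which sits inside the `original` value of the outer "copyright" rule.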

goneall commented 1 year ago

Note: We should check for this situation in the publisher and report an appropriate error rather than generating a template which won't work.
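A check of this kind could look roughly like the following Python sketch, which lists every alt element nested inside a copyrightText element. It is an illustration only: the publisher itself is a Java tool, the helper name `alts_inside_copyright_text` is hypothetical, and the only things taken from this issue are the element names and the namespace URI from the reproducer.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the reproducer's SPDXLicenseCollection element.
SPDX_NS = {"spdx": "http://www.spdx.org/license"}


def alts_inside_copyright_text(xml_source: str) -> list[str]:
    """Return the name attribute of each alt nested in a copyrightText."""
    root = ET.fromstring(xml_source)
    names = []
    for block in root.iterfind(".//spdx:copyrightText", SPDX_NS):
        for alt in block.iterfind(".//spdx:alt", SPDX_NS):
            names.append(alt.get("name", ""))
    return names


# Trimmed version of the reproducer (XML declaration omitted).
SAMPLE = r"""<SPDXLicenseCollection xmlns="http://www.spdx.org/license">
  <license isOsiApproved="false" licenseId="test" name="test" listVersionAdded="0">
    <text>
      <copyrightText>
        <p>before <alt name="symbol" match="@copyright\{\}">@copyright{}</alt> after</p>
      </copyrightText>
    </text>
  </license>
</SPDXLicenseCollection>"""

for name in alts_inside_copyright_text(SAMPLE):
    print(f"error: alt '{name}' is nested inside copyrightText")
```

A non-empty result would be the point at which to report an error instead of emitting a template.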

ppisar commented 1 year ago

I've already written in my original report that the alternation works outside copyrightText tree.

goneall commented 1 year ago

> I've already written in my original report that the alternation works outside copyrightText tree.

Thanks @ppisar - I missed that in the original report.

The issue is a bit more general - Alt matches just won't work inside copyrightText. I'm going to update the title to reflect this.

This would be a very difficult issue to fix with the current design, so it may be a while before this is actually fixed. As a workaround, we can avoid alt tags inside copyrightText. It turns out there is already a general matching pattern generated for the copyrightText, so matches should still work.
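As a rough illustration of why the workaround should still match, the sketch below applies the general pattern `.{0,5000}` from the generated template to both spellings of the reproducer line. It assumes the pattern behaves like an ordinary regular expression over the copyright text; the real matcher may normalize text differently, and the helper name is hypothetical.

```python
import re

# The match pattern the publisher generated for copyrightText in this issue.
COPYRIGHT_RULE = re.compile(r".{0,5000}")


def matches_copyright_rule(text: str) -> bool:
    """True if the whole text is covered by the general copyright pattern."""
    return COPYRIGHT_RULE.fullmatch(text) is not None


candidates = [
    "before @copyright{} after",  # Texinfo source form
    "before © after",             # rendered form
]

for text in candidates:
    print(text, "->", matches_copyright_rule(text))
```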

ppisar commented 1 year ago

That's understandable. Then please update the documentation. If possible, also update the XML schema so that it does not accept alt inside copyrightText.

goneall commented 1 year ago

@ppisar I created https://github.com/spdx/license-list-XML/pull/1964 to update the documentation and https://github.com/spdx/license-list-XML/issues/1965 to track the request to update the schema.