spdx / Spdx-Java-Library

Java library which implements the Java object model for SPDX and provides useful helper functions
Apache License 2.0
32 stars 33 forks source link

TemplateRegexMatcher.getStartRegex sometimes returns a regex that matches with an index before the license start #244

Open sdheh opened 2 weeks ago

sdheh commented 2 weeks ago

Version 1.1.11 Example 1 greedy regex after optional:

String licenseText = "ab cd text";
String licenseTemplate = "<<beginOptional>>cd<<endOptional>> <<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(\Qcd\E\s*)?(.{0,5000})\Qtext\E\s*
start index found: 0

but the start index should be 3.

Example 2 greedy regex at start:

String licenseText = "abtext";
String licenseTemplate = "<<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(.?{0,5000})\Qtext\E\s*
start index found: 1

but the start index should be 2. .?{0,5000} doesn't seem to work as expected. It is an unusual regex that some online regex websites say is invalid: https://regex101.com/r/l3810b/1, regexr.com/81kfo. https://www.freeformatter.com/java-regex-tester.html says the regular expression is valid.

I think maybe to fix this you could just offer a method for a regex to find the beginning of the non-optional part. Otherwise a changing the regular expressions in these two cases to something like the following could work

(?im)((\Qcd\E\s*)(.{0,5000})\Qtext\E\s*)|(\Qtext\E\s*)
(?im)\Qtext\E\s*

In the first case if there were multiple optional parts it would get even more complicated to do it correctly.

goneall commented 1 week ago

Now that I review this issue, I do tend to agree that this is an issue if the method is being used without following on with the template matcher.

@sdheh - It looks like you have a pretty good handle on approaches to fix this. Can you create a pull request?