JSP parser should recognize and ignore JSP comments and Java scriplets

ghost commented 21 years ago

Reported by @cristianbogdan on 1 Jun 2003 23:22 UTC Java scriplets are of no significance for the JSP parser, but the trouble is, they may contain stuff like

<%

//

%>

which would certainly fool the parser into believing that there is a real tag there

Migrated-From: http://trac.makumba.org/ticket/465

ghost commented 21 years ago

Comment by @frederikhabils on 2 Jun 2003 08:20 UTC I'm also curious: how does the parser handle jsp comments?

<%--

... ... --%>

ghost commented 21 years ago

Comment by farry@best.eu.org on 2 Jun 2003 08:37 UTC Do you want the parser to completely ignore java scriptlets, or only comments in java scriptlets? For something I'm doing now, I need mak-tags inside a java scriplet to fill an array with some values a got from a mak:list.

So, is the title of this bug correct? or should it be changed to "JSP parser should recognize and ignore comments in Java scriplets"?

ghost commented 21 years ago

Comment by @stefanb on 2 Jun 2003 08:53 UTC Farry, mak tag in a scriptlet is impossible. You are probably talking about <% if(...) {%><%}%> which is actually a mak tag between two scriptlets and thus won't be ignored. Or you have some other example?

as with all tag matching, the algoritm should not be greedy, but just find the nearest end tag

2 more patterns to ignore: <% / / %> <% /* / %>

ghost commented 21 years ago

Comment by farry@best.eu.org on 2 Jun 2003 09:27 UTC Ok, probably I misunderstood due to my limited "jargon" knowledge... what I meant was:

I guess this is called a script and not a scriplet, right? I was afraid that this bug was aiming at ignoring mak:tags in just pieces of code

sorry for misunderstanding!

ghost commented 21 years ago

Comment by @cristianbogdan on 2 Jun 2003 10:41 UTC in response to comment 1 the parser simply replaces comments with whitespace (like any parser does with comments) see org.makumba.util.JspParseData, org.makumba.util.SyntaxPoint

about comment 4, scriplets are <%....%>, they are different from javascript

ghost commented 21 years ago

Comment by @stefanb on 2 Jun 2003 14:18 UTC khm.

Farry, mak tag in a scriptlet is impossible. Realizing what i said, it is probably safe to treat all <% scriptlets %> exaclty as comments, ignoring them in the page analysis.

Scriptlet is anything that matches a pattern "<%((^%)|(%[^>]))*%>" (it matches also all <%-- comments --%>)

ghost commented 21 years ago

Comment by @cristianbogdan on 2 Jun 2003 14:57 UTC comment 6:

Scriptlet is anything that matches a pattern "<%((^%)|(%[(it matches also all <%-- comments --%>)

it also matches system tags (<%@... %>) so it should be %[^@](^]))%>")((^%)|(%[it may not be that simple. if you look at the present comment pattern, it's pretty messy <%--(^-)%>

but)|(-[^-])|(--[^%])|(--%[^>]))*--%>

which probably leads to the huge stack space eaten by long comments (bug 447 comment 6). i'd appreciate some help with the regexps. to work on them,

checkout branch noTrLoops, file org/makumba/util/JspParseData.java
to see what's being wiped out enable logging org.makumba.syntaxpoint.comment.level=FINE
define systemproperty makumba.jspparse.debug to get more parser logging info
if you have a karamba in the same budnle, do ant getLocalMakumba to get it compiled and loaded in karamba, then reload (via parade or standalone)

note that many chars used in regexps need be escaped by \ due to java conventions, and by another \ (that is, \ in java) due to regexp conventions

so a regexp \ is \\, a regexp " is \\" :) when i got bored of this, i made a function that generates regexps for JSP tag attributes...

of course, help for all this is available http://java.sun.com/j2se/1.4.1/docs/api/java/util/regex/Pattern.html

ghost commented 21 years ago

Comment by @stefanb on 2 Jun 2003 15:20 UTC the pattern that will only match <% scriptlets %> and neither of <%-- comments --%> or <%@ stuff %> should be something like: "%(([^-@])|(-[^-]))?((^%)|(%[^]))*(([^-])|([^-]-))?%>" although it is probably quite heavy in terms of resources needed.

but still <%@ ... %> can also be ignored when computing what fields will be needed in the page. no?

ghost commented 21 years ago

Comment by @stefanb on 2 Jun 2003 15:25 UTC

but still <%@ ... %> can also be ignored when computing what fields will be needed in the page. no? Ah, no, there is <%@include ... %> but includes can be matched seperately, using another (light!) pattern.

ghost commented 21 years ago

Comment by @cristianbogdan on 2 Jun 2003 18:34 UTC <%@ ... %> is also needed for @taglib... it is currently in use, please check the source...

ghost commented 21 years ago

Comment by @frederikhabils on 3 Jun 2003 23:03 UTC about the jsp comments, I figured out that tomcat is treating the following as NOT closing a comment: ---%>

example: (https://parade.best.eu.org/fred-k/test/bug465.jsp)

<%-- inside the comment ---%> still inside the comment --%> outside the comment

in other words , the end of a comment must be [^-]--%> (if I understood my first reading of regex)

I tried to read the jsp specification at java.sun.com, but didn't understand their notation... so I don't know what the standard prescribes.

ghost commented 21 years ago

Comment by @frederikhabils on 3 Jun 2003 23:18 UTC i know about as much regex as i know greek by now, so that's not good, but anyway, I'm sorta wondering how the regex in comment 8, for

<% expression; %>

would perform on this extreme case (and variants):

<% String s = "<% smile %>"; %>

ghost commented 21 years ago

Comment by @stefanb on 4 Jun 2003 02:09 UTC updated test page a bit https://parade.best.eu.org/fred-k/test/bug465.jsp and discovered that any of: <%-- comment ---%> <%-- comment ----%> <%-- comment -------%> throws: org.apache.jasper.compiler.ParseException: Unterminated <%-- tag while all of: <%-- comment +--%> <%-- ok comment <%--%> <%--+-- comment --%> <%---- comment --%> are perfectly ok

JSP spec can be downloaded from http://java.sun.com/products/jsp/download.html (unfortunately just in pdf form, so i can't reference it dorectly with a hyper link)

Quoting JavaServer Pages 1.2 Specification Chapter: JSP.2.5.2 JSP Comments

A JSP comment is of the form <%-- anything but a closing --%> ... --%> The body of the content is ignored completely. Comments are useful for documentation but also are used to

ghost commented 21 years ago

Comment by @cristianbogdan on 4 Jun 2003 11:29 UTC personally i'm satisfied with what we have now (except maybe for the fact that scriplets are not ingored), so i'd appreciate help with improving the JspParseData in this and other directions

the stack thing (comment 7) seems to be a bit of a problem (see org.makumba.util.JspParseData)

although i did nothing to the comment pattern (except moving it in the source, see diff between 1.1.2.3 and 1.1.2.4...), event/welcome.jsp now seems to throw StackOverflowError no matter how large -Xss is (i think extremely large values for -Xss are ignored anyway)

removing the long comment from events/welcome.jsp makes it OK again. strangely enough, after adding the comment back, it still works :)... (i think it doesn't work at the moment in mak-test-k, but it works in my workplace, http://tequila.best.eu.org:11666)

so this seems to be hard to control, although the cause is most probably related to the call stack

one obvious solution is to remove the long comments. however, if a common pattern is used for comments and scriplets, it's hard to remove long scriplets... though longish comments are more probable than very long scriplets...

another solution is not to use a pattern (but plain non-recursive java and a little java.util.Stack if needed) and remove them "by hand"...

if you want to stick to regex, using reluctant/greedy/possessive quantifiers might bring a solution (e.g. a simpler pattern for the comment/scriplet, which would make it result in fewer recursive calls, => eating less stack)

about comment 11 and comment 13: i think that whatever tomcat does not accept should not be considered. tomcat built-in JSP analysis will take place before the makumba JSP analysis anyway

about comment 12: you seem to be learning regex faster than Greek, which contributes to explain why there are so many Perl fanatics and so few non-Greek- speakers with Greek girflriends :))

also about comment 12: indeed, Java string constants in scriplets need be ignored before starting to look for the end of the scriplet. luckily Java accepts only " string " and not ' string ' so parsing is easier than in the case of JSP attributes (see bug 458)

about comment 13: Char - ( Char

ghost commented 21 years ago

Comment by @stefanb on 4 Jun 2003 12:19 UTC classes/org/makumba/util/JspParseData.java defines JspCommentPattern as "<%--([since there must be no "-" immediately before "--%>" to be a valid comment closer, it should IMO be like: "<%--(^---%>" but)|(-[^-])|(--[^%])|(--%[^>]))[^-]--%>" ^^^^-new as Fred also suggested in comment 11

If performance is an issue, we might want to try some ANTLR based JSP parser, but i haven't found any yet. http://www.antlr.org/showcase.html claims that BEA web logic is using antlr for jsp parsing.

We could write one, starting eg from html or some other jsp-like language: http://www.antlr.org/resources.html

ghost commented 21 years ago

Comment by @frederikhabils on 13 Jun 2003 14:51 UTC [

I found that comment 12 can probably be ignored

<%^!@- will start a scriptlet (or is it [?) %> will end a scriptlet you need to escape <% as <\% in order to print <% in html you need to escape %> in order to use %> within the scriptlet

I've concluded that from http://javaboutique.internet.com/tutorials/JSP/part11/page03.html http://javaboutique.internet.com/tutorials/JSP/part11/page04.html

-- B

About jsp parsers... I found sth called GNUJSP mentioned here and there http://www.google.be/search?q=GNUJSP

They have open source cvs at http://www.gjt.org/servlets/JCVSlet/list/gjt/org/gjt/jsp

dunno if that will help us, it doesn't seem to be following recent updates to jsp standard.

-- [C]

finally... should we also care about parsing the XML style of the JSP syntax ?

ghost commented 21 years ago

Comment by @cristianbogdan on 13 Jun 2003 16:12 UTC comment 16:

you need to escape <% as <\% in order to print <% in html you need to escape %> in order to use %> within the scriptlet

that's good. that means that the whole thing is designed for easy identification of comments and scriplets, no need for regexps. see http://tequila.best.eu.org:11666/test/comment.jsp

having learned this, i think that uncommenting (and similarly, unscripletting) in Java is pretty simple (but needs testing :). the method below can replace the SourceSyntaxPoints.unComment method

void takeOut(StringBuffer sourceCode, String start, String end, String syntaxPointName) { int n, m=-1, i; while(true){ n= sourceCode.indexOf(start, m+1); if(n==-1) break; m= sourceCode.indexOf(end, n+start.length()); if(m==-1) throw new RuntimeException("closure missing, how did this pass tomcat's parser???"); m+=end.length(); for(i=n; i<m; i++)sourceCode.setCharAt(i, ' '); // add a syntax point pair for the comment or the scriplet, with the type indicated by syntaxPointType... etc } }

// of course first you have to take out comments takeOut(jsp, "<%--", "--%>", "jspComment") // then you take out scriplets takeOut(jsp, "<%", "%>", "jspScriplet");

About jsp parsers... I found sth called GNUJSP mentioned here and there http://www.google.be/search?q=GNUJSP

yes, can be of use, the syntax of scriplets and comments didn't change much i think

finally... should we also care about parsing the XML style of the JSP syntax ?

not for now... i mean, i've never seen it used in pages, not sure even if the tomcat parser recognizes it.

once the page is in well-formed XML, an XML parser is much easier to write (especially SAX).

usually the XML form is only available to the page validator, which only runs if the page needs compilation => useless for makumba (because makumba needs to run page analysis after every webapp restart).

see https://parade.best.eu.org/tomcat-docs/servletapi/javax/servlet/jsp/tagext/TagLibraryValidator. especially the validate() method

also, the page validator does not offer a way to find out which is the source file of the page being validated, so there's no way for makumba to associate the analysis done at validation time to the page that runs at runtime... so they support validation, not analysis, caching of analysis, etc.

ghost commented 21 years ago

Comment by @frederikhabils on 15 Jun 2003 22:42 UTC haven't had time yet to dig in the source code, but...

comment 17:

// of course first you have to take out comments takeOut(jsp, "<%--", "--%>", "jspComment") // then you take out scriplets takeOut(jsp, "<%", "%>", "jspScriplet");

while comment 9 and comment 10 seem to suggest that system tags like <%@ ... %> are needed for some reason.

These would be taken out by a call

// then you take out scriplets takeOut(jsp, "<%", "%>", "jspScriplet");

is that acceptable?

OTOH, <%! declaration %> and <%= expression %> are safe to erase, just as true <% scriptlet %>

ghost commented 21 years ago

Comment by @cristianbogdan on 16 Jun 2003 08:21 UTC comment 18: correct, anything starting with <%@ should not be ignored you can do that e.g. with a special parameter to takeOut...

ghost commented 21 years ago

Comment by @frederikhabils on 17 Jun 2003 22:46 UTC based on comment 17 & 19, I created a function

void takeOut(StringBuffer sourceCode, String start, String[ startNotFollowedBy, String end, String endNotPrecededBy, String syntaxPointType)

in /util/SourceSyntaxPoints.java version 1.1.2.2

which can be called as (util.JspParseData.java version 1.1.2.11)

StringBuffer uncommentedContent= new StringBuffer(content); String[commentEndNotPrecededBy = { "-" }; syntaxPoints.takeOut(uncommentedContent, "<%--", null, "--%>", commentEndNotPrecededBy , "jspComment"); String scriptletStartNotFollowedBy = { "@" }; syntaxPoints.takeOut(uncommentedContent, "<%", scriptletStartNotFollowedBy, "%>", null, "jspScriplet"); content = uncommentedContent.toString();

I tested it, and it works.

However, then I had a revelation about reluctant regex, thinking that these 2 simple (?) reluctant patterns should work:

(util.JspParseData.java version 1.1.2.12) JspCommentPattern = Pattern.compile("<%--.?[Pattern.DOTALL); JspScriptletPattern= Pattern.compile("%[^@](^-]--%",).?%>", Pattern.DOTALL); ... content= syntaxPoints.unComment(content, JspCommentPattern, "JspComment"); content= syntaxPoints.unComment(content,JspScriptletPattern, "JspScriptlet");

I tested it too, and it works also...

I tested in https://parade.best.eu.org/fred-k/test/bug465a.jsp with a comment of about 35KB long, and there are no problems

-- So, although it took 3 hours to program for solution 1, I guess solution 2 is much cleaner code (15 minutes work) :-)

-- All changes (SourceSyntaxPoints + JspParseData) are committed along the noTrLoops branch. makumba 'development' tested in fred-k.

We could consider replacing the (tomcat-specific?) jsp-comment pattern <%--.? [^-]--%> by the standard <%--.?--%> and throw a warning/error when finding -- -%> ?

Shall we close this bug?

ghost commented 21 years ago

Comment by @frederikhabils on 17 Jun 2003 22:50 UTC I could add that I got my info about reluctant regex on this page http://java.sun.com/docs/books/tutorial/extra/regex/quant.html section: Differences Among Greedy, Reluctant, and Possessive Quantifiers

... part of the ever-useful Java Tutorial http://java.sun.com/docs/books/tutorial/

ghost commented 21 years ago

Comment by @cristianbogdan on 18 Jun 2003 09:28 UTC excellent work Fred!

i think we should close this.

could you also copy it in the refactorTaglib branch? just checkout, copy there and commit, so we don't forget about this later on. in general, i think changes to 0.5.8. should be copied to 0.5.9 as well. then we will decide which of them to join in the main trunk. i know this may look awkward, but i suggest to do it for now, supposing there will be only a few changes to 0.5.8.

also, feel free to tag a new version

ant -Dtag=makumba-0_5_8_7 tagJar then "copyLocalMakumba" in fred-k will show that version

or do the tagging in parade/cristi/makumba (after cvs update), then use install-0.5.8

ghost commented 21 years ago

Comment by @frederikhabils on 18 Jun 2003 20:22 UTC

I've totally rearranged (without any changes to actual code) JspParseData, into what I believe is a more logic ordering of class's fields, methods, inner classes. Committed 1.1.2.13. http://cvs.makumba.org/cgi- bin/cvsweb.cgi/makumba/classes/org/makumba/util/Attic/JspParseData.java
I've done the tag command to create 0.5.8.7 in fred-m / fred-k.
copied changes (SourceSyntaxPoints + JspParseData) into the refactorTaglib (0.5.9) branch. CHANGELOG changed (mmm, only in 0.5.8 branch for now, forgot to copy)

Marking bug as Resolved.

makumba / makumba

JSP parser should recognize and ignore JSP comments and Java scriplets #221