tksk / jing-trang

Automatically exported from code.google.com/p/jing-trang

Optimize multiple externalRefs of same URI #118

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

See http://tech.groups.yahoo.com/group/rng-users/message/1218

Validating against a relatively complicated schema that spans several files and 
uses externalRefs and includes is about an order of magnitude slower than 
validating against the same schema after it has been run through jing's 
"simplification" process. The schema and an example XML document for validation 
(waveform.xml) are attached.

Here are the times for validating against the simplified schema 
(sod_simplified.rng, 4.9 sec) and the multifile schema (sod.rng, 48 sec), as 
well as for the simplification step itself (53 sec):

$ time java -Xmx512m -jar ../jing-20091111/bin/jing.jar sod.rng waveform.xml

real    0m48.760s
user    0m45.783s
sys     0m5.931s

$ time java -Xmx512m -jar ../jing-20091111/bin/jing.jar -s sod.rng > sod_simplified.rng

real    0m53.418s
user    0m50.634s
sys     0m6.148s

$ time java -Xmx512m -jar ../jing-20091111/bin/jing.jar sod_simplified.rng waveform.xml

real    0m4.913s
user    0m5.584s
sys     0m0.329s

Also, there are 8803 lines in the original multifile schema. The "simplified" 
schema has 865493 lines, an increase of roughly a factor of 100.

Looking in the simplified schema, it appears that every externalRef is being 
expanded in place, even if the same file is used multiple times. For example, 
timeInterval is a small element, and appears in two externalRefs in the 
original schema, but appears 5820 times in the simplified schema.

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
jing 20091111
Mac OSX 10.6
java 1.6.0_20

Please provide any additional information below.

Original issue reported on code.google.com by pcrotw...@gmail.com on 18 Aug 2010 at 11:40


GoogleCodeExporter commented 9 years ago
I have created a simpler test. In the attached tar, there are two very similar 
schemas, each consisting of four files: A, B, C and D. The only difference is 
that in simpleTest1 the file C.rng does not <include> D.rng, whereas in 
simpleTest2 C.rng does <include> D.rng. 

In simpleTest1 the output looks as it should. 

In simpleTest2 the output includes two copies of C, with A having a <ref 
name="C"/> and B having a <ref name="C_2"/>. So, something about the <include> 
of D in C causes a duplication of C in the simplified output for B. 

Sorry, the ABCs are a little confusing, but at least the test case is smaller.

Original comment by pcrotw...@gmail.com on 18 Aug 2010 at 8:00


GoogleCodeExporter commented 9 years ago
In general the semantics of externalRef in RELAX NG are XML-level inclusion. 
It's a bit like an entity ref in XML, and unlike a normal ref in RELAX NG.  
Different occurrences of externalRef may result in semantically distinct 
patterns, because (a) referenced schemas may contain "free" refs and 
externalRefs may occur in distinct grammars, and (b) the ns attribute on the 
externalRef may differ.  So at the moment each occurrence of an externalRef 
results in a separate parse of the referenced URI. When you have multiple 
externalRefs to a single URI and that schema in turn has multiple externalRefs 
to a single URI, this results in a large internal representation (like entity 
refs in XML). Most of the time in your example is taken up with XML parsing.
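
As a small illustration of point (b), here is a sketch in which the same URI is referenced twice but yields two different patterns. The file name name.rng, the element names and the example.com namespaces are made up for this example and are not taken from the attached schema.

```xml
<!-- Hypothetical referenced file name.rng (note: no ns attribute of its own):

     <element name="name" xmlns="http://relaxng.org/ns/structure/1.0">
       <text/>
     </element>
-->
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="doc">
      <!-- The ns on each externalRef is transferred to the referenced
           element, so this one matches {http://example.com/a}name ... -->
      <externalRef href="name.rng" ns="http://example.com/a"/>
      <!-- ... and this one matches {http://example.com/b}name. -->
      <externalRef href="name.rng" ns="http://example.com/b"/>
    </element>
  </start>
</grammar>
```

Because the two occurrences mean different things, a cache keyed on the URI alone would not be safe here. Point (a) is analogous: a referenced file that is a bare pattern containing a ref picks up whatever define of that name exists in the grammar where the externalRef occurs.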

It would be possible to optimize this by

- noticing when an externally referenced schema does not make any outside 
ref/parentRefs
- caching externalRefs to the same URI made within the same grammar and with 
the same ns
This would involve adding suitable methods to 
com.thaiopensource.relaxng.parse.Scope.

In the meantime I would suggest wrapping each externalRef in a define, and then 
using a ref to that define.
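
A minimal sketch of that workaround, as I read the suggestion. The names (root, begin, end, the define "timeInterval" and the file timeInterval.rng) are hypothetical stand-ins rather than the actual contents of sod.rng, and the thread does not report how much this reduces the parse time or the simplified output.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="root">
      <!-- Each use site is now an ordinary ref instead of an externalRef. -->
      <element name="begin">
        <ref name="timeInterval"/>
      </element>
      <element name="end">
        <ref name="timeInterval"/>
      </element>
    </element>
  </start>

  <!-- The externalRef appears exactly once in this grammar, wrapped in a
       define, so the referenced file is pulled in a single time. -->
  <define name="timeInterval">
    <externalRef href="timeInterval.rng"/>
  </define>
</grammar>
```

The define only helps within the grammar that declares it; a file that is itself pulled in through several externalRefs elsewhere would need the same treatment at those sites.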

Original comment by jjc.jclark.com on 24 Aug 2010 at 4:17