rjatkins / owaspantisamy

Automatically exported from code.google.com/p/owaspantisamy
SAX-based scanner implementation for reduced memory usage & increased performance #16

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
As mentioned at https://lists.owasp.org/pipermail/owasp-antisamy/2008-June/000048.html, a SAX-based scanner implementation can offer performance advantages because the document is not held in memory (only a stack holding the operations is).

Usage is illustrated in the test case, which is also used for performance tests. Running SAX and DOM against the AntiSamy Google Code homepage with the default policy reports that SAX is 50% faster than DOM.

Original issue reported on code.google.com by larstemp...@gmail.com on 7 Jul 2008 at 11:25

GoogleCodeExporter commented 9 years ago
I will investigate your work and let you know what I come up with. We are happy to increase speed and memory efficiency, but before we integrate such a patch we have to threat-model this approach and review the code to confirm that it provides the same level of security.

Original comment by arshan.d...@gmail.com on 7 Jul 2008 at 2:10

GoogleCodeExporter commented 9 years ago
Sounds good. I have basically taken the existing code and translated it from DOM to SAX: instead of removing nodes from the tree, I simply do not copy the corresponding events from the input stream to the output stream.
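
The idea of "not copying events" can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class name `AllowlistFilter`, the helper `filter`, and the simple tag allowlist are all hypothetical, and a real AntiSamy policy distinguishes more actions than keep/remove. The sketch drops a disallowed element together with its content, and uses a stack of keep/drop decisions instead of a DOM tree.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: forwards SAX events only for allowed elements.
class AllowlistFilter extends XMLFilterImpl {
    private final Set<String> allowedTags;
    // One keep/drop decision per currently open element; this is the only state held.
    private final Deque<Boolean> kept = new ArrayDeque<>();

    AllowlistFilter(XMLReader parent, Set<String> allowedTags) {
        super(parent);
        this.allowedTags = allowedTags;
    }

    // True when we are at the top level or the innermost open element is kept.
    private boolean insideKept() {
        return kept.isEmpty() || kept.peek();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        // Everything nested inside a dropped element is dropped as well.
        boolean keep = insideKept() && allowedTags.contains(qName);
        kept.push(keep);
        if (keep) super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (kept.pop()) super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (insideKept()) super.characters(ch, start, len);
    }

    // Runs the filter over well-formed XML and returns a simplified serialization.
    // Attributes are omitted and text is not re-escaped, to keep the sketch short.
    static String filter(String xml, Set<String> allowedTags) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        AllowlistFilter filter = new AllowlistFilter(parser, allowedTags);
        StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int n) {
                out.append(ch, s, n);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<div><b>hi</b><script>alert(1)<b>x</b></script> there</div>";
        System.out.println(filter(in, Set.of("div", "b")));
    }
}
```

Memory use is bounded by the nesting depth of the input rather than its size, which is where the advantage over building a DOM tree comes from.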

If you have any questions, feel free to ask. I am also on the mailing list, which might provide a better forum for discussing coding questions.

I am attaching a complete integration of my code into AntiSamy: the DOMScanner will use the SAX filter as well, and there is now a SAXScanner that can be used directly by AntiSamy (but you will no longer be able to access the DOM result).

Original comment by larstemp...@gmail.com on 7 Jul 2008 at 2:21

GoogleCodeExporter commented 9 years ago
If there is anything I can do to help you with the review, please let me know. I would like to use AntiSamy for the Apache Sling project, but I depend on the SAX implementation because I want to add additional SAX filters for microformat extraction, link counting, spam protection, etc.
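
To illustrate why a SAX pipeline makes such add-ons easy, here is a rough sketch of a link-counting filter chained behind another filter. The class `LinkCountFilter` and its methods are hypothetical, and a plain pass-through `XMLFilterImpl` stands in for the sanitizing filter; nothing here is taken from the patch or from Sling.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: counts <a href="..."> elements while passing all events through.
class LinkCountFilter extends XMLFilterImpl {
    private int links = 0;

    LinkCountFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("a".equalsIgnoreCase(qName) && atts.getValue("href") != null) {
            links++;
        }
        // Always forward the event so further filters in the chain still see it.
        super.startElement(uri, local, qName, atts);
    }

    int getLinkCount() {
        return links;
    }

    // Builds the chain parser -> stand-in sanitizer -> link counter and runs it.
    static int countLinks(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse DOCTYPE declarations on untrusted input.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        // Assumption: a pass-through filter stands in for the sanitizing stage.
        XMLFilterImpl sanitizer = new XMLFilterImpl(parser);
        LinkCountFilter counter = new LinkCountFilter(sanitizer);
        counter.parse(new InputSource(new StringReader(xml)));
        return counter.getLinkCount();
    }

    public static void main(String[] args) throws Exception {
        String in = "<p><a href=\"x\">1</a><a>no</a><a href=\"y\">2</a></p>";
        System.out.println(countLinks(in));
    }
}
```

Because each stage only sees and forwards events, the counter runs in the same single pass as the sanitizer and only counts links that survived the earlier stage.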

Original comment by larstemp...@gmail.com on 22 Jul 2008 at 10:10

GoogleCodeExporter commented 9 years ago

Original comment by arshan.d...@gmail.com on 19 Nov 2008 at 7:15

GoogleCodeExporter commented 9 years ago

Original comment by arshan.d...@gmail.com on 3 Dec 2008 at 3:33

GoogleCodeExporter commented 9 years ago
Has this been added to the trunk and/or a release yet?

Original comment by manosbat...@gmail.com on 12 Feb 2009 at 4:26

GoogleCodeExporter commented 9 years ago
Lars, can you email me so we can work on this together? I would like to add this by version 1.4 (1.3 will hopefully be out this week). The patch needs some updates, and I am in a good position to make this switch soon.

Original comment by arshan.d...@gmail.com on 17 Mar 2009 at 2:34

GoogleCodeExporter commented 9 years ago
Changing the label to enhancement instead of defect, as that more accurately characterizes this change.

Original comment by li.jaso...@gmail.com on 17 Mar 2009 at 4:00

GoogleCodeExporter commented 9 years ago
I am very interested in seeing this integrated. Performance and memory efficiency are extremely important. Is there any word on this / 1.4?

Original comment by jason.cl...@gmail.com on 1 Feb 2010 at 7:30

GoogleCodeExporter commented 9 years ago
I've analyzed this patch, and I am all for making the change to SAX, but I would like it to be slightly more organized. I'm not suggesting the current patch isn't good; I just don't know what an organized SAX implementation would look like.

I wouldn't expect this in 1.4, sadly.

Original comment by arshan.d...@gmail.com on 8 Mar 2010 at 6:13

GoogleCodeExporter commented 9 years ago
Lars, you're missing some code from this patch, most notably the changes to AntiSamy.java (and possibly AntiSamyDOMScanner, I can't tell). I'm integrating this code next week, hook me up!

Original comment by arshan.d...@gmail.com on 21 May 2010 at 7:54

GoogleCodeExporter commented 9 years ago
After some work, this was added to the baseline. It will be in the next version (1.4.1) as an optional scanning method, as an alternative to the DOM-based one. Scans will still default to DOM, but an extra method signature will allow users to choose SAX-based scanning instead.

Original comment by arshan.d...@gmail.com on 1 Jun 2010 at 4:55