nikku / node-xsd-schema-validator

A schema (XSD) validator for NodeJS
https://www.npmjs.com/package/xsd-schema-validator
MIT License

Big xml processing #7

Closed kovalav closed 7 years ago

kovalav commented 8 years ago

First, thank you for the good work.

I have an XSD file for certain XML files and want to validate them. I spent some time testing the module with big XML files and found several problems with the validation process:

1) For an XML file of about 160MB, Java crashed with an "Out of memory (Heap size)" error. To prevent this crash, it is necessary to add an option to the validator spawn call that increases the Java heap size (for example, -Xmx1280m for a 160MB XML file).

2) Under Windows, Java does not use the correct encoding for stdin or the XSD file (in my case the PC has a Greek locale) and the parser throws an error, so I added the -Dfile.encoding=UTF-8 parameter to the Java start command:

var validator = spawn(JAVA, [ '-Dfile.encoding=UTF-8', '-Xmx1280m', '-classpath', . . . .
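A minimal sketch of how the two JVM flags mentioned above could be assembled before spawning Java. The helper names `buildJavaArgs` and `spawnValidator`, the option names, and the classpath and main class values are all hypothetical placeholders, since the original call is elided after '-classpath'; only the `-Xmx` and `-Dfile.encoding` flags come from the comment itself.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: builds the JVM argument list.
// `classpath` and `mainClass` are placeholders, not the library's real values.
function buildJavaArgs(options) {
  const args = [];

  // Force the JVM default encoding (helps on non-UTF-8 Windows locales).
  if (options.encoding) {
    args.push('-Dfile.encoding=' + options.encoding);
  }

  // Raise the maximum heap size, e.g. '1280m'.
  if (options.maxHeap) {
    args.push('-Xmx' + options.maxHeap);
  }

  args.push('-classpath', options.classpath, options.mainClass);
  return args;
}

// Hypothetical wrapper: starts the JVM with the assembled arguments.
function spawnValidator(javaBin, options) {
  return spawn(javaBin, buildJavaArgs(options));
}

const exampleArgs = buildJavaArgs({
  encoding: 'UTF-8',
  maxHeap: '1280m',
  classpath: './support',      // assumption, for illustration only
  mainClass: 'XMLValidator'    // assumption, for illustration only
});
console.log(exampleArgs.join(' '));
```

Keeping the flags optional means the heap size only has to be raised for inputs that actually need it.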

After this, I looked at the XMLValidator.java class, because a SAX parser should not use this much memory.

After analyzing it, I made some changes to the validator.js and XMLValidator.java files. 1) I used a pipe to transfer the source file to the Java stdin (validator.js); lines 126-127 changed to:

xml.pipe(stdin);
xml.on('end', function(){
  xml.unpipe(stdin)
});

2) I updated the XMLValidator.java class to pass the stdin stream directly to the SAX parser:

validator.validate(new StreamSource(System.in));

And I changed the call of the validateXML function:

var readstream = fs.createReadStream("test.xml");
validator.validateXML(readstream, 'feed-structure.xsd', function(err, result) { . . . . }

Yes, using a string for the file name of the XML file is no longer possible, but as a result of the changes the memory usage decreased dramatically. For example, the memory usage before the changes (160MB XML file; Java was called with -Xmx1280m, because 1024m was too small for processing):

*** Memory before start ***
free memory: 45.628
allocated memory: 47.104
max memory: 1.165.312
total free memory: 1.163.836
Start validation
*** Memory after end ***
free memory: 233.916
allocated memory: 1.309.184
max memory: 1.309.184
total free memory: 233.916

and after the changes (for the same XML file), without the -Xmx...m option:

*** Memory before start ***
free memory: 45.628
allocated memory: 47.104
max memory: 699.392
total free memory: 697.916
Start validation
*** Memory after end ***
free memory: 35.381
allocated memory: 46.592
max memory: 699.392
total free memory: 688.181

The function from http://stackoverflow.com/questions/74674/how-to-do-i-check-cpu-and-memory-usage-in-java was used to print the memory usage.

nikku commented 8 years ago

Thanks for your feedback.

I am happy to make improvements to the library, if you provide them as part of a pull request.

To be honest, I have not validated a 160 MB XML file with this library yet :wink:.

kovalav commented 8 years ago

Sorry, did you mean pull, or commit?

If I'm not wrong, with a pull request I can only download updates from GitHub.

At least in Eclipse (EGit plugin), I'm using the Pull command to download changes from GitHub to my local projects.

deepaknverma commented 7 years ago

@kovalav I can understand what you are trying to do in point 1, and then in point 2 where you are passing stdin to the SAX parser. What I can't understand is where you want to pipe the XML through in validator.js, lines 126-127.

Happy to make a PR with the changes and test cases.

kovalav commented 7 years ago

Because in my case the xml parameter is a stream, not a string.

In the latest version of my updates I made the following changes to lines 126-127:

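`if (typeof xml === 'string') { // string
  stdin.write(xml);
  stdin.write('\n---end---\n');
} else { // stream
  // keep stdin open after the source ends, so the end marker can still be written
  xml.pipe(stdin, { end: false });
  xml.on('end', function() {
    // all data has been forwarded at this point; now send the end marker
    stdin.write('\n---end---\n');
  });
}`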
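The string/stream branching can be tried in isolation, without Java, by writing into an in-memory stream. The following is only a sketch under assumptions: `sendXml` is a hypothetical helper name, a `PassThrough` stream stands in for the Java process's stdin, and the end marker is the one used in the comment above. It passes `{ end: false }` to `pipe()` so the destination stays open and the marker can be written once the source has ended.

```javascript
const { PassThrough, Readable } = require('stream');

const END_MARKER = '\n---end---\n';

// Hypothetical helper: sends XML (string or readable stream) to `stdin`,
// followed by the end marker.
function sendXml(xml, stdin) {
  if (typeof xml === 'string') {
    stdin.write(xml);
    stdin.write(END_MARKER);
  } else {
    // Do not let pipe() close stdin when the source ends.
    xml.pipe(stdin, { end: false });
    xml.on('end', function() {
      stdin.write(END_MARKER);
    });
  }
}

// Demo: a PassThrough stands in for the child process stdin.
const sink = new PassThrough();
let received = '';
sink.on('data', function(chunk) {
  received += chunk;
});

const source = Readable.from(['<root>', '<item/>', '</root>']);
sendXml(source, sink);

source.on('end', function() {
  // Give the sink a tick to flush, then show what arrived.
  setImmediate(function() {
    console.log(JSON.stringify(received));
  });
});
```

With the default `pipe()` behaviour the destination is ended together with the source, so a marker written afterwards would fail with a write-after-end error; that is the reason for `end: false` here.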

My goal is not a PR; I just hope the changes above will help somebody who has to deal with very big XML data.

deepaknverma commented 7 years ago

Makes sense. I found some issues with FileInputStream in line 122. The program can potentially fail to release a system resource. Resource leaks have at least two common causes:

nikku commented 7 years ago

Sorry for the delay. The parser now accepts readable streams, too, and has been optimized to work directly on these (Java side). This should fix this issue.