shepmaster / sxd-document

An XML library in Rust
MIT License
154 stars 36 forks source link

Are raw pointers and unsafe actually needed? #68

Open AxFaure opened 5 years ago

AxFaure commented 5 years ago

Hello,

This is probably more of an open question than an actual issue.

There seems to be quite a lot of unsafe code in this crate (mostly pointer handling) and there are some issues that mention memory unsafety. So I was wondering: Is this unsafe code actually needed?

Let me know if I am wrong but I guess the use of pointers was preferred over Rc/Weak references to get better performances. But, on the other hand, this crate offers a DOM interface which is expected to be a bit slower and heavier than SAX, and we usually expect memory safety from rust crates. So, do you think replacing the unsafe code would have so much of an impact on performance that it is worth the potential safety/security issues of dealing with complex pointers relationships?

I am in no way an expert in Rust (or XML parsing) by the way so do not hesitate to let me know if I am totally wrong.

As a completely unrelated side note, you seem to have worked quite a lot on this crate (thank you for sharing it with us!), so I was wondering if you consider the DOM part of the project to be production ready or if you think it still needs more work?

Thanks

shepmaster commented 5 years ago

Let me know if I am wrong but I guess the use of pointers was preferred over Rc/Weak references to get better performances.

That is correct. One of the stated goals is to beat out libxml2. Everything I know about Rust says that this should be possible, or at least get within a very close margin.

To that end, I periodically compare the performance of sxd-document / sxd-xpath against similar libxml2 implementations. I have yet to get as close as I'd like.

slower and heavier than SAX

It's true (and there's a SAX implementation inside of the code, just not exposed as a public API) for this reason. Theoretically, parsing the XML itself should require a constant amount of heap allocation and hopefully be speedy.

replacing the unsafe code would have so much of an impact on performance that it is worth the potential safety/security issues

I think if you look back far enough in the history, the DOM did originally use Rc and IIRC there was a benefit to changing it.

That being said, I'm not against experimenting with alternate ways of implementing the "DOM". There's a secret API I've named the "thin" API, for example. I'm almost certain that the thindom is hopelessly broken, but it's an attempt at a Rust-first thinking about the problem of modeling XML documents. It wouldn't surprise me if there were other, better ones out there.

If someone were motivated to give it a shot, I'd love to help guide them in experimenting with some different implementations.

shepmaster commented 5 years ago

consider the DOM part of the project to be production ready

yes

or if you think it still needs more work?

yes 😄

I don't have much of a reason to use SXD in my own day-to-day work, which means that certain pieces are likely lacking. You can see that a recent release finally added the ability to remove children, mostly because I was generally building up documents, not modifying existing ones. It never even occurred to me that these were missing! (Rather embarrassing). Of course, we are happy to get issue reports and PRs improving things!

AxFaure commented 5 years ago

Thank you for the answers. That's very clear.

I will try to look for an alternative solution for thindom and/or potential performance improvement when I have some time.

By the way, I assume you have defined some kind of benchmark you use to compare this crate and libxml2. Maybe you could share it and put something about it in the Readme. It may help bring some attention and help on this issue (for a newcomer, it is not obvious that there is a specific goal which is not reached yet in this area)

shepmaster commented 5 years ago

Maybe you could share it and put something about it in the Readme

Not a bad idea, but it involves a 100MB XML file, which isn't really the kind of thing you should put into git. I'll have to figure something out for that.