justinwilaby / sax-wasm

The first streamable, fixed memory XML, HTML, and JSX parser for WebAssembly.
MIT License
168 stars 8 forks source link

Rust crate and changes to `EventListener` signature #70

Open samuelcolvin opened 1 year ago

samuelcolvin commented 1 year ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Hi, I'm using sax-wasm from rust (it's great 🙏), I'm currently using git submodules, which is working fine, but clearly a proper rust crate would be ideal. (To be clear - a rust crate would be good long term, but I don't actually need it right now)

Secondly, I'm having some trouble setting up a "safe" way to collect nodes of an xml document in rust.

The problem is

pub type EventListener = fn(event: Event, data: &dyn Encode<Vec<u8>>);

The only way I can see to mutate an object in a function which matches this signature is to use a static mut and then unsafe, which isn't ideal.

Maybe I'm missing something obvious, but it would be great to be able to process "event/data"s and mutate a struct without resorting to unsafe.

Describe the solution you'd like A clear and concise description of what you want to happen.

I guess either:


Let me know if the request is unclear and I can add specific examples.

samuelcolvin commented 1 year ago

okay, I have this working using lazy_static! and Mutex, not sure on the performance impact of the Mutex, but otherwise that's working for me.

samuelcolvin commented 1 year ago

@justinwilaby I see you added some labels, does that mean you'd be happy to support this?

If so, what would your preferred approach be?

It would also be useful to extend Encode to provide methods to get target and content without encoding and decoding them - would you be okay to accept this change? It shouldn't have any affect in the wasm case since the methods wouldn't be used and should be compiled out.

justinwilaby commented 1 year ago

@samuelcolvin - Yes. I believe these changes would be beneficial for sure.

I like the idea of implementing the IntoIterator trait to SAXParser but this may become complicated when building an aggregator for the various nodes encountered when processing a call to write. If we do implement IntoIterator, what exactly is iterated? Tags, attributes, text nodes or all the above?

I have toyed with the idea of using IntoIterator as super trait of Encode with the XML root treated as an Encode type. From there, it is possible to go further and provide implementors a way to walk the entire XML tree conditionally in a similar way that Acorn Walk does for a JavaScript AST.

The major problem to overcome is avoiding the memory hit of storing all entities in the aggregator which may be a side effect of iterating especially when a very large XML document is fully loaded before iteration begins.

With this said, in the short term, updating the signature of Encode seems to be the most practical. For wasm targets, keep the event handler as-is. For C bindings or a crate, just return a pointer to Encode.

samuelcolvin commented 1 year ago

Thanks for the response.

For me an iterator that yields Events would be fine - that's what I'm working with at the moment.

I'd also be happy with a next() or get_event() method on SAXParser without a formal iterator trait.

What I currently have is

lazy_static! {
    static ref TEMPLATE_BUILDER: Mutex<TemplateBuilder> = Mutex::new(TemplateBuilder::new());
}

fn event_handler(event: Event, data: &dyn Encode<Vec<u8>>) {
    TEMPLATE_BUILDER.lock().unwrap().event_handler(event, data);
}

fn parse_sax(data: &PyBytes) -> PyResult<PyObject> {
    let mut p = SAXParser::new(event_handler);
    p.events =
        Event::Text as u32 |
        Event::Doctype as u32 |
        Event::Comment as u32 |
        Event::OpenTagStart as u32 |
        Event::Attribute as u32 |
        Event::CloseTag as u32 |
        Event::Cdata as u32;
    let b = data.as_bytes();
    p.write(b);
    p.identity();
    let mut t = TEMPLATE_BUILDER.lock().unwrap();
    // get the node
    t.reset();
}

it would be nice to get rid of the lazy_static if possible.