soerenmeier / parse-wiki-text-2

MIT No Attribution
5 stars 5 forks source link

Add `max_ms` parameter to parse function to limit execution time #5

Closed emmalexandria closed 8 months ago

emmalexandria commented 8 months ago

Solves #4.

I was thinking it's also possible to change the code of configuration.rs to the following:

#[must_use]
    pub fn parse<'a>(&self, wiki_text: &'a str) -> crate::Output<'a> {
        crate::parse::parse(self, wiki_text, 0).unwrap()
    }

    #[must_use]
    pub fn parse_with_timeout<'a>(&self, wiki_text: &'a str, max_ms: u128) -> Result<crate::Output<'a>, &'static str> {
        crate::parse::parse(self, wiki_text, max_ms)
    }

and then modify the code I put in parse.rs so that if the max_ms parameter is 0, no limit is applied. That would mean that people who don't need to use the timeout code won't be affected if it's removed in future. I'm happy to implement that in this fork if it seems like a good idea.

soerenmeier commented 8 months ago

Thanks for the PR.

I think in almost all cases the execution time should be limited. Maybe there is a good default that could be used in the parse function. Since you parsed a lot of wikimedia text, do you know the average execution time? Then we could have two functions customizing this behaviour: parse_with_timeoutand parse_without_timeout.

I think it might be useful to have the parsed nodes and warnings still available even if the timeout occurred. Either there could be a property in crate::Output something like errors or the return type is like Result<Output, Error> with the error being something like enum Error { TimedOut { execution_time: Duration, output: Output } }.

Instead of u128 use std::duration::Duration that seems more appropriate.

What do you think?

emmalexandria commented 8 months ago

That all sounds good to me. I'll add some quick and dirty code to my Wikipedia dump parser to output an average parsing time to a CSV file.

emmalexandria commented 8 months ago

I added those suggestions minus a few things. I'm not good enough with Rust to understand the lifetimes with Output, so I didn't include it in the return value. If anyone knows the correct way to do so, then I'll definitely add it.

In addition, I'm yet to have a figure for the average time to parse an article, but I stuck with 5 seconds as a solid default. Once I have a good figure for it, I'll update the default.

soerenmeier commented 8 months ago

Thanks