jeertmans / languagetool-rust

LanguageTool API in Rust
https://docs.rs/languagetool-rust
MIT License
157 stars 6 forks source link

refactor: further separate CLI logic from the API related functionality (see #117) #124

Open Rolv-Apneseth opened 1 month ago

Rolv-Apneseth commented 1 month ago

Continuation of #117.

As discussed, some refactors. You will probably want some more done here and that's OK, just let me know what to do.

One thing for sure is the String values in api/check.rs - which ones exactly would you like me to make &str or Cow<'_, str>?

Also, for cli/mod.rs, I could potentially remove the need for the manual call of cmd.execute for each subcommand using the enum_dispatch crate if you're OK with adding a dependency? Fine as it is anyway though in my opinion, as there aren't many subcommands.

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #124 will not alter performance

Comparing Rolv-Apneseth:refactor-v3 (c7d342c) with v3 (a7247e4)

Summary

✅ 6 untouched benchmarks

jeertmans commented 1 month ago

Hi @Rolv-Apneseth! I have quickly looked at your PR, it looks very good! I will give it a better look later, but let me answer your first questions.

One thing for sure is the String values in api/check.rs - which ones exactly would you like me to make &str or Cow<'_, str>?

Well, I think &str should be enough, because we will never need to mutate (i.e., overwrite) the text strings, so keeping a reference should be fine. However, If I remember well, Clap makes this hard as you don't get reference to string read from the terminal, but owned string. So parsing text (either pure text or data annotation) from the terminal will required to be able to pass owned data, I fear. Using borrowed data is only useful for when reading files, as we may want to avoid allocating the full content of the files, and read by chunks instead.

So in conclusion, Cow<'_, str> may be the only solution (apart from String).

Also, for cli/mod.rs, I could potentially remove the need for the manual call of cmd.execute for each subcommand using the enum_dispatch crate if you're OK with adding a dependency? Fine as it is anyway though in my opinion, as there aren't many subcommands.

We can give it a try :-) Dependencies should be fine, especially as we can keep it under the cli feature.

Rolv-Apneseth commented 1 month ago

I have quickly looked at your PR, it looks very good! I

Great!

Well, I think &str should be enough, because we will never need to mutate (i.e., overwrite) the text strings, so keeping a reference should be fine. However, If I remember well, Clap makes this hard as you don't get reference to string read from the terminal, but owned string. So parsing text (either pure text or data annotation) from the terminal will required to be able to pass owned data, I fear. Using borrowed data is only useful for when reading files, as we may want to avoid allocating the full content of the files, and read by chunks instead.

Sorry, I meant which things in check.rs you actually wanted converted, but from this I'm assuming it's the Request.text? But I am confused how this would be achieved. Do we not need to allocate a String no matter what the source is? As for reading the file by chunks, do you want me to just make this use a BufReader and create and await requests one at a time instead of the way it is currently?

            let text = std::fs::read_to_string(filename)?;
            let requests = request
                .clone()
                .with_text(text.clone())
                .split(self.max_length, self.split_pattern.as_str());
            let response = server_client.check_multiple_and_join(requests).await?;

And for the text.clone() caused by server_client.check_multiple_and_join, I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

We can give it a try :-)

I've added it in there - I think it fits nicely.

Rolv-Apneseth commented 1 month ago

I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

I've pushed some changes there for this, avoiding the need to clone the input text.

jeertmans commented 1 month ago

Sorry, I meant which things in check.rs you actually wanted converted, but from this I'm assuming it's the Request.text?

Yes, Request.text and Request.data's inner fields. Mainly because one could create a vector of data annotations from a stream of Markdown tokens, which are usually &str of the initial source.

But I am confused how this would be achieved.

I think using Cow<'_, str> should make the trick.

Do we not need to allocate a String no matter what the source is?

Yes and no. Yes because reqwest will have to allocate a string for the HTML request, but no because we don't actually need the user to provide us an owned String. We don't care because reqwest will allocate anyway. So it's good to provide flexibility.

As for reading the file by chunks, do you want me to just make this use a BufReader and create and await requests one at a time instead of the way it is currently?

            let text = std::fs::read_to_string(filename)?;
            let requests = request
                .clone()
                .with_text(text.clone())
                .split(self.max_length, self.split_pattern.as_str());
            let response = server_client.check_multiple_and_join(requests).await?;

Yes, this would be great. The split-text logic isn't really trivial to solve, because we want to avoid splitting text where it would break the meaning of a sentence, but also we cannot have long-running text either. I think working on a better split-logic can be a next step. For the moment, we can just use BufReader and read the whole file, and feed the server with multiple requests (if text is too large). We could also read the file by chunks, but I think this is a fair assumption to assume that any file we want to check should fit completely in the memory.

And for the text.clone() caused by server_client.check_multiple_and_join, I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

Yes, the end-goal would be to remove the complex logic from Server and perform multiple requests merging outside, so the server only has one method per HTTP request, nothing more.

Rolv-Apneseth commented 1 month ago

Hey again. Had a lot of trouble with lifetimes and I could only get 'static to work - not sure I love the whole solution, but if you havea look at the latest commit, is that what you had in mind?

jeertmans commented 1 month ago

Hello @Rolv-Apneseth, rapidly looking at your commit, I think you just forgot to add the lifetime parameter in Request -> Request<'source> (I would prefer using 'source over anonymous lifetime names like 'a). This should explain why you could only use 'static.

However, it makes little sense to have reference in responses, because reqwest::RequestBuilder::send returns an owned Response. We can keep owned String in that case :-)

This will replicate a similar behavior to that of reqwest. E.g., the builder takes a reference to be constructed pub fn json<T: Serialize + ?Sized>(self, json: &T) -> RequestBuilder, but the response returns an owned struct: pub async fn json<T: DeserializeOwned>(self) -> Result<T>.

Rolv-Apneseth commented 1 month ago

I can have a go again but something was complaining about needing to live longer than 'static. As for Response, that makes sense.

jeertmans commented 1 month ago

I can have a go again but something was complaining about needing to live longer than 'static. As for Response, that makes sense.

No issue. If you can't make it compile, just push with 'source, and I will try to fix it myself :-)

Rolv-Apneseth commented 1 month ago

So this code:

#[cfg_attr(feature = "cli", derive(Args))]
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Hash)]
#[serde(rename_all = "camelCase")]
#[non_exhaustive]
pub struct Request<'source> {
    /// The text to be checked. This or 'data' is required.
    #[cfg_attr(
        feature = "cli",
        clap(short = 't', long, conflicts_with = "data", allow_hyphen_values(true))
    )]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'source, str>>,

Leads to this error:

error: lifetime may not live long enough
   --> src/api/check.rs:398:15
    |
391 | pub struct Request<'source> {
    |                    ------- lifetime `'source` defined here
...
398 |     pub text: Option<Cow<'source, str>>,
    |               ^^^^^^ requires that `'source` must outlive `'static`

error: lifetime may not live long enough
   --> src/api/check.rs:392:5
    |
391 | pub struct Request<'source> {
    |                    ------- lifetime `'source` defined here
392 |     /// The text to be checked. This or 'data' is required.
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ argument requires that `'source` must outlive `'static`
    |
    = note: this error originates in the macro `clap::value_parser` (in Nightly builds, run with -Z macro-backtrace for more info)

So I don't think it will work as it is. Should I take out clap from here and have a separate struct for the request arguments?

jeertmans commented 1 month ago

^ requires that 'source must outlive 'static

Oh yeah, I remember why I struggled with &str back then ^^'

Creating a separate struct would work, but that means duplicating a lot of the code. I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general... This isn't as easy as I imagined

Rolv-Apneseth commented 1 month ago

I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general

Not that I can think of anyway. I'd say the options are to either stick with String, use 'static, or a different struct. For the different struct, could an impl From keep fields in sync at least?

jeertmans commented 1 month ago

I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general

Not that I can think of anyway. I'd say the options are to either stick with String, use 'static, or a different struct. For the different struct, could an impl From keep fields in sync at least?

Ok so some news: at the moment, it seems impossible to do that in one struct, see https://github.com/clap-rs/clap/issues/5773 and the related discussion, so I suggest we duplicated the structure: to original Request, with Cow<'source, str>, and the clap-compatible RequestCli, with String. Maybe we can simply avoid duplicating code with a macro rule. If you don't see how to do that, I think I can put have some time during the weekend for this.

Rolv-Apneseth commented 1 month ago

I think it would have to be a procedural macro right? Can't see how to do that with a declarative one.

I have a bit of time today so I'll have a look but feel free to do it if you have a good idea of how it should be done

jeertmans commented 1 month ago

I actually started some work on your branch today, I will hopefully have something working this afternoon :)

Rolv-Apneseth commented 1 month ago

Oh nice, good luck

jeertmans commented 1 month ago

Ok so it compiles with --no-default-features (without CLI), but I fear it isn't possible to conditionally compile the #[deriver(Parser)] based on lifetime $lt == 'static.

I am looking a possible crates that might provide useful tools to avoid rewriting most of the code.

jeertmans commented 1 month ago

Unfortunately, I haven't been able to put more work on this PR this weekend, sorry for that.

I fear more and more that the copy/paste option would be the only way of solving this.

Rolv-Apneseth commented 1 month ago

Hey - been busy moving this weekend. If I have any time in the coming days I can have a look for any crates that could solve this problem - if not, yes, copy paste might have to do for now

jeertmans commented 4 weeks ago

Hey - been busy moving this weekend. If I have any time in the coming days I can have a look for any crates that could solve this problem - if not, yes, copy paste might have to do for now

No issue! Yeah, maybe let's copy and paste (keeping the 'source lifetime for the api/click.rs and use String in cli/checks.rs). The "easiest" solution would be to fix the upstream issue, i.e., do a PR to Clap's repository, but that I will only look at after this month,

Rolv-Apneseth commented 3 weeks ago

Yeah couldn't find any crates for this situation.

I do think just copying the struct is better but to provide another option, what do you think of something like this:

pub struct Request<'source> {
    // HACK: keep compiler happy when enabling the cli feature flag
    #[cfg(feature = "cli")]
    #[clap(skip)]
    _lifetime: PhantomData<&'source ()>,
    /// (...docs)
    #[cfg(feature = "cli")]
    #[clap(short = 't', long, conflicts_with = "data", allow_hyphen_values(true))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'static, str>>,
    /// (...docs)
    #[cfg(not(feature = "cli"))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'source, str>>,
    /// (...docs)
    #[cfg(feature = "cli")]
    #[clap(short = 'd', long, conflicts_with = "text")]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub data: Option<Data<'static>>,
    /// (...docs)
    #[cfg(not(feature = "cli"))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub data: Option<Data<'source>>,

As for the upstream issue - I had a look but I believe it's beyond my skills. Needs quite a bit of familiarity with how their derive macros work, and the author seems to agree having assigned it that E-hard tag. If you want to hold this off until you have a look at making a PR yourself though, that would be fine by me.

jeertmans commented 3 weeks ago

e struct is better but to provide another option,

This is not a wrong idea, but the issue is that compiling with cli will then enforce 'static lifetime everywhere. Which is something we want to avoid even in the CLI, when we read text from files with ltrs check FILES...

This is an edge case, but we don't want to have to keep a static lifetime reference to the content of each file, where we don't need it. So copying the struct is the only we to both have Clap happy and still be able to use non-static lifetime when desired.

As for the upstream issue - I had a look but I believe it's beyond my skills. Needs quite a bit of familiarity with how their derive macros work, and the author seems to agree having assigned it that E-hard tag. If you want to hold this off until you have a look at making a PR yourself though, that would be fine by me.

I agree, this is not easy, especially as it involves proc macros!

Rolv-Apneseth commented 3 weeks ago

So, it's not ideal, and I had to use the lifetime crate, but what do you think of the latest commit?

Perhaps it's better to leave those fields as String until clap allows us to avoid all the duplication?