jrnxf / thokr

✨ sleek typing tui with visualized results and historical logging
MIT License
528 stars 18 forks source link

Added wikipedia support #13

Closed margual56 closed 2 years ago

margual56 commented 2 years ago

asciicast

Summary

This PR adds support for Wikipedia!!

I added an option, -W, which will select a random article from Wikipedia and then make you type the summary of the article.

I tried to add all the logic into its own file, to avoid cluttering the main.rs file.

Changes

Why

I think this is a cool addition to the program! Typing real-world text is better than randomly generated chains of words.

In the future, an option could be added to change the article's language, or for the user to be able to provide an article title and select that one.

How

With reqwest! Wikipedia has an awesome API (here is the documentation), so I just tell it to give me a random article and then make a summary of that article. That summary is what will be displayed.

Really all the complexity is because all the error checking and parsing.

Checklist:

I tried to comment all the relevant code. I would love feedback on how you would like to handle the errors when parsing or making a request.

Hope you like this addition! πŸ˜„

margual56 commented 2 years ago

Okey, I have spotted a problem. Wikipedia contains a lot of non-latin characters, as well as "weird" accents of all types.

One solution is to just remove all non ascii characters. But I would prefer a solution that involved converting the characters to a valid format instead of removing them.

I will keep looking...

jrnxf commented 2 years ago

this is such a great idea!! Thank you for taking the time to propose this! I definitely want to merge this, but agree that supporting the odd characters is definitely a blocker. I'm guessing you're seeing the same issue I called out in the readme?

I decided not to launch thokr with languages besides english because of some odd rendering issues I was experiencing when trying to input characters with accents. It's as though I'm not able to properly input the character in raw mode. I'd love to have that figure out before shipping other languages because I personally felt the experience was a bit jarring. I'll open an bug report for it with more details and replication steps -- would love more eyes on that problem!

margual56 commented 2 years ago

I'm guessing you're seeing the same issue I called out in the readme?

I am, but that's because this PR only adds an alternate method to load text (the backend is the same, so it has the same issues too).

However, my main concern with Wikipedia was that if a non ascii character showed up, you were basically screwed πŸ˜† So an article might have Russian characters, which is fine for Russian people, but I had no way of writing them.

TLDR: Yes, the application crashes when typing a non ascii character, but there's also the problem of articles showing characters you can't type

The current solution is to just remove the non ascii characters since they would crash the program anyway πŸ˜… . I have no solution for the main rendering tho, sorry.

margual56 commented 2 years ago

I was writing a book here trying to explain myself πŸ˜† I will start over.

I looked into it and... it seems hopeless(?)

The problem

To write Chinese, Russian, etc; you need UTF-16 (16-bit). However, rust is not made with UTF-8 (8-bit) in mind so:

byte index 6 is not a char boundary; it is inside 'Γ’' (bytes 5..7)

5: core::str::slice_error_fail at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/str/mod.rs:86:9

Rust

From rust/library/core/src/str/mod.rs:

// 3. character boundary
let index = if !s.is_char_boundary(begin) { begin } else { end };
// find the character
let char_start = s.floor_char_boundary(index);
// `char_start` must be less than len and a char boundary
let ch = s[char_start..].chars().next().unwrap();
let char_range = char_start..char_start + ch.len_utf8();
panic!(
    "byte index {} is not a char boundary; it is inside {:?} (bytes {:?}) of `{}`{}",
    index, ch, char_range, s_trunc, ellipsis
);

So yes, it's calculating the char range based on its 8-bit length (UTF-16 is 16-bit).

TLDR

It seems like you can't slice a UTF-16 String... and you are using slicing in the ui.rs

I really don't know enough about rust's support for UTF-16 or how do other applications support it, but I'd say the best solution is to just remove non UTF-8 characters for now 🀣 It's either that or revamp the entire ui.rs to avoid slicing Strings.

margual56 commented 2 years ago

Hi, this PR needs some love. Could you please give feedback on what needs to be changed to get accepted?

Or if you don't want this functionality, just close the PR, it's okey πŸ‘πŸ»

jrnxf commented 2 years ago

Sorry for the late response! I definitely love this idea but think that the chances of getting unknown characters and not being able to type them is a deal breaker. I'm going to close this as I don't have time myself to work on this, but if you're still interested in seeing this through and have ideas on how to mitigate the issue let me know!