ouch-org / ouch

Painless compression and decompression in the terminal

https://crates.io/crates/ouch

Other

2.19k stars 75 forks source link

progress indicator #632

Open cizmazia opened 6 months ago

cizmazia commented 6 months ago

Is there a plan to add a progress indicator back? (It is a must have for me at least.) Thanks!

Was removed in #300.

marcospb19 commented 6 months ago

No real plans yet, sadly.

It would take effort to make it go nicely with multithreading and our Y/N user prompts.

~~We don't have enough contributors to do that while ensuring quality, so I prefer letting our "small workforce" focus on low-hanging fruit and polishing.~~

~~Well, if this gets enough :+1:s, maybe I'll take my shot at this lil rewrite.~~

AntoniosBarotsis commented 6 months ago

I was looking into the original pr cause it seems interesting and I had a few questions

Does that code use the size of the compression output buffer as the current progress towards the original file size? The code was a bit confusing so just want to make sure
Where is multi-threading being used? I saw 2 instances of thread::spawns for listing files (in the zip and tar list_archive methods) but that should be irrelevant as the progress bar would be more applicable to compressing files no? I also see you added some rayon imports for parallel iterators but can't see them used anywhere (I may be blind).
If this were to hypothetically be implemented again, would it make sense to even have an actual progress bar? I feel like in the case where you are compressing folders with many small files, having an even textual file count might be more convenient instead of seeing a progress bar flicker between 0 and 100%, adding both would probably be the better idea or maybe a file progress bar could appear only if the file is "sufficiently large" for it to make sense.
How would this fit with existing logs and/or the accessibility mode? Do we want to for example have a progress bar mode that only shows the bar and nothing else?

cizmazia commented 6 months ago

Do we want to for example have a progress bar mode that only shows the bar and nothing else?

I think this is the most important use case from the usability perspective, because when users are waiting for 10+ seconds, they want to perform other tasks while waiting, so they should be given feedback indicating when the task is going to be completed.

marcospb19 commented 6 months ago

@AntoniosBarotsis

I was looking into the original pr cause it seems interesting and I had a few questions

Does that code use the size of the compression output buffer as the current progress towards the original file size? The code was a bit confusing so just want to make sure

Where is multi-threading being used? I saw 2 instances of thread::spawns for listing files (in the zip and tar list_archive methods) but that should be irrelevant as the progress bar would be more applicable to compressing files no? I also see you added some rayon imports for parallel iterators but can't see them used anywhere (I may be blind).

If this were to hypothetically be implemented again, would it make sense to even have an actual progress bar? I feel like in the case where you are compressing folders with many small files, having an even textual file count might be more convenient instead of seeing a progress bar flicker between 0 and 100%, adding both would probably be the better idea or maybe a file progress bar could appear only if the file is "sufficiently large" for it to make sense.

How would this fit with existing logs and/or the accessibility mode? Do we want to for example have a progress bar mode that only shows the bar and nothing else?

I'm not completely sure :smile:, the answer is in #300 diff, but indeed, the code is confusing.
Search for "par_iter", and usages of the gzp crate (for Gzip and Snap).
I think people prefer bars instead of just a counter, so probably a bar.
two questions in one so: a. We'd disable bar for accessibility mode. b. We can have both the bar and logs, similar to cargo when compiling, it logs completed crates and keeps the last line for the updated progress (counter instead of bar, but you get the idea).

@AntoniosBarotsis, we already have a problem with multithreading not going well with our STDIN user prompts, so, I'll explain briefly my "rewrite idea".

If we want to do multithreading properly, we need a main thread that "orchestrates" the other threads, and can handle the STDIN prompts correctly. Threads would send logs via channels, main thread would be the only user of the STDOUT lock, and output all log lines before updating the bar.

Otherwise, we need another way for threads to synchronize STDIN usage, and to prevent them to write to STDOUT while the bar is being drawn.

STDIN is used when asking users to overwrite a file, for example.

The problem with each thread locking STDOUT is that this is all blocking IO, so threads stop compressing and decompressing when waiting for a lock to be freed. With that in mind, I think we need to use channels (which leads me to the orchestrator idea).

This enables a progress bar and multithreaded compression (reminder for the future: remove that env::set_current_dir call and fix paths, or else, it won't work).

:v Any thoughts on this?

AntoniosBarotsis commented 6 months ago

Some thoughts after going through your reply and the code again.

Decompressing

STDOUT

I think getting the logs to work across threads properly wouldn't be too much of an issue in principle but I don't think it would be just a few changes to the info macro either. Looking at this par_iter usage, we could use an mpsc channel, giving each thread its own sender and keeping the receiver in the main thread. My issue with this is that we will need to find all occurrences of logs that originate from here and make sure they get passed through the sender instead to avoid locking stdout.

What I am a bit hesitant with how easy it would be for someone to accidentally introduce print statements in the code that would negate this. A way around this would be to deny the usage of println and dbg with Clippy to make sure prints only happen via the info! macro.

STDIN

Stdin seems like it will be more annoying. I think a "simple" way to work around it would be the following:

have this par_iter return something like Result<DecompressStatus, Error> instead of Result<(), Error> where DecompressStatus = { Complete, NeedsUserInput }.
Filter the Complete ones (keep NeedsUserInput).
At this point we are back in the main thread and the other threads are already done so just ask for user input.

Now from what I understand, the asking for user input here will be essentially after the heavy work has been done (this being the actual decompression) and we only ask the user if they want to overwrite/create files. This means that there is no performance gain from doing this part in parallel so sequential is fine. If that is not the case, we could put stuff at Rayon's job queue.

This I think is simple enough but it does have a problem: the user would have to wait for all decompressions to finish before being asked for input on the files that need it. I think that that is not a big deal and fixing this would be a lot more complicated which is funny of me to say considering my suggested solution is almost describing a budget async runtime 😅.

[!NOTE] Small edit, technically the par_iter itself does not return that, what I meant was more like instead of using try_for_each we can use a map

Progress Bars

I think much of my confusion is that I'm not sure whether we want a progress bar for compressed single files or for stuff like zips and tarballs. I'm guessing that there is no real way of making this work (at least for decompression) without somehow keeping of how much of the file we've already read? Should decompression work the same way where we try and figure out how much of the compressed file we've read already? I'm not sure if these algorithms (de)compress as they read the file or if they need to read all of it first.

Side note, the multi example from indicatif seems perfect for this (gif from their repo). This seems to work from other threads which makes me wonder if it will cause problems when we try to include it alongside our logs.

Closing

I think the stdin/stdout stuff is pretty doable but they also seem like they should be implemented regardless of whether the progress bar returns or not.

Maybe "should" is a strong word, something like "a separate problem that could potentially be an issue" is a little more accurate.

marcospb19 commented 6 months ago

One problem regarding the STDIN part, a thread can't return NeedUserInput when asking to overwrite, (almost) no (de)compression is done in memory, it's all straight to the filesystem, and sometimes a folder needs to be overwritten in order to create the files inside.

Those threads need STDIN confirmation to then continue the work they're doing.

So no returning early is allowed, the work needs to resume :thinking:.

AntoniosBarotsis commented 6 months ago

This makes it decently more convoluted cause it looks like we would need bidirectional communication between the main and the worker threads then. Maybe we should move this to a separate issue at this point 😅. I guess a simple way to test for this now is to do something like this

ouch c README.md tmp.tar
ouch d tmp.tar tmp.tar

The logs get a bit messed up and it is not immediately obvious that u got asked about overwriting the file twice.

I don't like the idea of having a bunch of channels everywhere as I feel like that would make the code potentially more annoying to work with but at the same time, I'm not sure how else to go about this.

We could have the main thread receive messages from the worker threads that would be something like "print this string" and "show this message and ask for user input". Prints could be buffered both to not lock stdout all the time but also to allow reads to take precedence over them, this would prevent logs from looking like this for the example I showed earlier

[INFO] Created temporary directory [...]\ouch\.\.tmpfmajP3 to hold decompressed elements.
[INFO] Created temporary directory [...]\ouch\.\.tmpRYc9qN to hold decompressed elements.
[INFO] ".tmpfmajP3\\README.md" extracted. (5.99 kiB)
[INFO] ".Do you want to overwrite '.\README.md'? [Y/n] tmpRYc9qN\\README.md" extracted. (5.99 kiB)
Do you want to overwrite '.\README.md'? [Y/n] # Prints stop here but my input is actually for the line above

An advantage of sending the logs to the main thread instead of printing them directly would theoretically also be that the threads would not stop their work and wait until they can lock stdout anymore since only main would be doing that.

This would essentially be like implementing a shared message queue (or two, one for worker-> main and one for main->worker) where messages from the main to the worker threads would need to have some identifier for the correct thread to pick them up (thread-id would likely not work for this alone since threads could be reused).

marcospb19 commented 6 months ago

Exactly what I was thinking! Even though it adds complexity, I don't see a better way to move with it.

Thanks for bringing up those points, that's probably the way forward.

AntoniosBarotsis commented 6 months ago

I was tinkering with this for a bit and ended up with something:

$ cargo r -- d tmp.tar tmp.tar
Created temporary directory C:\Users\anton\Documents\Github\ouch\.\.tmpbfFY84 to hold decompressed elements.
Created temporary directory C:\Users\anton\Documents\Github\ouch\.\.tmpoShACS to hold decompressed elements.
".tmpbfFY84\\README.md" extracted. (5.99 kiB)
".tmpoShACS\\README.md" extracted. (5.99 kiB)
Do you want to overwrite '.\README.md'? [Y/n] # <-- waits here for input
Do you want to overwrite '.\README.md'? [Y/n] # <-- waits here for input
Successfully moved C:\Users\anton\Documents\Github\ouch\.\.tmpbfFY84\README.md to .\README.md.
Successfully decompressed archive in current directory.
Files unpacked: 1
Successfully moved C:\Users\anton\Documents\Github\ouch\.\.tmpoShACS\README.md to .\README.md.
Successfully decompressed archive in current directory.
Files unpacked: 1

I ended up doing this slightly differently than I was initially considering; I made a log sender/receiver channel for sending messages from worker threads to be logged but I didn't do anything for the stdin stuff we were discussing. I'm not sure if there's a point making another set of channels for requesting user input and then sending said input back since each will have to lock anyway, in retrospect it feels like its just a lot of complexity for nothing although I may be missing something here 😅.

What I did instead is lock stdout from Confirmation::ask so that other logs do not interfere with this one and then of course drop the lock once the function is done.

The logging itself is a bit different as well. Initially, I was trying to just log everything from the main thread, the problem with that was that that files.par_iter ... statement would prevent execution from progressing since try_for_each waits for tasks to finish before exiting. I instead put the logging "logic" in a rayon task. Inside that task I notify the main thread when all senders are done sending logs at which point the main thread exits.

This is incomplete as I haven't made these changes for all logs and prints, just the ones that my test would hit and I'm also not sure if I've made a large oversight somewhere that I failed to notice, hence why I didn't make this into a pr yet, I'm still at a proof of concept stage in my head. The code itself could also likely be more neat, I'm not immensely fond of the API as is right now but I didn't give it enough thought yet.

How does this help with progress bars?

Well so far, this doesn't really do anything. I'm not sure how we should deal with asking for user input while showing the progress bar but all these channels being set up already would make something like sending progress data to the logging thread trivial which could be the way to go about that.

Let me know your thoughts! Curious if this is going in a different direction than you were expecting initially or if you feel like (some of) this is a bad idea.

marcospb19 commented 6 months ago

Cool! I'd like to see that commit (even if it's half-baked and needs polishing).

I'm not sure how we should deal with asking for user input while showing the progress bar

Well, that's a good point. Taking a step back to the progress bar, what would STDIN prompts look like?

Imagine the following:

<log 1>
<log 2>
<log 3>
<progress bar>

What happens when a question arrives?

<log 1>
<log 2>
<log 3>
<progress bar>
<QUESTION HERE?>

Maybe the progress bar should freeze, perhaps change color, and say "(waiting)", when question is answered, all accumulated logs would be printed and the progress bar would be updated again.

I'm not sure if there's a point making another set of channels for requesting user input and then sending said input back since each will have to lock anyway

I think locking STDOUT will block (de)compression progress in other threads. We also know that logging from a worker thread causes #77.

I'm starting to consider having just the progress bar and removing the verbose logs, but putting them behind --verbose (just for "better defaults", we'd still need some way to orchestrate all of this), what do you think? :thinking:

AntoniosBarotsis commented 6 months ago

Performance

I think locking STDOUT will block (de)compression progress in other threads. We also know that logging from a worker thread causes https://github.com/ouch-org/ouch/issues/77.

This probably needs some more investigating to be sure but I think this may not be the case anymore since the worker threads are not blocked by the stdout lock since they send messages over the channels I added in my commit. I didn't formulate my sentence too well, I was talking about having channels that receive "read user input" requests and then send the user input back to the worker thread that asked for it because I don't see how that would be an improvement over what is currently implemented in my fork which has the worker threads block stdout themselves when they want user input. The reason I don't think this should slow them down is because, from what I understand, they used to slow down because each worker thread would try and lock stdout for their verbose logs, this is no longer the case though since they send their logs to the logging thread without the need to lock anything.

By having that middleman logging thread I think the issue should go away. Relevant code for that logging thread is here.

~Of course, actually buffering the logs would likely help a bit more, this is easy to do in the logging thread. We could have an array of size say 5, fill it up with received logs and when its full (or when all senders are dropped) we print all of them at once.~

Ok so I went ahead and implemented buffering and set the buffer size to 10 for now. This is significantly faster than before. I made very rough benchmark where I zipped one of the git repos I had locally, copied the resulting zip so I have 2 archives to decompress in parallel and timed the following: `cargo r -r -- d test1.zip test2.zip -y. Before any of my changes, this particular repo took about 50 seconds, with the changes + buffering it only took 12.

Something I did not fully expect was how increasing the buffer size by a lot (I tried 500) did not increase performance that much, it was only faster by around a second. I was expecting this to be a bit more slower because of how many files the repo had but I guess its great that it is not since we don't have to mess much with the buffer sizes haha.

Commit is here.

I have not yet tried asking for user input though but considering that with my example, I only get asked once for the entire folder, this should cause no issues.

Progress bar

Maybe the progress bar should freeze, perhaps change color, and say "(waiting)", when question is answered, all accumulated logs would be printed and the progress bar would be updated again.

I'm not sure how well progress bars work with for example the movement that would be caused in the terminal by printing the question text; in my experience things tend to break somewhat when that happens (might end up with 1 progress bar that does not get updated and a new one that does or something similar to that). I haven't used indicatif specifically so I don't know if they have this issue.

I'm starting to consider having just the progress bar and removing the verbose logs, but putting them behind --verbose (just for "better defaults", we'd still need some way to orchestrate all of this), what do you think? 🤔

Honestly this seems like a better idea.

AntoniosBarotsis commented 6 months ago

Does the implementation of this make sense to you? I could finish up changing the rest of the logs to use this system of using a channel to send the logs to the logging thread and make a PR for it

marcospb19 commented 6 months ago

Nice graph and explanation, thanks. I misread what you said but now I understand.

It does make sense and I'm surprised by the 50s -> 12s improvement, looks promising :smiley:.

Before giving a green light for a PR, I wonder if the "Logger" couldn't be replaced by "Main".

Imagine you construct a channel::<Option<Message>>(), where a worker sends None to report it has finished.

(Instead of Option<Message>, we could create an analogous enum type to improve the semantics.)

Then main can receive all channel messages and do the logging with buffering, when N "finished" messages are received, it can consider all work to be done.

What do you think? Would that also work? If so, it might be a simpler solution.

AntoniosBarotsis commented 6 months ago

The issue with that is that technically the code is blocking:

The logging itself is a bit different as well. Initially, I was trying to just log everything from the main thread, the problem with that was that that files.par_iter ... statement would prevent execution from progressing since try_for_each waits for all tasks to finish before exiting.

That is what I tried initially. The code itself would not really be much different; there wouldn't be any need for that mutex+condvar stuff I did to not keep the main thread busy waiting but everything else would be the same.

marcospb19 commented 6 months ago

Ooh that's right, I tried a better solution but had no success, we need to early-return any errors so your solution seems optimal.

Then it all sounds good to me :heavy_check_mark: :slightly_smiling_face: .

roland-5 commented 6 months ago

What happens when a question arrives?
<log 1>
<log 2>
<log 3>
<progress bar>
<QUESTION HERE?>

Wouldn't it be better if the progress bar appeared only after answering a possible question, and not before it?

AntoniosBarotsis commented 6 months ago

(almost) no (de)compression is done in memory, it's all straight to the filesystem, and sometimes a folder needs to be overwritten in order to create the files inside

If I understood this correctly then a decent amount of work needs to be done before we know whether we need to ask a question or not, it is not something we can know ahead of time, that's the issue here. We put the progress bar before the question because we know we will be making progress before we know whether a question needs to be asked.

Later on we figured that limiting logs would be best (hide all info behind a --verbose flag) but still not sure how to tackle the questions. My concern is whether the progress bar will be able to appear properly when new stuff is printed because in my experience newlines tend to screw these things up but will need to try and see.

roland-5 commented 6 months ago

Thank you for explanation. Yes this actually makes sense and is logical. I have no idea about programming, so I can't provide proper support, but I read the entire thread again and I understand the problem you wrote about.

I know and associate the progress bar with TUI applications and especially GUI applications, but I have never used a CLI application that would have a progress bar. What would it even look like?

If new text appeared and break terminal flow, couldn't the progress bar simply "break" and reappear a column below?

marcospb19 commented 6 months ago

ANSI escape codes are the way we can talk to terminals, it's how indicatif re-draws progress bars in the same line.

couldn't the progress bar simply "break" and reappear a column below?

We can clear the line where the bar is, then output a log line, and redraw the bar in the next line (I believe it is the only way).

Besides adding a "WAITING" message, we should also change the progress bar color when waiting for a question, to make it more distinguishable (should be easy to do).