👋 Hello @Lechtr, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@Lechtr thanks for spotting this bug! I repeated your steps and I see the same truncated output. I'm not sure what the cause might be, but perhaps our W&B expert @AyushExel can help.
@AyushExel it seems W&B is truncating output somehow in Colab. These orange areas are missing when logging is enabled.
@glenn-jocher @Lechtr thanks for reporting. Looking into this now. I was able to reproduce this only on Colab; for me it works as expected on a Linux desktop. Can you also reproduce this on a local machine? I'm trying to get to the root cause of the problem and will update soon.
Also, my first instinct was that this is a problem related to tqdm with wandb but I cannot reproduce this in any example other than the train.py script. A simpler reproducible example would be very helpful for engineers. I tried this:
import wandb
from tqdm import tqdm
run = wandb.init()
l = [10,100,1000]
pbar = enumerate(l)
pbar = tqdm(pbar, total=3)
for i, x in pbar:
    print(x)
But this works fine. A simpler example would be very helpful.
Could it be an issue related to a new version of wandb? That might explain why it doesn't work in Colab, where the newest version is installed, but works on your local machine, where you might still have an older version installed.
I don't have time now, but will test it later.
@Lechtr I have tested with the latest version of wandb on my local machine and it works fine. The problem seems to be limited to Colab.
@AyushExel It is related to the latest wandb release, 0.10.24.
If I try the previous version, 0.10.23, with %pip install -Iv wandb==0.10.23, it works fine.
I will check the changes made in the new release, maybe I will find the problem.
@AyushExel In the changelog for 0.10.24 it says
Significant fixes to stdout/stderr console logging
so that seems to be the direction we are looking for.
In particular, this commit https://github.com/wandb/client/commit/f2b191e6c5073abec9eb4d8efe5a26c8cdd35d1d made some changes to the file wandb/sdk/internal/file_stream.py that seem interesting:
The following class in there sounds like a promising area to take a closer look at (line 72):
class CRDedupeFilePolicy(DefaultFilePolicy):
    """File stream policy that removes characters that would be erased by
    carriage returns.

    This is what a terminal does. We use it for console output to reduce the
    amount of data we need to send over the network (eg. for progress bars),
    while preserving the output's appearance in the web app.
    """
but I don't fully understand it at first glance; maybe you have more experience there.
It might also be an idea to open an issue in the wandb project.
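As a rough illustration of what that docstring describes (a toy sketch of the general idea, not wandb's actual implementation), this simulates how a terminal collapses carriage-return updates into the final visible line:

def dedupe_carriage_returns(line: str) -> str:
    # Each '\r' returns the cursor to column 0, so later text overwrites
    # earlier text on the same line; keep only what stays visible.
    visible = ""
    for segment in line.split("\r"):
        visible = segment + visible[len(segment):]
    return visible

# A tqdm-style line that updates in place collapses to its final state:
raw = " 10%|#         | 1/10\r 50%|#####     | 5/10\r100%|##########| 10/10"
print(dedupe_carriage_returns(raw))  # -> 100%|##########| 10/10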
@Lechtr a reproducible example will be more helpful here, and then I can get CLI engineers involved. This seems like a specific tqdm+wandb related issue on Colab, as the tqdm progress bar and status line are truncated. But I was not able to reproduce it with a simpler example. For example I tried this:
import wandb
from tqdm import tqdm
run = wandb.init()
l = [10,100,1000]
pbar = enumerate(l)
pbar = tqdm(pbar, total=3)
for i, x in pbar:
    print(x)
This works as intended, but the output from train.py is truncated. Are you able to reproduce this with a simpler example so that it is easier to pinpoint the source?
@AyushExel Your script gives the following output:
0.10.24:
0%| | 0/3 [00:00<?, ?it/s]10
100
1000
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3293.09it/s]
0.10.23:
10
100
1000
100%|██████████| 3/3 [00:00<00:00, 5291.38it/s]
I assume it is not supposed to be different?
@Lechtr yes. That's the expected output. The progress bar is not getting truncated. But in train.py it gets truncated. We need a minimal example that reproduces this issue.
@Lechtr were you able to reproduce this bug with a simple example? If so, can you please share it here; it'll be helpful for debugging. If not, I can still get someone to look into it, but that will happen on Monday.
@AyushExel no sorry, I could not find a simple example to reproduce.
@AyushExel I think since it's reproducible in Colab you should probably notify the CLI team and point them to their changes highlighted by https://github.com/ultralytics/yolov5/issues/2685#issuecomment-812692320, as they may have introduced a bug in notebook environments.
@glenn-jocher @Lechtr I've filed a ticket to fix this. You can expect this to get resolved in the next release
@AyushExel yay thanks!
@AyushExel @Lechtr I confirm I can reproduce this, and also I should note that the truncated output is not logged to W&B either, so it's lost/invisible in both Colab and W&B. Also note print statements (white) work fine, logger.log() statements (red) are truncated. https://wandb.ai/glenn-jocher/YOLOv5/runs/23u3uq0x/logs?workspace=user-glenn-jocher
Yes. This should be addressed in the next release. Engineers are on it.
@glenn-jocher This will be fixed by this PR in wandb CLI -> https://github.com/wandb/client/pull/2111
@AyushExel should I remove the TODO and close this then? It looks like wandb/client#2111 is not merged yet though.
@glenn-jocher I'll just confirm this with someone.
@glenn-jocher The progress bar is missing on Colab even without wandb. I've tried it without running the wandb initialization cell and the progress bar is still missing.
Here's what Fariz, the CLI engineer working on the PR, had to say about this:
running it without wandb (or `WANDB_CONSOLE="off"`)? I'm sure the progress bar won't be printed in that case either.
We should be matching the behavior without wandb when wandb is initialized as well.
Now a bit more post-mortem on why the progress bar is visible when older versions of wandb are initialized but not with the current branch or when wandb is not initialized. TQDM prints the progress bar based on the window width returned by an ioctl system call. This would be zero for subprocesses spawned in Colab and would lead to a zero-width progress bar in such cases.
In older versions of wandb we fall back to 80 if the width is 0 (https://github.com/wandb/client/blob/master/wandb/sdk/lib/redirect.py#L573), which causes the progress bar to render.
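For illustration only, a minimal sketch of that mechanism (not wandb's or tqdm's actual code; the function name terminal_width is made up here): it reads the window width via the TIOCGWINSZ ioctl and falls back to 80 when the call fails or reports zero columns, which is the situation described above for Colab subprocesses.

import fcntl
import struct
import sys
import termios

def terminal_width(fallback: int = 80) -> int:
    # Query the window size via the TIOCGWINSZ ioctl; a Colab subprocess
    # may report 0 columns, in which case we fall back to a fixed width.
    try:
        packed = struct.pack("HHHH", 0, 0, 0, 0)
        rows, cols, _, _ = struct.unpack("HHHH", fcntl.ioctl(sys.stdout.fileno(), termios.TIOCGWINSZ, packed))
    except OSError:
        return fallback
    return cols if cols > 0 else fallback

print(terminal_width())  # prints 80 when the width is unavailable or zero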
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
@AyushExel I've been running into an issue that I've been seeing for most of the year. W&B runs with Docker do not produce correct output in the 'Logs' tab, as the tqdm progress bars are not handled correctly by W&B there. I couldn't find an open issue on this specifically, but I know we've discussed this before. Is there any solution to this, or one in the works? Let me know please, as this affects all of my runs (I train 100% in Docker) and complicates efforts to share runs, i.e. all of the v6.0 runs for our recent release are affected. https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq/logs?workspace
@glenn-jocher I can get someone to look into this if it is reproducible. I tried running on Docker and the outputs for me are fine https://wandb.ai/cayush/yoloV5/runs/2rs2bbqg/logs?workspace=user-cayush .
I think this problem comes from cloud terminals not handling the tqdm output correctly. I've also seen this happen a lot, but I think it's because the terminal breaks and produces these broken lines; wandb just captures whatever is in the terminal. If this happens due to wandb, then it's definitely a bug, but I don't think that's the case. I might be wrong though.
@AyushExel to reproduce you can use the same command I used to train the official models in Docker. I don't actually view any output; I redirect it to /dev/null in the background with no terminal window:
python train.py --data coco.yaml --batch 64 --weights '' --cfg yolov5n.yaml --epochs 300 --img 640 --device 0 > /dev/null 2>&1 &
Training usually proceeds fine for 3-4 epochs before this error occurs, and then the rest of training displays the error. This might also be related to the W&B memory leak error if all of this output is being transmitted to W&B.
@AyushExel I think I figured out what causes this. When training in Docker, exiting the container while leaving it running causes problems with console output for some reason:
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/datasets:/usr/src/datasets $t
python train.py
# wait a few epochs
CTRL+P+Q # exit container and leave it running
Not sure what to do about this; we can't leave consoles attached as they'll eventually disconnect at some point over a few days of training.
EDIT: possible workaround:
https://stackoverflow.com/questions/38786615/docker-number-of-lines-in-terminal-changing-inside-docker
Once inside the container: bash -c "stty cols $COLUMNS rows $LINES && bash". I tested this and it seems to work. Maybe I could embed this in YOLOv5 with if is_docker(): os.system(cmd).
EDIT2: nevermind, fix above does not work.
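For reference, a hypothetical sketch of that embedding idea (which, per the edit above, did not end up fixing the issue), assuming the existing is_docker() helper in utils/general.py; the fallback column/row values here are made up:

import os
from utils.general import is_docker  # existing YOLOv5 helper

if is_docker():
    # COLUMNS/LINES may be unset in a detached container, so fall back to fixed values
    cmd = f"stty cols {os.environ.get('COLUMNS', 80)} rows {os.environ.get('LINES', 24)}"
    os.system(cmd)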
@AyushExel updated screenshot proving that W&B is not simply copying console output; W&B is introducing errors into the Logs. This is a GCP instance console window vs. what shows up in the W&B Logs for the same training run. https://wandb.ai/glenn-jocher/YOLOv5/runs/11f90i7k/logs?workspace=user-glenn-jocher
This is reproducible on every single training run I do with Docker. Runs are never logged correctly by W&B in Logs.
@AyushExel I'm observing this same Logs output bug with W&B in Google Colab. So now I can confirm this appears in all of the environments that I train models in: Docker and Colab. I don't have any W&B runs that Log correctly.
https://wandb.ai/glenn-jocher/test_VOC_ar_thr/runs/bxwvs7by/logs?workspace=user-glenn-jocher
@glenn-jocher yes. I'll update you on this once I hear back from the team.
@glenn-jocher A fix related to this has been merged in master along with the network egress issue fix. We'll investigate this issue further if it still exists after the next release.
@AyushExel great, glad to hear! I'll keep this open for now; let me know when the fixes are in the latest pip package please.
@AyushExel do you know the status of this? I've got an open TODO tag here; it would be nice to remove it if the issue is resolved.
@glenn-jocher the network egress issue fix is verified. The ticket for the tqdm progress bars is still open as it's difficult to verify without a reproducible script. If someone encounters it again, I'll increase the urgency of the ticket by upvoting.
@AyushExel thanks for looking into this! The Logs bug is really easy to reproduce; I think basically any training will show it eventually. You can use the official Colab notebook for example and train VOC models, and the bug will appear after about an hour. Here's some code you can use to reproduce in Colab:
# 1. Setup
!git clone https://github.com/ultralytics/yolov5 # clone
%cd yolov5
%pip install -qr requirements.txt # install
import torch
import utils
display = utils.notebook_init() # checks
# 2. Weights & Biases
%pip install -q wandb
import wandb
wandb.login()
# 3. Reproduce
!python train.py --batch 64 --weights yolov5n.pt --data VOC.yaml --epochs 50 --img 512 --nosave --hyp hyp.finetune.yaml --project yolov5_wandb_reproduce --name yolov5n
Then you can observe the Colab output vs the W&B Log: https://wandb.ai/glenn-jocher/yolov5_wandb_reproduce/runs/3lvzhcu6/logs?workspace=user-glenn-jocher
@glenn-jocher Thanks. Yes, I was able to repro and handed this over, along with the log files, to the CLI engineers.
@glenn-jocher It's strange that it only starts to happen after a few minutes. It doesn't usually happen with coco128 for, say, 10 epochs.
@AyushExel yes it's strange! Actually I think it might correlate with the number of output lines (or tqdm updates) rather than with time, because when training VOC in Colab with --cache (on high-mem Colab instances) it shows up quickly, like in 15 min around 10 epochs in. If you don't cache it takes an hour to show up, which also seems to correspond to about 10 epochs.
Seems like COCO128 with 300 epochs is just too small for it to show up (not enough tqdm updates).
Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable, and we can not help you:
Run git fetch && git status -uno to check and git pull to update the repo.
If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we can not help you. You can generate these with utils.plot_results().
🐛 Bug
If W&B is activated, certain lines of output and progress bars are truncated. Affected lines include, for example, the cache progress and the training epoch time/progress bar.
To Reproduce (REQUIRED)
Use the YOLOv5 tutorial Colab. Execute:
Expected behavior
Expected output (with W&B off): caching:
training:
Actual output (with W&B on): caching:
training:
Environment
Google Colab YOLOv5 tutorial notebook. Also reproduced on my personal Colab with custom data.
Additional context
If W&B is deactivated or activated with !wandb disabled or !wandb enabled, this behaviour is turned off or on accordingly.