ultralytics / yolov5

YOLOv5 šŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Active W&B-integration truncates output lines #2685

Closed Lechtr closed 3 years ago

Lechtr commented 3 years ago

Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable and we cannot help you:

If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we cannot help you. You can generate these with utils.plot_results().

šŸ› Bug

If W&B is activated, certain output lines and progress bars are truncated, for example the cache progress and the training epoch time/progress bar.

To Reproduce (REQUIRED)

Use the YOLOv5 tutorial Colab. Execute:

Expected behavior

Expected output (with W&B off): caching:

train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 1281314.83it/s]
train: Caching images (0.1GB): 100% 128/128 [00:00<00:00, 203.89it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 613566.76it/s]
val: Caching images (0.1GB): 100% 128/128 [00:00<00:00, 147.05it/s]

training:

Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       1/2     6.65G   0.04431   0.06402     0.019    0.1273       166       640: 100% 8/8 [00:01<00:00,  5.20it/s]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100% 4/4 [00:01<00:00,  3.30it/s]
                 all         128         929       0.663       0.625       0.658       0.432

Actual output (with W&B on): caching:

train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 fou
train: Caching images (0.1GB): 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 128/128 [00:00<00:00, 202.00it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found
val: Caching images (0.1GB): 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 128/128 [00:00<00:00, 144.14it/s]

training:

Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       0/2     3.29G   0.04237   0.06416   0.02121    0.1277       183       640
               Class      Images      Labels           P           R      mAP@.5
                 all         128         929       0.641       0.637       0.661       0.431

Environment

Google Colab (YOLOv5 tutorial notebook). Also on my personal Colab with custom data.

Additional context

Deactivating/activating W&B with !wandb disabled or !wandb enabled turns this behaviour off/on accordingly.

github-actions[bot] commented 3 years ago

šŸ‘‹ Hello @Lechtr, thank you for your interest in šŸš€ YOLOv5! Please visit our ā­ļø Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a šŸ› Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ā“ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 3 years ago

@Lechtr thanks for spotting this bug! I repeated your steps and I see the same truncated output. I'm not sure what the cause might be, but perhaps our W&B expert @AyushExel can help.

glenn-jocher commented 3 years ago

@AyushExel it seems W&B is truncating output somehow in Colab. These orange areas are missing when logging is enabled.

Screenshot 2021-04-02 at 13 22 56
AyushExel commented 3 years ago

@glenn-jocher @Lechtr thanks for reporting. Looking into this now. I was able to reproduce this only on Colab. Can you also reproduce this on a local machine? For me, it works as expected on a Linux desktop. I'm trying to get to the root cause of the problem. I'll update soon.

AyushExel commented 3 years ago

Also, my first instinct was that this is a problem related to tqdm interacting with wandb, but I cannot reproduce it in any example other than the train.py script. A simpler reproducible example would be very helpful for the engineers. I tried this:

import wandb
from tqdm import tqdm

run = wandb.init()
values = [10, 100, 1000]
pbar = tqdm(enumerate(values), total=len(values))  # tqdm-wrapped iterator, as in train.py
for i, x in pbar:
    print(x)  # plain prints interleaved with the progress bar

But this works fine. A simpler example would be very helpful.

Lechtr commented 3 years ago

Could it be an issue related to a new version of wandb? That might explain why it doesn't work in Colab, where you have the newest version, but works on your local machine, where you might have an older version installed.

I don't have time now, but will test it later.

AyushExel commented 3 years ago

@Lechtr I have tested with the latest version of wandb on my local machine. It works fine. The problem seems to be only with Colab.

Lechtr commented 3 years ago

@AyushExel It is related to the latest wandb release, 0.10.24. If I try the previous version, 0.10.23, with %pip install -Iv wandb==0.10.23, it works fine. I will check the changes made in the new release; maybe I will find the problem.

Lechtr commented 3 years ago

@AyushExel In the changelog for 0.10.24 it says

Significant fixes to stdout/stderr console logging

so that seems to be the direction we should be looking in.

In particular, this commit https://github.com/wandb/client/commit/f2b191e6c5073abec9eb4d8efe5a26c8cdd35d1d made some changes to the file wandb/sdk/internal/file_stream.py that seem interesting:

The following class in there looks like a promising area to take a closer look at (line 72):

class CRDedupeFilePolicy(DefaultFilePolicy):
    """File stream policy that removes characters that would be erased by
    carriage returns.
    This is what a terminal does. We use it for console output to reduce the
    amount of data we need to send over the network (eg. for progress bars),
    while preserving the output's appearance in the web app.
    """

but I don't understand it at first glance; maybe you have more experience there.
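
For intuition, here is a rough sketch (my own, not wandb's actual code) of what carriage-return de-duplication means for console output like a tqdm bar: everything a terminal would have overwritten before the last \r is dropped, and only the final state of each line is kept.

# Rough sketch of carriage-return de-duplication (not wandb's implementation):
# a terminal overwrites everything before the last \r on a line, so only the
# text after it needs to be kept/sent, e.g. for tqdm progress-bar updates.
def dedupe_carriage_returns(chunk):
    deduped = []
    for line in chunk.split("\n"):
        # Keep only what would remain visible after the carriage returns.
        deduped.append(line.rsplit("\r", 1)[-1])
    return deduped


print(dedupe_carriage_returns(" 50%|█████     | 64/128\r100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 128/128"))
# -> ['100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 128/128']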

Might also be an idea to open an issue at the wandb-project.

AyushExel commented 3 years ago

@Lechtr a reproducible example would be more helpful here, and then I can get the CLI engineers involved. This seems like a specific tqdm+wandb-related issue on Colab, as the tqdm progress bar and status line are truncated. But I was not able to reproduce it with a simpler example. For example, I tried this:

import wandb
from tqdm import tqdm

run = wandb.init()
values = [10, 100, 1000]
pbar = tqdm(enumerate(values), total=len(values))  # tqdm-wrapped iterator, as in train.py
for i, x in pbar:
    print(x)  # plain prints interleaved with the progress bar

This works as intended, but the output from train.py is truncated. Are you able to reproduce this with a simpler example so that it is easier to pinpoint the source?

Lechtr commented 3 years ago

@AyushExel Your script gives the following output:

0.10.24:

  0%|                                                     | 0/3 [00:00<?, ?it/s]10
100
1000
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 3/3 [00:00<00:00, 3293.09it/s]

0.10.23:

10
100
1000
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 3/3 [00:00<00:00, 5291.38it/s]

I assume it is not supposed to be different?

AyushExel commented 3 years ago

@Lechtr yes. That's the expected output. The progress bar is not getting truncated. But in train.py it gets truncated. We need a minimal example that reproduces this issue.

AyushExel commented 3 years ago

@Lechtr were you able to reproduce this bug with a simple example? If so, please share it here; it'll be helpful for debugging. If not, I can get someone to look into it nevertheless, but that'll happen on Monday.

Lechtr commented 3 years ago

@AyushExel no sorry, I could not find a simple example to reproduce.

glenn-jocher commented 3 years ago

@AyushExel I think since it's reproducible in Colab you should probably notify the CLI team and point them to their changes highlighted by https://github.com/ultralytics/yolov5/issues/2685#issuecomment-812692320, as they may have introduced a bug in notebook environments.

AyushExel commented 3 years ago

@glenn-jocher @Lechtr I've filed a ticket to fix this. You can expect this to get resolved in the next release.

glenn-jocher commented 3 years ago

@AyushExel yay thanks!

glenn-jocher commented 3 years ago

@AyushExel @Lechtr I confirm I can reproduce this, and I should also note that the truncated output is not logged to W&B either, so it's lost/invisible in both Colab and W&B. Also note that print statements (white) work fine, while logger.log() statements (red) are truncated. https://wandb.ai/glenn-jocher/YOLOv5/runs/23u3uq0x/logs?workspace=user-glenn-jocher

Screenshot 2021-04-08 at 23 58 38
AyushExel commented 3 years ago

Yes. This should be addressed in the next release. Engineers are on it.

AyushExel commented 3 years ago

@glenn-jocher This will be fixed by this PR in wandb CLI -> https://github.com/wandb/client/pull/2111

glenn-jocher commented 3 years ago

@AyushExel should I remove the TODO and close this then? It looks like wandb/client#2111 is not merged yet though.

AyushExel commented 3 years ago

@glenn-jocher I'll just confirm this with someone.

AyushExel commented 3 years ago

@glenn-jocher The progress bar is missing on Colab even without wandb. I've tried it without running the wandb initialization cell and the progress bar is still missing.

Screenshot (104)

Here's what Fariz, the CLI engineer working on the PR, had to say about this:

Running it without wandb (or `WANDB_CONSOLE="off"`)? I'm sure the progress bar won't be printed in that case either. We should be matching the behavior without wandb when wandb is initialized as well.

Now a bit more post-mortem on why the progress bar is visible when older versions of wandb are initialized, but not with the current branch or when wandb is not initialized: tqdm prints the progress bar based on the window width returned by an ioctl system call. This would be zero for subprocesses spawned in Colab and would lead to a zero-width progress bar in such cases. In older versions of wandb we fall back to 80 if the width is 0 (https://github.com/wandb/client/blob/master/wandb/sdk/lib/redirect.py#L573), which causes the progress bar to render.
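
For illustration, here is a minimal sketch (my own code, Unix-only; not tqdm's or wandb's exact implementation) of the width lookup being described: the ioctl reports zero columns for a subprocess spawned in Colab, and only a fallback like the old 80-column default keeps the bar visible.

# Minimal sketch (not tqdm's or wandb's actual code) of the width lookup
# described above. tqdm sizes its bar from the terminal window width; for a
# subprocess spawned in Colab the ioctl reports 0 columns, so without a
# fallback the bar has zero width and effectively disappears.
import fcntl
import struct
import sys
import termios


def terminal_width(stream=sys.stdout, fallback=80):
    try:
        # TIOCGWINSZ fills a struct winsize: (rows, cols, xpixels, ypixels).
        packed = fcntl.ioctl(stream.fileno(), termios.TIOCGWINSZ,
                             struct.pack("HHHH", 0, 0, 0, 0))
        _, cols, _, _ = struct.unpack("HHHH", packed)
    except OSError:
        cols = 0
    # Older wandb versions fell back to 80 when the reported width was 0,
    # which is why the bar still rendered with wandb 0.10.23.
    return cols if cols > 0 else fallback


print(terminal_width())
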
github-actions[bot] commented 3 years ago

šŸ‘‹ Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 šŸš€ and Vision AI ā­!

glenn-jocher commented 3 years ago

@AyushExel I've been running into an issue which I've been seeing for most of the year. W&B runs with Docker do not produce correct output in the 'Logs' tab, as the tqdm progress bars are not handled correctly by W&B there. I couldn't find an open issue on this specifically, but I know we've discussed this before. Is there any solution to this, or one in the works? Please let me know, as this affects all of my runs (I train 100% in Docker) and complicates efforts to share runs, i.e. all of the v6.0 runs for our recent release are affected. https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq/logs?workspace

Screen Shot 2021-10-25 at 4 53 43 PM
AyushExel commented 3 years ago

@glenn-jocher I can get someone to look into this if it is reproducible. I tried running on Docker and the outputs for me are fine: https://wandb.ai/cayush/yoloV5/runs/2rs2bbqg/logs?workspace=user-cayush

I think this problem is due to cloud terminals not handling the tqdm output correctly. I've also seen this happen a lot, but I think it's because the terminal breaks and produces these broken lines; wandb just captures whatever is in the terminal. If this happens due to wandb, then it's definitely a bug, but I don't think that's the case. I might be wrong though.

glenn-jocher commented 3 years ago

@AyushExel to reproduce, you can use the same command I used to train the official models in Docker. I don't actually view any output; I redirect to /dev/null in the background with no terminal window:

python train.py --data coco.yaml --batch 64 --weights '' --cfg yolov5n.yaml --epochs 300 --img 640 --device 0  > /dev/null 2>&1 &

Training usually proceeds fine for 3-4 epochs before this error occurs, and then the rest of training displays the error. This might also be related to the W&B memory leak issue if all of this output is being transmitted to W&B.

glenn-jocher commented 3 years ago

@AyushExel I think I figured out what causes this. When training in Docker, exiting the container while leaving it running causes problems with console output for some reason:

t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/datasets:/usr/src/datasets $t
python train.py
# wait a few epochs
CTRL+P+Q  # exit container and leave it running

Not sure what to do about this; we can't leave consoles attached, as they'll eventually disconnect at some point over a few days of training.

EDIT: possible workaround: https://stackoverflow.com/questions/38786615/docker-number-of-lines-in-terminal-changing-inside-docker. Once inside the container: bash -c "stty cols $COLUMNS rows $LINES && bash". I tested this and it seems to work. Maybe I could embed this in YOLOv5 with if is_docker(): os.system(cmd).
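
A rough sketch of how that embedding idea might look (hypothetical: the is_docker() below is a simplified stand-in for YOLOv5's own helper, and the hard-coded 120x40 geometry stands in for the host's $COLUMNS/$LINES; note EDIT2 below, which reports this workaround didn't hold up):

# Hypothetical sketch of the embedding idea above; not the actual YOLOv5 code,
# and EDIT2 below notes the workaround did not ultimately work.
import os
from pathlib import Path


def is_docker():
    # Simplified container check via the /.dockerenv marker file;
    # a stand-in for YOLOv5's own is_docker() helper.
    return Path("/.dockerenv").exists()


if is_docker():
    # Re-assert a fixed terminal geometry so tqdm sees a non-zero width.
    os.system("stty cols 120 rows 40")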

EDIT2: nevermind, fix above does not work.

Screen Shot 2021-10-27 at 4 27 41 PM
glenn-jocher commented 3 years ago

@AyushExel updated screenshot proving that W&B is not simply copying console output; W&B is introducing errors into the Logs. This is a GCP instance console window vs. what shows up in the W&B logs for the same training run. https://wandb.ai/glenn-jocher/YOLOv5/runs/11f90i7k/logs?workspace=user-glenn-jocher

This is reproducible on every single training run I do with Docker. Runs are never logged correctly by W&B in Logs.

Screenshot 2021-11-10 at 16 03 44
glenn-jocher commented 3 years ago

@AyushExel I'm observing this same Logs output bug with W&B in Google Colab. So now I can confirm this appears in all of the environments that I train models in: Docker and Colab. I don't have any W&B runs that Log correctly.

https://wandb.ai/glenn-jocher/test_VOC_ar_thr/runs/bxwvs7by/logs?workspace=user-glenn-jocher

Screenshot 2021-11-12 at 13 27 10
AyushExel commented 3 years ago

@glenn-jocher yes. I'll update you on this once I hear back from the team.

AyushExel commented 3 years ago

@glenn-jocher A fix related to this has been merged into master along with the network egress issue fix. We'll investigate this issue further if it still exists after the next release.

glenn-jocher commented 3 years ago

@AyushExel great, glad to hear it! I'll keep this open for now; let me know when the fixes are in the latest pip package, please.

glenn-jocher commented 2 years ago

@glenn-jocher A fix related to this has been merged into master along with the network egress issue fix. We'll investigate this issue further if it still exists after the next release.

@AyushExel do you know the status of this? I've got an open TODO tag here, would be nice to remove if issue is resolved.

AyushExel commented 2 years ago

@glenn-jocher the network egress issue fix is verified. The ticket for the tqdm progress bars is still open as it's difficult to verify without a reproducible script. If someone encounters it again, I'll increase the urgency of the ticket by upvoting.

glenn-jocher commented 2 years ago

@AyushExel thanks for looking into this! The Logs bug is really easy to reproduce; I think basically any training will show it eventually. You can use the official Colab notebook, for example, and train VOC models, and the bug will appear after about an hour. Here's some code you can use to reproduce it in Colab:

# 1. Setup
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install

import torch
from yolov5 import utils
display = utils.notebook_init()  # checks

# 2. Weights & Biases
%pip install -q wandb
import wandb
wandb.login()

# 3. Reproduce
!python train.py --batch 64 --weights yolov5n.pt --data VOC.yaml --epochs 50 --img 512 --nosave --hyp hyp.finetune.yaml --project yolov5_wandb_reproduce --name yolov5n

Then you can observe the Colab output vs the W&B Log: https://wandb.ai/glenn-jocher/yolov5_wandb_reproduce/runs/3lvzhcu6/logs?workspace=user-glenn-jocher

Screen Shot 2022-01-06 at 6 54 57 PM
AyushExel commented 2 years ago

@glenn-jocher Thanks. Yes, I was able to repro and handed this over with the log files to the CLI engineers.

AyushExel commented 2 years ago

@glenn-jocher It's strange that it only starts to happen after a few minutes. Like, it doesn't usually happen with coco128 for, say, 10 epochs.

glenn-jocher commented 2 years ago

@AyushExel yes, it's strange! Actually I think it might correlate with the number of output lines (or tqdm updates) rather than with time, because when training VOC in Colab with --cache (on high-mem Colab instances) it shows up quickly, like 15 minutes in, around 10 epochs. If you don't cache, it takes an hour to show up, which also seems to correspond to about 10 epochs.

Seems like COCO128 with 300 epochs is just too small for it to show up (not enough tqdm updates).