iterative / dvclive

📈 Log and track ML metrics, parameters, models with Git and/or DVC
https://dvc.org/doc/dvclive
Apache License 2.0

`Live.log_artifact` raises `ValueError` by logging to closed `stderr` when Tensorflow is imported first #763

Open aazuspan opened 8 months ago

aazuspan commented 8 months ago

I'm running into a strange bug whenever `Live.log_artifact` triggers a warning or error while tensorflow is imported. The snippet below should reproduce it (`tensorflow==2.13.0`, `dvc==3.38.1`, and `dvclive==3.5.1`) as long as there are uncommitted changes in the repo (or some other way to trigger a warning/error, like a missing artifact path).

```python
import tensorflow as tf
from dvclive import Live

with Live() as live:
    live.log_artifact("file.csv")
```

raises

ValueError: I/O operation on closed file
Full traceback (condensed to the key line of each frame):

```python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /usr/local/lib/python3.8/dist-packages/dvc/logger.py:156, in LoggerHandler.emit(self, record)
--> 156 Tqdm.write(msg, file=self.stream, end=getattr(self, "terminator", "\n"))
File /usr/local/lib/python3.8/dist-packages/tqdm/std.py:725, in tqdm.write(cls, s, file, end, nolock)
--> 725 fp.write(s)
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:47, in StreamWrapper.write(self, text)
-->  47 self.__convertor.write(text)
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:177, in AnsiToWin32.write(self, text)
--> 177 self.write_and_convert(text)
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:205, in AnsiToWin32.write_and_convert(self, text)
--> 205 self.write_plain_text(text, cursor, len(text))
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:210, in AnsiToWin32.write_plain_text(self, text, start, end)
--> 210 self.wrapped.write(text[start:end])
File /usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py:633, in OutStream.write(self, string)
--> 633 raise ValueError(msg)

ValueError: I/O operation on closed file

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[1], line 5
----> 5 live.log_artifact("file.csv")
File /usr/local/lib/python3.8/dist-packages/dvclive/live.py:628, in Live.__exit__(self, exc_type, exc_val, exc_tb)
--> 628 self.end()
File /usr/local/lib/python3.8/dist-packages/dvclive/live.py:605, in Live.end(self)
--> 605 self.save_dvc_exp()
File /usr/local/lib/python3.8/dist-packages/dvclive/utils.py:183, in catch_and_warn..decorator..wrapper(*args, **kwargs)
--> 183 return func(*args, **kwargs)
File /usr/local/lib/python3.8/dist-packages/dvclive/live.py:635, in Live.save_dvc_exp(self)
--> 635 self._experiment_rev = self._dvc_repo.experiments.save(
File /usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/__init__.py:376, in Experiments.save(self, *args, **kwargs)
--> 376 return save(self.repo, *args, **kwargs)
File /usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/save.py:33, in save(repo, name, force, include_untracked, message)
-->  33 save_result = executor.save(
File /usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/executor/base.py:311, in BaseExecutor.save(cls, info, force, include_untracked, message)
--> 311 logger.warning(
            "The following untracked files were present in "
            "the workspace before saving but "
            "will not be included in the experiment commit:\n"
            "\t%s",
            ", ".join(untracked),
        )
File /usr/lib/python3.8/logging/__init__.py:1458, in Logger.warning(self, msg, *args, **kwargs)
-> 1458 self._log(WARNING, msg, args, **kwargs)
File /usr/lib/python3.8/logging/__init__.py:1589, in Logger._log(self, level, msg, args, exc_info, extra, stack_info, stacklevel)
-> 1589 self.handle(record)
File /usr/lib/python3.8/logging/__init__.py:1599, in Logger.handle(self, record)
-> 1599 self.callHandlers(record)
File /usr/lib/python3.8/logging/__init__.py:1661, in Logger.callHandlers(self, record)
-> 1661 hdlr.handle(record)
File /usr/lib/python3.8/logging/__init__.py:954, in Handler.handle(self, record)
-->  954 self.emit(record)
File /usr/local/lib/python3.8/dist-packages/dvc/logger.py:161, in LoggerHandler.emit(self, record)
--> 161 self.handleError(record)
File /usr/local/lib/python3.8/dist-packages/dvc/logger.py:136, in LoggerHandler.handleError(self, record)
--> 136 super().handleError(record)
File /usr/lib/python3.8/logging/__init__.py:1006, in Handler.handleError(self, record)
-> 1006 sys.stderr.write('--- Logging error ---\n')
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:47, in StreamWrapper.write(self, text)
-->  47 self.__convertor.write(text)
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:177, in AnsiToWin32.write(self, text)
--> 177 self.write_and_convert(text)
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:205, in AnsiToWin32.write_and_convert(self, text)
--> 205 self.write_plain_text(text, cursor, len(text))
File /usr/local/lib/python3.8/dist-packages/colorama/ansitowin32.py:210, in AnsiToWin32.write_plain_text(self, text, start, end)
--> 210 self.wrapped.write(text[start:end])
File /usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py:633, in OutStream.write(self, string)
--> 633 raise ValueError(msg)

ValueError: I/O operation on closed file
```
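
For context on the final error: the `ValueError` at the bottom of both tracebacks is just what a file-like object raises when something writes to it after it has been closed. A generic illustration, using `io.StringIO` as a stand-in for the actual ipykernel stream:

```python
# Generic illustration of the underlying ValueError; io.StringIO stands in for
# the ipykernel OutStream that sys.stderr points at in a notebook session.
import io

stream = io.StringIO()
stream.close()

try:
    stream.write("some warning text")
except ValueError as exc:
    print(exc)  # I/O operation on closed file
```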

From what I can tell from the traceback, DVC tries to log a warning about uncommitted changes, which gets written to the closed stderr stream, causing the I/O error. The same error occurs for other errors or warnings raised by dvclive. I've listed a few hints below that might help diagnose the issue, but I think I've reached the limit of what I'm able to figure out.

  1. I've only encountered this error when tensorflow is imported before dvclive. When the order is reversed, sys.stderr remains open and no error is encountered.
  2. I found #35 that references keras-team/keras#3403. This looks closely related, but all I can tell is that it seems to be a workaround to prevent sys.stderr getting closed during testing.
  3. If you watch sys.stderr, you can see it gets closed as soon as gto is imported by log_artifact (a minimal check is sketched after this list). Disabling that import and the related calls resolves the issue. Looking at the init for the gto.log module shows that it sets some logging config. https://github.com/iterative/dvclive/blob/5c650ef55a0bc74b7025a861f6127ba217a07e63/src/dvclive/live.py#L498-L499
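
Here is roughly the check I mean in hint 3, as a minimal sketch. It assumes a notebook/ipykernel session with tensorflow imported first, since that's the only setup where I've seen the problem:

```python
# Sketch of the hint-3 check: inspect sys.stderr around the gto import.
# Assumes a notebook/ipykernel session with tensorflow imported first,
# matching the failing case described above.
import sys

import tensorflow as tf  # noqa: F401  (imported first, as in the repro)

print("stderr closed before gto import?", getattr(sys.stderr, "closed", None))

import gto  # noqa: F401  (the import that log_artifact triggers)

print("stderr closed after gto import?", getattr(sys.stderr, "closed", None))
```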

It's possible this is ultimately a TensorFlow/Keras issue rather than a DVC issue, but I'm not sure how to reproduce it without DVC.

Any help would be appreciated, thanks!

dberenbaum commented 8 months ago

Just want to confirm how severe this is for you. It sounds like you can work around the problem and are reporting it mainly to let us know. Is that right?

aazuspan commented 8 months ago

That's right - importing dvclive before tensorflow is an easy workaround, so this doesn't break my usage.
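
For reference, the workaround is just flipping the import order from the snippet in the OP, something like:

```python
# Workaround sketch: importing dvclive before tensorflow avoids the
# closed-stderr ValueError described above.
from dvclive import Live  # import dvclive first

import tensorflow as tf   # then tensorflow

with Live() as live:
    live.log_artifact("file.csv")
```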

dberenbaum commented 7 months ago

I'm not able to reproduce this so far. If I try the code you showed with no `file.csv`, I only get the warning `WARNING:dvclive:Error in cache: output 'file.csv' does not exist`. No error is raised.

aazuspan commented 7 months ago

I didn't mention in the OP, but I've only encountered the error in notebooks. I'm able to reproduce it consistently in Colab: https://colab.research.google.com/drive/1aEkYJdQ6cYiu2ICFppg8WrJwr0y6-SG4?usp=sharing

In this case, I get the warning because there are no commits, which triggers the ValueError.
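
Roughly what the Colab notebook does, as a sketch (not the exact cells; it needs to run in a notebook/ipykernel session, in a fresh repo with no commits):

```python
# Rough sketch of the notebook repro: fresh git + dvc repo with no commits,
# tensorflow imported before dvclive. Only reproduces in a notebook session.
import subprocess

subprocess.run(["git", "init", "-q", "."], check=True)
subprocess.run(["dvc", "init", "-q"], check=True)

import tensorflow as tf
from dvclive import Live

with Live() as live:
    live.log_artifact("file.csv")  # the warning emitted here triggers the ValueError
```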

Thanks for looking into it!

dberenbaum commented 7 months ago

Looks like it may be related to https://github.com/iterative/dvc/issues/8387

dberenbaum commented 7 months ago

Confirmed that dropping this line fixes the issue:

https://github.com/iterative/dvc/blob/b46bd9ca654f5e2dbeeb31f4eb6acaf80a7fde0a/dvc/logger.py#L182

@skshetry Looks like you were testing this in https://github.com/iterative/dvc/pull/8985. Do you have any idea what's wrong or how to handle it?