wandb / weave

Weave is a toolkit for developing AI-powered applications, built by Weights & Biases.
https://wandb.me/weave
Apache License 2.0

weave.publish does not complete the call #2025

Open joanvelja opened 1 month ago

joanvelja commented 1 month ago

Hi all, the following issue arises when I try to publish a dataset from my VM cluster provider: weave.publish never returns and stalls my pipeline. I have to interrupt from the keyboard to break out of the sleep call.

Any clue?

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Input In [53], in <cell line: 3>()
      1 data_name = f"{hparams['enc']['model_name'].split('/')[-1]}_{hparams['dataset']['name']}_{hparams['scientist']}_{START_EXP_TIME}"
      2 weave_data = weave_dataset(name=data_name, rows=completion_data)
----> 3 weave.publish(weave_data)

File /usr/local/lib/python3.9/dist-packages/weave/api.py:213, in publish(obj, name)
    210 else:
    211     save_name = obj.__class__.__name__
--> 213 ref = client._save_object(obj, save_name, "latest")
    215 if isinstance(ref, _weave_client.ObjectRef):
    216     url = urls.object_version_path(
    217         ref.entity,
    218         ref.project,
    219         ref.name,
    220         ref.digest,
    221     )

File /usr/local/lib/python3.9/dist-packages/weave/trace_sentry.py:211, in Sentry.watch.<locals>.watch_dec.<locals>.wrapper(*args, **kwargs)
    208 @functools.wraps(func)
    209 def wrapper(*args: Any, **kwargs: Any) -> Any:
    210     try:
--> 211         return func(*args, **kwargs)
    212     except Exception as e:
    213         self.exception(e)

File /usr/local/lib/python3.9/dist-packages/weave/weave_client.py:734, in WeaveClient._save_object(self, val, name, branch)
    732 @trace_sentry.global_trace_sentry.watch()
    733 def _save_object(self, val: Any, name: str, branch: str = "latest") -> ObjectRef:
--> 734     self._save_nested_objects(val, name=name)
    735     return self._save_object_basic(val, name, branch)

File /usr/local/lib/python3.9/dist-packages/weave/weave_client.py:778, in WeaveClient._save_nested_objects(self, obj, name)
    776 obj_rec = pydantic_object_record(obj)
    777 for v in obj_rec.__dict__.values():
--> 778     self._save_nested_objects(v)
    779 ref = self._save_object_basic(obj_rec, name or get_obj_name(obj_rec))
    780 obj.__dict__["ref"] = ref

File /usr/local/lib/python3.9/dist-packages/weave/weave_client.py:788, in WeaveClient._save_nested_objects(self, obj, name)
    786     obj.__dict__["ref"] = ref
    787 elif isinstance(obj, Table):
--> 788     table_ref = self._save_table(obj)
    789     obj.ref = table_ref
    790 elif isinstance_namedtuple(obj):

File /usr/local/lib/python3.9/dist-packages/weave/trace_sentry.py:211, in Sentry.watch.<locals>.watch_dec.<locals>.wrapper(*args, **kwargs)
    208 @functools.wraps(func)
    209 def wrapper(*args: Any, **kwargs: Any) -> Any:
    210     try:
--> 211         return func(*args, **kwargs)
    212     except Exception as e:
    213         self.exception(e)

File /usr/local/lib/python3.9/dist-packages/weave/weave_client.py:804, in WeaveClient._save_table(self, table)
    802 @trace_sentry.global_trace_sentry.watch()
    803 def _save_table(self, table: Table) -> TableRef:
--> 804     response = self.server.table_create(
    805         TableCreateReq(
    806             table=TableSchemaForInsert(
    807                 project_id=self._project_id(), rows=table.rows
    808             )
    809         )
    810     )
    811     return TableRef(
    812         entity=self.entity, project=self.project, digest=response.digest
    813     )

File /usr/local/lib/python3.9/dist-packages/weave/trace_server/remote_http_trace_server.py:362, in RemoteHTTPTraceServer.table_create(self, req)
    359 def table_create(
    360     self, req: t.Union[tsi.TableCreateReq, t.Dict[str, t.Any]]
    361 ) -> tsi.TableCreateRes:
--> 362     return self._generic_request(
    363         "/table/create", req, tsi.TableCreateReq, tsi.TableCreateRes
    364     )

File /usr/local/lib/python3.9/dist-packages/weave/trace_server/remote_http_trace_server.py:214, in RemoteHTTPTraceServer._generic_request(self, url, req, req_model, res_model)
    212 if isinstance(req, dict):
    213     req = req_model.model_validate(req)
--> 214 r = self._generic_request_executor(url, req)
    215 return res_model.model_validate(r.json())

File /usr/local/lib/python3.9/dist-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
    334 copy = self.copy()
    335 wrapped_f.statistics = copy.statistics  # type: ignore[attr-defined]
--> 336 return copy(f, *args, **kw)

File /usr/local/lib/python3.9/dist-packages/tenacity/__init__.py:485, in Retrying.__call__(self, fn, *args, **kwargs)
    483 elif isinstance(do, DoSleep):
    484     retry_state.prepare_for_next_attempt()
--> 485     self.sleep(do)
    486 else:
    487     return do

File /usr/local/lib/python3.9/dist-packages/tenacity/nap.py:31, in sleep(seconds)
     25 def sleep(seconds: float) -> None:
     26     """
     27     Sleep strategy that delays execution for a given number of seconds.
     28 
     29     This is the default strategy, and may be mocked out for unit testing.
     30     """
---> 31     time.sleep(seconds)

jamie-rasmussen commented 1 month ago

Hi Joan, I'm sorry you're experiencing difficulties. For robustness against even extended outages, we retry certain operations for up to 36 hours.
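For context, the bottom frames of your traceback (tenacity's Retrying.__call__ and nap.sleep) are exactly this retry loop: the request fails, tenacity sleeps, then tries again until an overall deadline. Here is a simplified pure-Python sketch of the pattern; the function and parameter names are illustrative, not weave's actual code:

```python
import time

def retry_with_backoff(fn, max_elapsed=10.0, base_delay=0.5, max_delay=4.0):
    """Call fn until it succeeds or max_elapsed seconds have passed.

    A simplified, illustrative stand-in for tenacity's Retrying loop seen
    in the traceback: on each failure it sleeps, then tries again.
    """
    start = time.monotonic()
    delay = base_delay
    while True:
        try:
            return fn()
        except Exception:
            if time.monotonic() - start >= max_elapsed:
                raise  # give up once the overall deadline has passed
            time.sleep(delay)  # <- the frame where your KeyboardInterrupt landed
            delay = min(delay * 2, max_delay)  # exponential backoff, capped

# A flaky call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = retry_with_backoff(flaky, max_elapsed=5.0, base_delay=0.01)
print(result)  # → ok
```

With a 36-hour deadline, a persistently failing request means the process spends nearly all its time inside time.sleep, which is why the KeyboardInterrupt surfaces in tenacity/nap.py rather than in weave's own code.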

I'm not sure what the root cause is here. One option would be to edit your script to turn on debug logging:

import logging
import sys

# Send DEBUG-level logs from all libraries (including weave) to stdout.
logging.basicConfig(
    level=logging.DEBUG,
    stream=sys.stdout,
    format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
)

If you are on the latest released weave package, 0.50.12, you could alternatively set the environment variable WEAVE_DEBUG_HTTP to "1" (e.g. os.environ["WEAVE_DEBUG_HTTP"] = "1" in Python), and it will log each HTTP request to our trace server backend to stdout. This could give us some clues about which error codes you're getting and, hopefully, why.
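For example, at the top of the script; setting the variable before importing weave is my assumption about when the flag is read, and setting it first is the safe choice either way:

```python
import os

# Enable HTTP request logging to stdout for weave's trace server client.
# Set this before importing weave so the flag is seen at import time
# (an assumption; setting it early is harmless in any case).
os.environ["WEAVE_DEBUG_HTTP"] = "1"

# import weave
# weave.publish(weave_data)  # each HTTP request is now logged to stdout
```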