neptune-ai / neptune-client

📘 The experiment tracker for foundation model training
https://neptune.ai
Apache License 2.0

BUG: ClientHttpError during training #751

Closed lucmos closed 2 years ago

lucmos commented 3 years ago

Describe the bug

Sometimes during training, a ClientHttpError is raised.

Reproduction

I am running a minimal working example on MNIST with the new Lightning integration. I can reproduce this on two different machines.

Traceback

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 335, in unmarshal_response
    incoming_response.swagger_result = unmarshal_response_inner(  # type: ignore
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 370, in unmarshal_response_inner
    response_spec = get_response_spec(status_code=response.status_code, op=op)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado_core/response.py", line 157, in get_response_spec
    raise MatchingResponseNotFound(
bravado_core.exception.MatchingResponseNotFound: Response specification matching http status_code 400 not found for operation Operation(executeOperations). Either add a response specification for the status_code or use a `default` response.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
    return func(*args, **kwargs)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 473, in _execute_operations
    result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
    unmarshal_response(
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 344, in unmarshal_response
    six.reraise(
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/six.py", line 718, in reraise
    raise value.with_traceback(tb)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 335, in unmarshal_response
    incoming_response.swagger_result = unmarshal_response_inner(  # type: ignore
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado/http_future.py", line 370, in unmarshal_response_inner
    response_spec = get_response_spec(status_code=response.status_code, op=op)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/bravado_core/response.py", line 157, in get_response_spec
    raise MatchingResponseNotFound(
bravado.exception.HTTPBadRequest: 400 : Response specification matching http status_code 400 not found for operation Operation(executeOperations). Either add a response specification for the status_code or use a `default` response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
    self.work()
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
    self.process_batch(batch, version)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
    result = func(self_, *args, **kwargs)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
    result = self._processor._backend.execute_operations(self._processor._run_id, batch)
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 363, in execute_operations
    errors.extend(self._execute_operations(run_id, other_operations))
  File "/home/luca/miniconda3/envs/lightning-project-template/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 86, in wrapper
    raise ClientHttpError(e.status_code, e.response.text) from e
neptune.new.exceptions.ClientHttpError: 

----ClientHttpError-----------------------------------------------------------------------

Neptune server returned status 400.

Server response was:
{"code":400,"errorType":"MALFORMED_JSON_REQUEST","title":"Malformed JSON request: JSON parse error: Numeric value (2470129626) out of range of int (-2147483648 - 2147483647); nested exception is com.fasterxml.jackson.databind.JsonMappingException: Numeric value (2470129626) out of range of int (-2147483648 - 2147483647)\n at [Source: (PushbackInputStream); line: 1, column: 16817] (through reference chain: java.util.ArrayList[54]->ml.neptune.leaderboard.api.model.operation.OperationDTO[\"assignInt\"]->ml.neptune.leaderboard.api.model.operation.AssignIntDTO[\"value\"])"}

Verify the correctness of your call or contact Neptune support.

Need help?-> https://docs.neptune.ai/getting-started/getting-help
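The server response above already names the rejected value, so it can be extracted and searched for among the logged parameters. This is a rough sketch only: the helper name `find_out_of_range_value` is hypothetical, the response text is a shortened copy of the body shown above, and the regular expression matches the wording of this particular message rather than any documented error format.

```python
import json
import re

# Shortened copy of the server response shown in the traceback above.
response_text = json.dumps({
    "code": 400,
    "errorType": "MALFORMED_JSON_REQUEST",
    "title": "Malformed JSON request: JSON parse error: Numeric value "
             "(2470129626) out of range of int (-2147483648 - 2147483647)",
})


def find_out_of_range_value(text):
    """Return the rejected integer from the error body, or None if absent."""
    title = json.loads(text).get("title", "")
    match = re.search(r"Numeric value \((-?\d+)\) out of range of int", title)
    return int(match.group(1)) if match else None


offending = find_out_of_range_value(response_text)
print(offending)
```

Searching the training configuration for the printed number is then usually enough to locate the field that was logged.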

Environment

The output of pip list:

❯ pip list
Package                           Version              Location
--------------------------------- -------------------- ----------------------------------------------------------------
absl-py                           1.0.0
aiohttp                           3.8.1
aiosignal                         1.2.0
antlr4-python3-runtime            4.8
async-timeout                     4.0.1
attrs                             21.2.0
azure-core                        1.20.1
azure-storage-blob                12.9.0
backports.entry-points-selectable 1.1.1
black                             21.10b0
boto3                             1.20.5
botocore                          1.23.5
bravado                           11.0.3
bravado-core                      5.17.0
cachetools                        4.2.4
certifi                           2021.10.8
cffi                              1.15.0
cfgv                              3.3.1
charset-normalizer                2.0.7
click                             8.0.3
cloudpathlib                      0.6.2
cloudpickle                       2.0.0
coverage                          6.1.2
cryptography                      35.0.0
dacite                            1.6.0
dill                              0.3.4
distlib                           0.3.3
filelock                          3.3.2
flake8                            4.0.1
frozenlist                        1.2.0
fsspec                            2021.11.0
future                            0.18.2
ghp-import                        2.0.2
gitdb                             4.0.9
GitPython                         3.1.24
google-api-core                   2.2.2
google-auth                       2.3.3
google-auth-oauthlib              0.4.6
google-cloud-core                 2.2.1
google-cloud-storage              1.42.3
google-crc32c                     1.3.0
google-resumable-media            2.1.0
googleapis-common-protos          1.53.0
grpcio                            1.41.1
identify                          2.3.5
idna                              3.3
importlib-metadata                4.8.2
iniconfig                         1.1.1
isodate                           0.6.0
isort                             5.10.1
Jinja2                            3.0.3
jmespath                          0.10.0
jsonpointer                       2.2
jsonref                           0.2
jsonschema                        3.2.0
lightning-project-template        0.1.dev1+gc63e8fd    /home/luca/Projects/CookieTesting/lightning-project-template/src
Markdown                          3.3.4
MarkupSafe                        2.0.1
mccabe                            0.6.1
mergedeep                         1.3.4
mkapi                             1.0.14
mkdocs                            1.2.3
mkdocs-material                   7.3.6
mkdocs-material-extensions        1.0.3
monotonic                         1.6
msgpack                           1.0.2
msrest                            0.6.21
multidict                         5.2.0
mypy-extensions                   0.4.3
natsort                           8.0.0
neptune-client                    0.13.1
nodeenv                           1.6.0
numpy                             1.21.4
oauthlib                          3.1.1
olefile                           0.46
omegaconf                         2.1.1
packaging                         21.2
pandas                            1.3.4
pathspec                          0.9.0
Pillow                            8.4.0
pip                               21.2.4
platformdirs                      2.4.0
pluggy                            1.0.0
pre-commit                        2.15.0
prime-config                      0.9.3.dev24+g2885489
prime-pack                        0.3.dev61+g3c35037
prime-utils                       1.0.0
protobuf                          3.19.1
psutil                            5.8.0
py                                1.11.0
pyasn1                            0.4.8
pyasn1-modules                    0.2.8
pycodestyle                       2.8.0
pycparser                         2.21
pyDeprecate                       0.3.1
pyflakes                          2.4.0
Pygments                          2.10.0
PyJWT                             2.3.0
pymdown-extensions                9.1
pyparsing                         2.4.7
pyrsistent                        0.18.0
pytest                            6.2.5
pytest-cov                        3.0.0
python-dateutil                   2.8.2
python-dotenv                     0.19.2
pytorch-lightning                 1.5.1
pytz                              2021.3
PyYAML                            6.0
pyyaml_env_tag                    0.1
regex                             2021.11.10
requests                          2.26.0
requests-oauthlib                 1.3.0
rfc3987                           1.3.8
rsa                               4.7.2
s3transfer                        0.5.0
setuptools                        58.0.4
simplejson                        3.17.5
six                               1.16.0
smmap                             5.0.0
strict-rfc3339                    0.7
swagger-spec-validator            2.7.4
tensorboard                       2.7.0
tensorboard-data-server           0.6.1
tensorboard-plugin-wit            1.8.0
toml                              0.10.2
tomli                             1.2.2
torch                             1.9.0
torchmetrics                      0.6.0
torchvision                       0.10.0
tqdm                              4.62.3
typing-extensions                 3.10.0.2
urllib3                           1.26.7
virtualenv                        20.10.0
watchdog                          2.1.6
webcolors                         1.11.1
websocket-client                  1.2.1
Werkzeug                          2.0.2
wheel                             0.37.0
yarl                              1.7.2
zipp                              3.6.0

The operating system you're using: Ubuntu

The output of python --version: Python 3.8.12

kamil-kaczmarek commented 2 years ago

Hey @lucmos,

Let me take a closer look with the engineering team. We will get back to you here with more info.

Blaizzy commented 2 years ago

Hey @lucmos,

Prince Canuma here, Data Scientist at Neptune.ai.

This issue occurs because Neptune currently supports only signed 32-bit integers (-2147483648 to 2147483647). Somewhere in your code you are probably logging an integer outside that range, such as a 64-bit value.

Example:

import neptune.new as neptune

run = neptune.init(...)
run["test"] = 9223372036854775807  # ❌ outside the 32-bit integer range

This will generate the error you are reporting:

----ClientHttpError-----------------------------------------------------------------------

Neptune server returned status 400.

Server response was:
{"code":400,"errorType":"MALFORMED_JSON_REQUEST","title":"Malformed JSON request: JSON parse error: Numeric value (9223372036854775807) out of range of int (-2147483648 - 2147483647);

Possible solutions/workarounds for logging your 64-bit integer:

run = neptune.init(...)
run["test"] = str(9223372036854775807)  # ✅ logged as a string

Docs: https://docs.neptune.ai/api-reference/field-types#string
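The string workaround can be applied automatically with a small guard in front of each assignment. The sketch below is an illustration, not part of neptune-client: the helper name `to_loggable` is hypothetical, and the bounds are the ones quoted in the server error message. It only handles built-in Python ints, so NumPy integer types would need an extra check.

```python
INT32_MIN = -2**31      # -2147483648, lower bound quoted in the server error
INT32_MAX = 2**31 - 1   #  2147483647, upper bound quoted in the server error


def to_loggable(value):
    """Return ints outside the signed 32-bit range as strings.

    Everything else (in-range ints, floats, strings, bools) is returned
    unchanged.
    """
    if isinstance(value, int) and not isinstance(value, bool):
        if not INT32_MIN <= value <= INT32_MAX:
            return str(value)
    return value


# Hypothetical usage with a run object:
#     run["test"] = to_loggable(9223372036854775807)
print(repr(to_loggable(9223372036854775807)))
print(repr(to_loggable(42)))
```

Values that fit stay numeric, so they are still treated as numbers in the UI. Only the out-of-range ones become strings.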

Blaizzy commented 2 years ago

Let me know if this solves your problem!

For now, I'm closing this issue.

If it doesn't solve it, feel free to reach out and we will re-open this issue.

lucmos commented 2 years ago

Thank you for your help!

Unfortunately, I do not think this is what's causing our problem.

The only log calls happening in the LightningModule are the following:

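# The "*_acc" key names are assumed; the accuracy keys did not survive in the original post.
self.log_dict({"train_acc": self.train_accuracy, "train_loss": train_loss}, on_epoch=True)
self.log_dict({"val_acc": self.val_accuracy, "val_loss": val_loss})
self.log_dict({"test_acc": self.test_accuracy, "test_loss": test_loss})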

Here, each accuracy is a torchmetrics.Accuracy object.

If it helps, I am also using capture_stdout=True, capture_stderr=True, and capture_hardware_metrics=True.

The problem started with PyTorch Lightning 1.5, so I fear the two could be related.

Blaizzy commented 2 years ago

Interesting,

Can you send a code example that replicates the issue?

talhaanwarch commented 2 years ago

I am also getting a similar error. The code is the same as in #752.

Global seed set to 0
validation loss accuracy  0 0.53 0.68
validation loss accuracy  1 0.44 0.74
validation loss accuracy  2 0.41 0.78
validation loss accuracy  3 0.4 0.75
validation loss accuracy  4 0.4 0.78
validation loss accuracy  5 0.38 0.79
validation loss accuracy  6 0.39 0.78
validation loss accuracy  7 0.4 0.79
Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
    return func(*args, **kwargs)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 198, in upload_raw_data
    response.raise_for_status()
  File "/home/talha/venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error:  for url: https://app.neptune.ai/api/leaderboard/v1/attributes/upload?experimentId=97e1dd32-9771-4562-92a4-7ad4728c8c6b&attribute=training%2Fmodel%2Fcheckpoints%2Flast.ckpt&ext=ckpt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
    self.work()
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
    self.process_batch(batch, version)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
    result = func(self_, *args, **kwargs)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
    result = self._processor._backend.execute_operations(self._processor._run_id, batch)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 349, in execute_operations
    self._execute_upload_operations_with_400_retry(
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 414, in _execute_upload_operations_with_400_retry
    raise ex
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 411, in _execute_upload_operations_with_400_retry
    return self._execute_upload_operations(run_id, upload_operations)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 374, in _execute_upload_operations
    error = upload_file_attribute(
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 56, in upload_file_attribute
    _upload_loop(file_chunk_stream=FileChunkStream(upload_entry),
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 162, in _upload_loop
    result = _upload_loop_chunk(chunk, file_chunk_stream, query_params=query_params.copy(), **kwargs)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 180, in _upload_loop_chunk
    return upload_raw_data(data=chunk.get_data(), headers=headers, query_params=query_params, **kwargs)
  File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 106, in wrapper
    raise ClientHttpError(status_code, e.response.text) from e
neptune.new.exceptions.ClientHttpError: 

----ClientHttpError-----------------------------------------------------------------------

Neptune server returned status 400.

Server response was:
{"errorType":"BAD_REQUEST","code":400,"title":"End of range is outside of given length"}

Verify the correctness of your call or contact Neptune support.

Need help?-> https://docs.neptune.ai/getting-started/getting-help

validation loss accuracy  8 0.39 0.79
validation loss accuracy  9 0.39 0.8

HubertJaworski commented 2 years ago

Hi @talhaanwarch

I'm afraid your issue is a bit different.

Are you uploading a file that is being appended to shortly after calling run["attribute"].upload("file")?

By default, neptune-client sends your operations to our servers asynchronously, so the file must keep the same length until the upload has finished.

If the file changes, try the synchronous API by setting wait=True. It should help as long as you only append to the file in the same thread that uploads it.
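If the synchronous call is not an option, another way to keep the uploaded file at a fixed length is to upload a snapshot copy instead of the live file. This is a different technique from the `wait=True` suggestion above, not something the maintainers proposed, and the helper name `snapshot_for_upload` is hypothetical. A minimal sketch using only the standard library:

```python
import os
import shutil
import tempfile


def snapshot_for_upload(path):
    """Copy `path` to a temporary file and return the copy's path."""
    suffix = os.path.splitext(path)[1]
    fd, snapshot_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; shutil reopens the file itself
    shutil.copyfile(path, snapshot_path)
    return snapshot_path


# Demonstration with a throwaway log file.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as live:
    live.write("epoch 0\n")
live_path = live.name

snapshot_path = snapshot_for_upload(live_path)

# The live file keeps growing after the snapshot was taken ...
with open(live_path, "a") as f:
    f.write("epoch 1\n")

# ... but the snapshot keeps the length it had at copy time.
live_size = os.path.getsize(live_path)
snapshot_size = os.path.getsize(snapshot_path)
print(live_size, snapshot_size)

# Clean up the demonstration files.
os.remove(live_path)
os.remove(snapshot_path)
```

In real use the snapshot path, rather than the live path, would be passed to run["attribute"].upload(...), and the temporary copy deleted once the upload has been synced.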

talhaanwarch commented 2 years ago

@HubertJaworski No, I am not uploading any file.

Blaizzy commented 2 years ago

@talhaanwarch

You have 2 separate issues in here:

I would like to keep them separate so we can solve each one accurately. If you don't mind, let's address the main issue here, which is the "Client error due to neptune not supporting 64-bit integer assignment".

Blaizzy commented 2 years ago

But before we proceed, I'm curious because you said here that the same code that produces the #752 issue also produces the two issues above.

Are you sure it is the same code producing 3 different errors?

If not, please send me example code that actually reproduces this issue (#751).

lucmos commented 2 years ago

@Blaizzy Sorry for the delayed response. I managed to debug the problem, and it was indeed related to the 64-bit integer issue.

I was using a uint32 to set the seed:

cfg.train.seed = random.randrange(np.iinfo(np.uint32).max)

I was also logging the seed (outside the model) for reproducibility. So roughly half of the time it produced the HTTP 400 error, depending on the sampled seed.
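The "half of the time" figure follows directly from the two ranges involved. The arithmetic below is a sketch that assumes the bounds quoted in the server error. The final `randrange` call is one possible fix, not something proposed in the thread.

```python
import random

INT32_MAX = 2**31 - 1    # 2147483647, the server's upper bound for ints
UINT32_MAX = 2**32 - 1   # 4294967295, same value as np.iinfo(np.uint32).max

# random.randrange(UINT32_MAX) draws from 0 .. UINT32_MAX - 1.
# Count how many of those seeds are larger than INT32_MAX.
too_large = (UINT32_MAX - 1) - INT32_MAX
fraction_rejected = too_large / UINT32_MAX
print(f"{fraction_rejected:.4f}")  # 0.5000

# The seed from the traceback is one of the rejected values.
assert 2470129626 > INT32_MAX

# Assumed fix: sample only seeds that fit in a signed 32-bit integer.
safe_seed = random.randrange(INT32_MAX + 1)
assert 0 <= safe_seed <= INT32_MAX
```

Restricting the range halves the number of possible seeds, which rarely matters for reproducibility, and keeps the value loggable as a number.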

Blaizzy commented 2 years ago

Great!

In that case, if you want to log this seed you can use any of the methods I specified here: https://github.com/neptune-ai/neptune-client/issues/751#issuecomment-979793058

Personally, I think logging it as a string should work fine, and it's easy to convert it back to an int.
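For the seed specifically, the round trip through a string is lossless. A short sketch, where the `"seed"` field name in the comment is hypothetical:

```python
seed = 2470129626  # value from the traceback; too large for a 32-bit int

# Log the decimal string instead of the integer,
# e.g. run["seed"] = logged (hypothetical field name).
logged = str(seed)

# When the value is read back later, int() recovers the original seed exactly.
restored = int(logged)
assert restored == seed
print(logged, restored)
```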

Blaizzy commented 2 years ago

Let me know if you need anything else regarding this issue,

I will be closing it for now.