Closed lucmos closed 2 years ago
Hey @lucmos,
Let me take a closer look with the engineering team. We will get back to you here with more info.
Hey @lucmos,
Prince Canuma here, Data Scientist at Neptune.ai
This issue occurs because Neptune currently supports only 32-bit integers. This means that somewhere in your code you might be trying to log a 64-bit integer.
Example:
run = neptune.init(...)
run["test"] = 9223372036854775807 ❌
This will generate the error you are reporting:
----ClientHttpError-----------------------------------------------------------------------
Neptune server returned status 400.
Server response was:
{"code":400,"errorType":"MALFORMED_JSON_REQUEST","title":"Malformed JSON request: JSON parse error: **Numeric value (9223372036854775807) out of range of int (-2147483648 - 2147483647)**;
Possible solutions/workarounds for logging your 64-bit integer:
run = neptune.init(...)
run["test"] = str(9223372036854775807) ✅
Docs: https://docs.neptune.ai/api-reference/field-types#string
run = neptune.init(...)
run["test"].log(9223372036854775807) ✅
Docs: https://docs.neptune.ai/you-should-know/what-can-you-log-and-display#metrics-and-losses
Let me know if this solves your problem!
For now, I'm closing this issue.
If it doesn't solve it, feel free to reach out and we will re-open this issue.
Thank you for your help!
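For readers who want to guard against this before logging, the string workaround above could be wrapped in a small helper. This is only a minimal sketch: `to_loggable` is a hypothetical name and not part of the Neptune client, and the 32-bit bounds are taken from the server error message quoted above.

```python
# Bounds taken from the server's error message: int (-2147483648 - 2147483647).
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def to_loggable(value):
    """Return value unchanged, or as a string if it is an int outside the 32-bit range.

    Hypothetical helper for illustration; not part of the Neptune client.
    """
    # bool is a subclass of int, so leave booleans untouched.
    if isinstance(value, int) and not isinstance(value, bool):
        if not INT32_MIN <= value <= INT32_MAX:
            return str(value)
    return value


# Usage, assuming `run` was created as in the comment above:
# run["test"] = to_loggable(9223372036854775807)
```

Values that already fit in 32 bits pass through unchanged, so the helper can be applied to every integer that is assigned to a run field.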
Unfortunately, I do not think this is what's causing our problem.
The only log calls happening in the LightningModule are the following:
self.log_dict({'train_accuracy': self.train_accuracy, 'train_loss': train_loss}, on_epoch=True)
self.log_dict({'val_accuracy': self.val_accuracy, 'val_loss': val_loss})
self.log_dict({'test_accuracy': self.test_accuracy, 'test_loss': test_loss})
where the accuracy is a torchmetrics.Accuracy object.
If it helps, I am using capture_stdout=True, capture_stderr=True, and capture_hardware_metrics=True.
The problem started with PyTorch Lightning 1.5, so I fear the two could be related.
Interesting,
Can you send a code example that replicates the issue?
I am also getting a similar error. The code is the same as in #752.
Global seed set to 0
validation loss accuracy 0 0.53 0.68
validation loss accuracy 1 0.44 0.74
validation loss accuracy 2 0.41 0.78
validation loss accuracy 3 0.4 0.75
validation loss accuracy 4 0.4 0.78
validation loss accuracy 5 0.38 0.79
validation loss accuracy 6 0.39 0.78
validation loss accuracy 7 0.4 0.79
Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 198, in upload_raw_data
response.raise_for_status()
File "/home/talha/venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: for url: https://app.neptune.ai/api/leaderboard/v1/attributes/upload?experimentId=97e1dd32-9771-4562-92a4-7ad4728c8c6b&attribute=training%2Fmodel%2Fcheckpoints%2Flast.ckpt&ext=ckpt
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
self.process_batch(batch, version)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
result = self._processor._backend.execute_operations(self._processor._run_id, batch)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 349, in execute_operations
self._execute_upload_operations_with_400_retry(
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 414, in _execute_upload_operations_with_400_retry
raise ex
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 411, in _execute_upload_operations_with_400_retry
return self._execute_upload_operations(run_id, upload_operations)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 374, in _execute_upload_operations
error = upload_file_attribute(
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 56, in upload_file_attribute
_upload_loop(file_chunk_stream=FileChunkStream(upload_entry),
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 162, in _upload_loop
result = _upload_loop_chunk(chunk, file_chunk_stream, query_params=query_params.copy(), **kwargs)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 180, in _upload_loop_chunk
return upload_raw_data(data=chunk.get_data(), headers=headers, query_params=query_params, **kwargs)
File "/home/talha/venv/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 106, in wrapper
raise ClientHttpError(status_code, e.response.text) from e
neptune.new.exceptions.ClientHttpError:
----ClientHttpError-----------------------------------------------------------------------
Neptune server returned status 400.
Server response was:
{"errorType":"BAD_REQUEST","code":400,"title":"End of range is outside of given length"}
Verify the correctness of your call or contact Neptune support.
Need help?-> https://docs.neptune.ai/getting-started/getting-help
validation loss accuracy 8 0.39 0.79
validation loss accuracy 9 0.39 0.8
Hi @talhaanwarch
I'm afraid your issue is a bit different.
Are you uploading a file that is being appended to shortly after calling run["attribute"].upload("file")?
neptune-client sends your operations to our servers in async mode by default, and needs the files to stay the same length.
Try using the sync API by setting wait=True. If the file changes, this should help as long as you only append to the file in the same thread that uploads it.
@HubertJaworski No, I am not uploading any file.
@talhaanwarch
You have 2 separate issues in here:
I would like to keep them separate so we can solve each one accurately. If you don't mind, let's address the main issue here, which is the "Client error due to Neptune not supporting 64-bit integer assignment".
But before we proceed, I'm curious because you said here that the same code that produces the #752 issue also produces the two issues above.
Are you sure it is the same code producing 3 different errors?
If not, please send me example code that actually reproduces this issue (#751).
@Blaizzy Sorry for the delay in responding. I managed to debug the problem, and it was indeed related to the 64-bit problem.
I was using a uint32 to set the seed:
cfg.train.seed = random.randrange(np.iinfo(np.uint32).max)
I was also logging the seed (not in the model) for reproducibility. Thus, about half of the time it gave the HTTP 400 error, depending on the sampled seed.
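To illustrate why the failure was intermittent: a seed drawn uniformly below the uint32 maximum exceeds the signed 32-bit maximum (2147483647, as quoted in the server error earlier in this thread) roughly half of the time. The following is a minimal sketch using only the standard library, with `2**32 - 1` standing in for `np.iinfo(np.uint32).max`; the `fits_int32` helper and the fixed `Random(0)` generator are assumptions made for the illustration.

```python
import random

UINT32_MAX = 2**32 - 1  # same value as np.iinfo(np.uint32).max
INT32_MAX = 2**31 - 1   # upper bound quoted in the server error message


def fits_int32(seed):
    """Return True if a non-negative seed fits in a signed 32-bit integer."""
    return seed <= INT32_MAX


# Draw seeds the same way as in the snippet above, with a fixed generator
# so that the experiment is repeatable.
rng = random.Random(0)
samples = [rng.randrange(UINT32_MAX) for _ in range(10_000)]

too_large = sum(not fits_int32(s) for s in samples)
fraction = too_large / len(samples)
print(f"{fraction:.2%} of sampled seeds exceed the int32 range")
```

Drawing the seed with `random.randrange(2**31)` instead would keep every value inside the range the server accepts, at the cost of a smaller seed space.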
Great!
In that case, if you want to log this seed, you can use any of the methods I specified here: https://github.com/neptune-ai/neptune-client/issues/751#issuecomment-979793058
Personally, I think logging it as a string should work fine, and it's easy to convert it back to int.
Let me know if you need anything else regarding this issue,
I will be closing it for now.
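The string round trip suggested above can be sketched as follows. The seed value and the `run["seed"]` field name are made up for illustration, and the Neptune assignment itself is left as a comment so the snippet runs without a Neptune project.

```python
seed = 3_000_000_000  # example value above the int32 maximum of 2147483647

# Convert to a string before logging, e.g. run["seed"] = logged_value
logged_value = str(seed)

# Later, after reading the string back, convert it to an int again.
restored_seed = int(logged_value)
print(restored_seed == seed)
```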
Describe the bug
Sometimes, during training, a ClientHttpError is raised.
Reproduction
I am running a minimal working example on MNIST with the new Lightning integration. I can reproduce this on two different machines.
Traceback
Environment
The output of pip list:
The operating system you're using: Ubuntu
The output of python --version: Python 3.8.12