justinTM / dbx

CLI tool for advanced Databricks jobs management.
https://dbx.readthedocs.io

Log output chomps first few chars when previous string contains an emoji #1

Open justinTM opened 2 years ago

justinTM commented 2 years ago

Expected Behavior

Launching a job with dbx launch --trace --job-output-log-level all prints notebook output periodically, alongside the job status output.

Current Behavior

When a line in the job output contains an emoji, the byte-count offset appears to be incorrect: the next time dbx prints job output, the first few characters are missing:

[dbx][2022-08-03 16:31:19.798] [Run Id: 167366] Latest cluster logs - [2022-08-01T12:25:52.588427][DatasetDeleter] ✅ finished deleting. Summary: {'seconds elapsed': 98.59}
[dbx][2022-08-03 16:31:24.684] [Run Id: 167366] Current run status info - result state: None, lifecycle state: RUNNING, state message: In run
[dbx][2022-08-03 16:31:44.680] [Run Id: 167366] Latest cluster logs - 022-08-03T16:31:32.661517] done. calculating statistics

Specifically, this is the error:

- Latest cluster logs - 022-08-03T16:31:32.661517]
+ Latest cluster logs - [2022-08-03T16:31:32.661517]

Now, when the new line also contains an emoji, the byte count is off by even more characters. It appears to be 3 chars per emoji, but I'm not sure; maybe 2.
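For reference, a quick check in a plain Python shell (my own repro, not dbx code) shows why the offset drifts: ✅ (U+2705) is one character but three UTF-8 bytes, so each occurrence inflates a UTF-8 byte count by 2 relative to the character count that io.StringIO.seek() actually uses:

s = "✅"
print(len(s))                  # 1 character (one code point)
print(len(s.encode("utf-8")))  # 3 bytes: b'\xe2\x9c\x85'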

Steps to Reproduce (for bugs)

Context

The current implementation reads from the Databricks API endpoint /jobs/runs/get-output and seeks forward an offset number of bytes into the string response (similar to the 2.0/dbfs/read endpoint's offset request parameter):

# dbx/commands/launch.py
import io

def _read_new(self, string, byte_count_offset):
    # Counts UTF-8 *bytes*, but io.StringIO.seek() positions by
    # *characters*, so any multi-byte character (like an emoji)
    # inflates the offset and chops the start of the next read.
    byte_count = len(string.encode('utf-8'))
    filelike = io.StringIO(string)
    filelike.seek(byte_count_offset)
    return filelike.read(), byte_count
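
Here is a self-contained repro of the chomp (a standalone adaptation of the function above; the polled strings are made up):

import io

def read_new(string, byte_count_offset):
    byte_count = len(string.encode("utf-8"))
    filelike = io.StringIO(string)
    filelike.seek(byte_count_offset)
    return filelike.read(), byte_count

first = "✅ done\n"                            # 7 characters, 9 UTF-8 bytes
full = first + "[2022-08-03T16:31:32] next\n"  # second poll returns more output
_, offset = read_new(first, 0)                 # offset = 9 (bytes), not 7 (chars)
chunk, _ = read_new(full, offset)
print(chunk)                                   # '022-08-03T16:31:32] next\n' -- leading '[2' chomped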

I'm thinking that if I change to UTF-16 encoding, the byte count would be accurate.
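One caveat: UTF-16 still counts emoji outside the Basic Multilingual Plane (most pictographic emoji, e.g. 😀) as two code units, so the count could drift the same way. A simpler option might be to keep everything in UTF-8 bytes and slice the encoded response instead of seeking a StringIO; a minimal sketch (my own idea, assuming the offset always comes from a previous call's byte count and therefore lands on a character boundary):

def read_new(string, byte_count_offset):
    # Offset and count are both in UTF-8 bytes, so they can't drift apart.
    data = string.encode("utf-8")
    return data[byte_count_offset:].decode("utf-8"), len(data)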

Your Environment
