martin-ueding / geo-activity-playground

Data analysis and visualization based on GPS tracked outdoor activities.
https://martin-ueding.github.io/geo-activity-playground/
MIT License

macOS quarantine issue appearing as Unicode error #83

Closed martin-ueding closed 5 months ago

martin-ueding commented 5 months ago

A macOS user has trouble opening GPX files. They sent me one of the files, and I can open it on Linux without problems, so something odd is going on. This is an example traceback:

2024-01-03 21:41:10 geo_activity_playground.importers.directory ERROR Error while parsing file Activities/._route_2023-01-17_5.05pm.gpx:
Traceback (most recent call last):
  File "/home/ecki/.local/lib/python3.10/site-packages/geo_activity_playground/core/activity_parsers.py", line 32, in read_activity
    df = read_gpx_activity(path, opener)
  File "/home/ecki/.local/lib/python3.10/site-packages/geo_activity_playground/core/activity_parsers.py", line 133, in read_gpx_activity
    gpx = gpxpy.parse(f)
  File "/home/ecki/.local/lib/python3.10/site-packages/gpxpy/__init__.py", line 37, in parse
    parser = mod_parser.GPXParser(xml_or_file)
  File "/home/ecki/.local/lib/python3.10/site-packages/gpxpy/parser.py", line 70, in __init__
    self.init(xml_or_file)
  File "/home/ecki/.local/lib/python3.10/site-packages/gpxpy/parser.py", line 84, in init
    text = text.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 37: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ecki/.local/lib/python3.10/site-packages/geo_activity_playground/importers/directory.py", line 42, in import_from_directory
    timeseries = read_activity(path)
  File "/home/ecki/.local/lib/python3.10/site-packages/geo_activity_playground/core/activity_parsers.py", line 38, in read_activity
    raise ActivityParseError(f"Encoding issue with {path=}: {e}") from e
geo_activity_playground.core.activity_parsers.ActivityParseError: Encoding issue with path=PosixPath('Activities/._route_2023-01-17_5.05pm.gpx'): 'utf-8' codec can't decode byte 0xb0 in position 37: invalid start byte

In order to diagnose this further, I've added a bit more logging.

martin-ueding commented 5 months ago

I got some log output that contains the first 100 bytes of the file. They are:

>>> b = b'\x00\x05\x16\x07\x00\x02\x00\x00Mac OS X        \x00\x02\x00\x00\x00\t\x00\x00\x002\x00\x00\x0e\xb0\x00\x00\x00\x02\x00\x00\x0e\xe2\x00\x00\x01\x1e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00ATTR\xff\xff\xef\x17\x00\x00\x0e\xe2\x00\x00\x00\x98'

We can then take a look and try to detect the character encoding:

>>> import chardet
>>> chardet.detect(b)
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

So the file might be encoded as Windows-1252, although a confidence of 0.73 is far from conclusive. The user said that the files got converted many times, so there may even be irrecoverable encoding errors that have garbled the data.

Since these are GPX files, the data of interest consists of ASCII characters, which almost every code page decodes the same way. So parsing may work out even if the chosen code page is not exactly the right one.

Version 0.17.4 contains some experimental code for that.
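
The experimental code itself is not shown in this thread. As a minimal sketch of the general idea, assuming a fallback from UTF-8 to Windows-1252 (the helper name `decode_lenient` is hypothetical and not part of geo-activity-playground or gpxpy), one could decode the bytes before handing the text to the parser:

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8 first, then fall back to Windows-1252.

    Hypothetical helper for illustration; not the project's actual code.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Windows-1252 leaves a few byte values undefined, so replace
        # those with U+FFFD instead of raising a second time.
        return raw.decode("cp1252", errors="replace")


# The ASCII parts (tags, coordinates) survive regardless of the fallback.
sample = b'<trkpt lat="50.7" lon="7.1"><name>Caf\xe9</name></trkpt>'
text = decode_lenient(sample)
```

Such a fallback only helps when the file really is text in another code page. It cannot recover a file that is not GPX at all.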

martin-ueding commented 5 months ago

I've let the program emit the first 1000 bytes into the log, and there we find the string com.apple.quarantine. So some Apple-specific feature is involved here. Interestingly, the file name is Activities/._route_2023-01-17_5.05pm.gpx, so it appears to be a hidden file. I'm not sure what exactly this means. Is there also a file Activities/route_2023-01-17_5.05pm.gpx that can be read just fine, or is that one broken too? I've asked the user to test a bit more.
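
As a side note, the logged bytes from the earlier comment begin with 00 05 16 07, which is the magic number of the AppleDouble format that macOS uses for `._` companion files holding extended attributes. A minimal sketch of a check for that signature follows; the function name `looks_like_appledouble` is hypothetical and this is not something the project is known to do:

```python
# Magic number at the start of an AppleDouble file.
APPLEDOUBLE_MAGIC = b"\x00\x05\x16\x07"


def looks_like_appledouble(head: bytes) -> bool:
    """Return True if the leading bytes carry the AppleDouble signature."""
    return head.startswith(APPLEDOUBLE_MAGIC)


# Start of the bytes quoted in the log output above.
head = b"\x00\x05\x16\x07\x00\x02\x00\x00Mac OS X        "
is_metadata_file = looks_like_appledouble(head)
```

If that reading is right, the file is binary metadata rather than a GPX track in an unusual encoding, which would explain why no code page makes it parse.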

martin-ueding commented 5 months ago

Since the quarantine files start with a period, we can simply skip every file whose name starts with one. That should make the import more robust.
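
A minimal sketch of such a filter, assuming the importer walks a directory with `pathlib` (the function names `is_hidden` and `find_activity_files` are hypothetical, not the project's actual API):

```python
from pathlib import Path


def is_hidden(path: Path) -> bool:
    """Return True for files whose name starts with a period."""
    return path.name.startswith(".")


def find_activity_files(root: Path) -> list[Path]:
    """Collect GPX files below root, skipping hidden ones like ._route.gpx."""
    return sorted(
        p for p in root.rglob("*.gpx") if p.is_file() and not is_hidden(p)
    )
```

With this, `find_activity_files(Path("Activities"))` would return `route_2023-01-17_5.05pm.gpx` if it exists but never its `._` companion. Note that the sketch only looks at the file name, not at hidden parent directories.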