tkrajina / gpxpy

gpx-py is a python GPX parser. GPX (GPS eXchange Format) is an XML based file format for GPS tracks.
Apache License 2.0
989 stars 223 forks source link

slow #92

Closed wrohdewald closed 5 years ago

wrohdewald commented 7 years ago

GPX parsing needs a lot of time for converting the time strings in the points to datetime.datetime (calling strptime).

If I disable this by doing

import gpxpy.gpxfield as mod_gpxfield mod_gpxfield.TIME_TYPE=None

before I import from gpxpy what I need, my application is about 25% faster. That is too much for me to be ignored.

So what I would like to see is an official stable API which disables parsing for times in points. The application can do that if and when it needs datetimes. My application rarely does. Even comparing still works for the unparsed strings as long as they are formatted the same way (which they hopefully are within one file).

Or maybe gpxpy could transparently delay parsing until a datetime is really needed (a property, caching the calculation).

philshem commented 6 years ago

In addition to what you noticed, I'm having issues with the milliseconds not being recognized if the first timestamp doesn't contain them (see my comment on this issue here). So what I'm doing is actually running your mod_gpxfield.TIME_TYPE=None idea from a local installation of gpxpy and then parsing the string on my own.

Here's my function, which I found on Stackoverflow to be faster than using datetime.strptime():

def convert_time(in_string):

    ## '2017-08-10T15:59:28Z'
    ## '2017-08-10T15:59:28.200Z'

    yy = int(in_string[0:4])
    mm = int(in_string[6:7])
    dd = int(in_string[8:10])
    h = int(in_string[11:13])
    m = int(in_string[14:16])
    s = int(in_string[17:19])

    try:
        us = int(in_string[20:23])*1000
    except: # no microseconds in gpx timestamp
        us = 0

    out_date = datetime.date(year = yy, month = mm, day = dd)
    out_time = datetime.time(hour = h, minute = m, second = s, microsecond = us)
    out_datetime = datetime.datetime.combine(out_date, out_time)

    return out_datetime
wrohdewald commented 6 years ago

Your code works for seconds with 3 fractional digits. But according to the standard, any number of fractional digits is allowed. Without trying it out, I guess your code would truncate microseconds to milliseconds, and it would truncate those to one second: 1.3 and 1.45

Also, your code ignores time zone specifications.

See https://www.w3.org/TR/2001/REC-xmlschema-2-20010502/datatypes#dateTime

philshem commented 6 years ago

You're right, this fix wasn't meant to patch the code but just something to get my code working. My timestamps are well defined, but don't have 000 microseconds in every case, I've written more in another issue.. My comment is more to address the slow speed of datetime.strptime().

This is what the code loops over, to find the matching format (although I suspect not for every timestamp)

In gpxpy/gpxpy/gpxfield.py, the function parse_time() is called in the TimeConverter() class and contains a call to mod_gpx.DATE_FORMATS

# GPX date format(s) used for parsing. The T between date and time and Z after
# time are allowed, too:
DATE_FORMATS = [
    '%Y-%m-%d %H:%M:%S',
    '%Y-%m-%d %H:%M:%S.%f',
    #'%Y-%m-%d %H:%M:%S%z',
    #'%Y-%m-%d %H:%M:%S.%f%z',
]
tkrajina commented 6 years ago

OK, I should have added my 5 cents before. But unfortunately I'm just too busy to do any work on gpxpy at the moment. Sorry :(

I'm well aware that timestamp formats and the timestamp formatting speed is a problem. It always was.

It's unfortunate that DATE_FORMATS is a global variable and I don't especially like the idea of changing global variables. But what can be done is to change the parser, to have it's own copy of DATE_FORMATS, then you just need to update it as you like

parser = gpxpy.GPXParser(xml_or_file, parser=parser)
parser.date_formats = [...your formats...]
gpx = parser.parse()

Another thing that can be done quickly -- formats could be strings (in that case the usual Python's parser will be used) or functions (in that case you provide a function which will be called with the string from the XML file). If you are sure you can ignore times, you just provide lambda ts: None and it just works.

What do you think?

tkrajina commented 6 years ago

Ah, and another thing. If we allow for formats to be functions, the function can return something else than a normal Python timestamp. It can return an object which can be lazily evaluated when needed, for example.

The only problem is that for various timestamp calculations gpxpy expects timestamps.

philshem commented 6 years ago

Hi Tomo - thanks for your thoughts and hard work on this package

What do you think?

I like this idea, especially being able to disable recognizing timestamps for speed with lambda ts: None. The fix addresses the challenge of balancing speed, and GPX files with heterogeneous datetime formats.

Sort of related to this issue, but commented on the other issue, is it true that the datetime format recognition is only working on the first timestamp?

tkrajina commented 6 years ago

Not yet complete (and badly behind my promised time schedule), but my first implementation is in timestamp-parsing. Tests are still failing, but this weekend I'll finally have a few hours to finish it.

tkrajina commented 6 years ago

Made a few more changes. They are now in timestamp-parsing--2. An example how it works is in https://github.com/tkrajina/gpxpy/blob/timestamp-parsing--2/test.py#L2815 (https://github.com/tkrajina/gpxpy/blob/timestamp-parsing--2/test.py#L2824).