gweis / isodate

ISO 8601 date/time parser
BSD 3-Clause "New" or "Revised" License
148 stars 58 forks source link

parse_time() is not threadsafe #54

Open fmux opened 5 years ago

fmux commented 5 years ago

I just encountered a bug that was very hard to track down: In a multithreaded environment, isotime.parse_time() sometimes fails to parse a valid ISO8601 string of the form HH:MM+HH:MM. In a console (i.e., in a single-threaded setting), the same string is parsed without error.

The problem seems to be in the function build_time_regexps() that is called at the beginning of parse_time(): It checks whether the global variable TIME_REGEX_CACHE is empty, and if so, it fills it with various compiled regexes, one after another. If control passes to another thread after some but not all regexes have been append()ed to that list, the other thread will find the list to be non-empty and hence assume it contains all the necessary compiled regexes. Hence it will succeed in parsing some time formats and fail for others.

The function isodates.parse_date() seems to be using the same technique, and so it likely suffers from the same problem. I have not looked at that function in detail though.

MikiDi commented 5 years ago

This is what's keeping me from using this (otherwise great :+1: ) library as well ... @gweis would you accept a PR that fixes this by wrapping the globals inside a closure (or do you prefer classes)?

gweis commented 5 years ago

The whole reason why this is is in a function is probably due to a not so well thought through use case. Have a look at https://github.com/gweis/isodate/blob/master/src/isodate/isodates.py#L48, where it would be possible to allow more than 4 year digits if desired.

@MikiDi not sure how you intend to fix it by wrapping it into a closure.

I am rather thinking, that initialising the TIME_REGEX_CACHE and DATE_REGEX_CACHE on module import would be a simpler solution. Something like pre populate that dictionary on module load, and add a helper methods that can be used on startup (or maybe even triggered with env vars) to add more regexes to it.

That should fix the race condition, and provides a clear entry point for applications to use other formats. The build_date_regexps cloud be that entry point; a better name wight be nice though.

(open for other ideas)

fmux commented 5 years ago

The way I understand it, you basically want a module that is configurable "on load", i. e., the ability to specify parameters like yeardigits and expanded at import time. Since Python does not allow for that, the usual solution would be to have isodates.py define a class (say, DateParser) that sets up the cache on object initialization. A less object-oriented and more functional approach would be to provide instead a factory function (something like create_date_parser(yeardigits=4, expanded=False) that constructs an anonymous function using the given parameters and returns it. I think that is what @MikiDi was referring to by "wrapping it into a closure".

Needless to say, both of these methods would change the API in a non-backwards compatible way. So a better solution might indeed be to set up the cache on import and allow for changing it later. This would mean that people who want to use the extended functionality have to jump through one more hoop.

Alternatively, you could use a threading.Lock around the cache creation. This would mean the least amount of change for the API (and be the most explicit one), so in your position I would probably take that route.