ixc / python-edtf

MIT License

Deal better with approximate/uncertain dates #6

Open cogat opened 7 years ago

cogat commented 7 years ago

Currently, a fixed (precision-dependent) and arbitrary amount of fuzziness is added to approximate/uncertain dates. This is used to return a date range that can be used for filtering, but it's a pretty basic interpretation.

Properly, the level of uncertainty would be a curve that ramps continuously from 1.0 ("the given date is a certain match for the EDTF range") down to 0.0 ("the given date is certainly not a match").
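To make the idea of such a curve concrete, here is a minimal sketch using a trapezoidal shape, which is only one possible choice. The function name `membership` and its four boundary parameters are hypothetical and are not part of python-edtf; the boundaries chosen for an approximate 1905 are arbitrary.

```python
from datetime import date


def membership(d, support_start, core_start, core_end, support_end):
    """Return how well date d matches a trapezoidal fuzzy interval (0.0 to 1.0).

    Hypothetical helper, not part of python-edtf. Dates inside the core
    score 1.0, dates outside the support score 0.0, and dates in between
    are interpolated linearly by day count.
    """
    if d < support_start or d > support_end:
        return 0.0
    if core_start <= d <= core_end:
        return 1.0
    if d < core_start:
        # Rising edge: support_start <= d < core_start, so the divisor is > 0.
        return (d - support_start).days / (core_start - support_start).days
    # Falling edge: core_end < d <= support_end, so the divisor is > 0.
    return (support_end - d).days / (support_end - core_end).days


# Assumed boundaries for an approximate 1905: certain within 1905 itself,
# fading out linearly over one year on either side.
approx_1905 = (date(1904, 1, 1), date(1905, 1, 1),
               date(1905, 12, 31), date(1906, 12, 31))

inside = membership(date(1905, 6, 15), *approx_1905)
halfway = membership(date(1904, 7, 2), *approx_1905)
outside = membership(date(1903, 12, 31), *approx_1905)
```

A filter could then keep every record whose score exceeds some threshold, instead of relying on a fixed date range.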

Open questions:

koenedaele commented 7 years ago

Hi, thank you for this library. Still checking out EDTF, but it looks like it solves a lot of issues I've been having. I know other implementations exist, but our main backend language is Python (and this seems like one of the more complete implementations anyway).

I have done research on querying fuzzy time intervals that might be of interest: have a look at http://samm.univ-paris1.fr/IMG/pdf/paris2014.pdf or https://www.researchgate.net/publication/266750215_Modelling_Imperfect_Time_in_Datasets. If you can read Dutch, have a look at http://lib.ugent.be/fulltxt/RUG01/001/418/820/RUG01-001418820_2010_0001_AC.pdf. We have implemented this for PostgreSQL in a fairly efficient way: see https://github.com/OnroerendErfgoed/pgSFTI for a native C implementation and https://github.com/koenedaele/pgFTI for a pure SQL implementation (based on PostGIS).

When implementing solutions like these, I have found that the main hindrance is user adoption. Something like fuzzy sets (which my work is based on) is very hard for most art historians and archaeologists to understand. But I'm thinking that capturing information as EDTF and then generating Fuzzy Time Intervals from it might work very well.

I have no good solution for translating "ca. 1905" to a fuzzy time interval. Does that mean probably in 1905, but possibly in 1904 or 1906? Or does it mean somewhere between 1900 and 1910, but probably more towards the middle? I think it's inherently contextual and hard to solve for every use case.
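The two readings of "ca. 1905" can be compared side by side with a small sketch. Everything here is an assumption made for illustration: the `trapezoid` helper, the interpretation names, and especially the boundary years, which are arbitrary and not derived from EDTF, python-edtf, or pgFTI.

```python
def trapezoid(year, a, b, c, d):
    """Possibility (0.0 to 1.0) that `year` matches a fuzzy interval.

    Hypothetical helper: 0.0 outside [a, d], 1.0 inside [b, c],
    linear in between.
    """
    if year < a or year > d:
        return 0.0
    if b <= year <= c:
        return 1.0
    if year < b:
        return (year - a) / (b - a)
    return (d - year) / (d - c)


# Two assumed interpretations of "ca. 1905" (boundary years are arbitrary).
INTERPRETATIONS = {
    # Probably 1905, possibly 1904 or 1906.
    "narrow": (1903, 1905, 1905, 1907),
    # Somewhere between 1900 and 1910, more likely towards the middle.
    "wide": (1900, 1904, 1906, 1910),
}

# Possibility of each candidate year under each interpretation.
scores = {
    name: {year: trapezoid(year, *bounds) for year in (1902, 1904, 1905)}
    for name, bounds in INTERPRETATIONS.items()
}
```

Under these assumed boundaries, 1902 is excluded by the narrow reading but is a partial match under the wide one, which is why the choice probably has to be configurable per collection or per use case.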