libo26 / feedparser

Automatically exported from code.google.com/p/feedparser
Other
0 stars 0 forks source link

Improved date detection #237

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
If a date handler fails to detect a date earlier on the feed this patch causes 
that date handler to be skipped for later items in the feed (which would mean 
if we earlier determine that the date format was D/M/Y that we wouldn't try 
M/D/Y since that failed earlier).  Unfortunately this doesn't fix items 
retroactively (determining that the date format was earlier misdetected when we 
find the right form later).

Original issue reported on code.google.com by EpsilonP...@hotmail.com on 3 Dec 2010 at 10:50

Attachments:

GoogleCodeExporter commented 9 years ago
This seems to be an unnecessary optimization. It's unlikely to create 
significant performance gains for most feeds. I recommend closing this bug.

Original comment by kurtmckee on 4 Dec 2010 at 7:36

GoogleCodeExporter commented 9 years ago
I'm going to leave this open for the moment.
David: do you have any performance stats about the impact of this patch?

Original comment by adewale on 4 Dec 2010 at 11:12

GoogleCodeExporter commented 9 years ago
This actually wasn't a performance optimization, but an attempt to fix 
correctness (an only a partial one at that).  We were getting a small number of 
feeds where the wrong date format was detected.  This change used the earlier 
correct failures to prevent later issues.  I can construct a sample feed for 
this.

Original comment by EpsilonP...@hotmail.com on 6 Dec 2010 at 9:51

GoogleCodeExporter commented 9 years ago
Okay, I think I'm on the same page (and I'm going to reach a conclusion based 
on that, but please let me know if I'm mistaken).

You're looking at the possibility of having dates with the day first, right? As 
mentioned above, you might not know in advance if it's M-D-Y or D-M-Y, which 
would give any software fits if it's given a list of dates like:

    12/6/2010
    4/5/2010
    3/12/2010

But if you suddenly discover one of the following dates in the same feed, you'd 
instantly know which format was in use:

    12/31/2010 (m-d-y)
    31/12/2010 (d-m-y)

If you're having to check for D-M-Y, I'm certain that you're having to modify 
feedparser's built-in date parsers, or you're registering your own using 
registerDateHandler, since the built-in date parsers don't appear to parse date 
strings that could overlap -- I'm pretty sure they're unambiguous.

Is this something that can be fixed in your application without having to 
modify feedparser to introduce what looks like a thread-unsafe global 
dictionary? There's already a bug report open complaining about feedparser 
having a thread-safety issue, and while I'm looking forward to tackling that 
soon, I'm concerned that this patch will introduce another thread-safety 
problem.

You could use a class to store state, and add a __call__() and a reset() method 
to it so that, once instantiated, it can be called like a function, but can 
also store state and be reset for the next time parse() is called. For instance:

class CustomDateTimeParser:
  def __init__(self):
    self.state = 0 # or a list of date-time parser functions
    print self.state
  def __call__(self, date):
    self.state += 1 # or modify state only when a date-time function fails
    print self.state
  def reset(self):
    self.state = 0 # or restore the full list of date-time functions
    print self.state

Then you can run code like:

>>> a = CustomDateTimeParser() # self.state = 0
0
>>> a('mdy') # increment
1
>>> a('mdy') # increment
2
>>> a('mdy') # increment
3
>>> a.reset() # reset
0

Callable class instantiations are a popular design pattern in the Django 
community from what I've read, and they work great! This would allow you to try 
M-D-Y first, and switch to D-M-Y if that fails, without having to add a global 
dictionary to feedparser.

Again, this is predicated on my interpretation of what you wrote above. If 
everything above was correct, then I don't think that adding a global 
dictionary is the right way to handle this.

Original comment by kurtmckee on 7 Dec 2010 at 5:26

GoogleCodeExporter commented 9 years ago
Your approach provides more correctness (correcting all the dates if at least 
one can be found that was wrong).  I like that.  What I don't like is that 
everyone doesn't get it for free.  Is it acceptable to put this logic in at the 
top level?  Alternatively it could be provided as a wrapper (or optional 
wrapper) at the top level.

I agree about thread safety - this was not the way to do it if you need that.

Original comment by EpsilonP...@hotmail.com on 7 Dec 2010 at 1:05

GoogleCodeExporter commented 9 years ago
Please you add some examples of feeds on the web that exhibit this bug.

Original comment by adewale on 22 Dec 2010 at 9:51

GoogleCodeExporter commented 9 years ago
I think that unhandled feed dates can be handled by developers, and that this 
particular patch will introduce thread-safety problems.

Original comment by kurtmckee on 26 Apr 2011 at 6:44