ribbons / RadioDownloader

An easy to use application for managing podcast subscriptions and downloads.
https://nerdoftheherd.com/tools/radiodld/
GNU General Public License v3.0
15 stars 11 forks source link

Overhauled the code that strips the date from episode titles. #59

Closed ribbons closed 6 years ago

ribbons commented 11 years ago

Original report from Matt Robinson at 10:04:22 on 2009-04-13

The code that removes the episode date from episode titles has a few shortcomings, and should be overhauled. Currently, it strips dates from the start and end of the title, but misses the date if it is included in the middle of the title. There are also a number of formats which are missed, such as 13.02.2009

As the code is currently a large regular expression which is slightly hard to maintain, it would seem to be a good time to move this to a separate method or class, and write some unit tests to ensure that regressions aren't introduced when it is updated.


Imported from Bug 100 in the NerdoftheHerd.com Bugzilla.

nbl1268 commented 7 years ago

Looking over the RegEx for removing dates from episode titles and, as well as revising code to remove from middle of episode title, can see there are a few other date formats that could be removed. For example 13.02.2009; also 13 Feb 2009; 13 February 2009; 13 Feb '09; 13th Feb 2009; 13th-Feb-2009; 13feb09; 09 Feb13; 13 Feb '09 (just form looking through some podcasts I've downloaded).

Also, give any thought to using the same approach to remove the date (and maybe time) details from audio file 'title' field (not file name) as well?

In terms of updating this, it seems that another option is to simply break the single RegEx into a series of more manageable RegEx statements each handling a specific date format (or set of similar formats).

Would you mind perhaps sketching out you'd prefer this to be done? and I'll have a go at it from there?

ribbons commented 7 years ago

Would you mind perhaps sketching out you'd prefer this to be done? and I'll have a go at it from there?

I'm in the throes of a chest infection at the moment - I'll get back to this in a little while when I can think straight again. Don't worry, I've not forgotten!

ribbons commented 7 years ago

I'm back in action again - have just added some tests for the existing code & will answer your questions tomorrow.

ribbons commented 7 years ago

Looking over the RegEx for removing dates from episode titles and, as well as revising code to remove from middle of episode title, can see there are a few other date formats that could be removed. For example 13.02.2009; also 13 Feb 2009; 13 February 2009; 13 Feb '09; 13th Feb 2009; 13th-Feb-2009; 13feb09; 09 Feb13; 13 Feb '09 (just form looking through some podcasts I've downloaded).

To further complicate matters the code strips different date formats from before and after the episode title! For instance, 13 Feb 2009 is stripped after a title but not before. That seems to me like something that should be fixed!

Also, give any thought to using the same approach to remove the date (and maybe time) details from audio file 'title' field (not file name) as well?

I think that this is outside the scope of this issue, but as I made StripDateFromName accessible from outside the Radio Downloader assembly in 496192a299682fac40bb62eca6a5680ff6941d2b, the Podcast Provider could conceivably do this fairly easily by calling the function and re-tagging the downloaded file (probably controlled by a preference).

In terms of updating this, it seems that another option is to simply break the single RegEx into a series of more manageable RegEx statements each handling a specific date format (or set of similar formats).

I'd probably start in a different direction - it would be preferable to keep one (or a small number) of actual regular expressions, but build them up from a few strings so that we don't have one completely unmanageable line and don't repeat ourselves too much. It would also seem worthwhile to stop generating the regex individually for each passed date - extracting the date values and comparing them afterwards would allow us to pre-compile the regex for performance and also correctly handle stripping out nearby dates near the start or end of a month.

Would you mind perhaps sketching out you'd prefer this to be done? and I'll have a go at it from there?

I've pushed 496192a299682fac40bb62eca6a5680ff6941d2b which added a tests project to the solution and added some basic tests for the existing functionality of StripDateFromName. I'd say to make a start on refactoring the code (making sure the tests still pass), and also to add tests for the failing formats and positions before fixing them in StripDateFromName. Once you've got that far, raise a PR and I'll probably chip in with some style adjustments or additional tests and we can take it from there :smile:

Hope this all makes sense - let me know if not.

nbl1268 commented 7 years ago

Happy to leave updating the Podcast Provider to remove date from file title field as a separate issue/PR.

Looks like I've got my fork updated, hopefully without causing too much grief your way (had only single commit for the .gitignore).

Yes to some degree it all makes sense, but I'm sure I'll have some questions along the way. Haven't used Xunit for testing, so got some learning to do there :)

nbl1268 commented 7 years ago

I'd probably start in a different direction - it would be preferable to keep one (or a small number) of actual regular expressions, but build them up from a few strings so that we don't have one completely unmanageable line and don't repeat ourselves too much.

Yes that make good sense to me.

It would also seem worthwhile to stop generating the regex individually for each passed date - extracting the date values and comparing them afterwards would allow us to pre-compile the regex for performance and also correctly handle stripping out nearby dates near the start or end of a month.

Sorry, haven't quite understood what you are suggesting here. I'm thinking you are saying to remove these stripDate.ToString("yyyy", CultureInfo.InvariantCulture) entries form the regex. Is that correct?

Do you have a particular RegEx testing service that you use? I'm thinking something like http://regexr.com/ as example.

nbl1268 commented 7 years ago

Will only remove dates with the same month & year as the programme itself, but any day of the month

Building an understanding of the current regex. Is the option to not remove the date if the date is in a different month/year important?

ribbons commented 7 years ago

Sorry, haven't quite understood what you are suggesting here. I'm thinking you are saying to remove these stripDate.ToString("yyyy", CultureInfo.InvariantCulture) entries form the regex. Is that correct?

Kind of, my suggestion is to change those parts of the regex into capturing groups and then compare the captured values afterwards.

Do you have a particular RegEx testing service that you use? I'm thinking something like http://regexr.com/ as example.

Sorry, I just tend to sit and pore over them... I'm sure there are some handy services out there which would make things easier though.

Is the option to not remove the date if the date is in a different month/year important?

Yes, if the episode title contains a completely different date it is probably useful information which the user would find annoying if it were hidden.

nbl1268 commented 7 years ago

Kind of, my suggestion is to change those parts of the regex into capturing groups and then compare the captured values afterwards.

Been working on a RegEx that locates a range of date formats, at start, min and end of the string.

(((3[01]|2\d|1\d|0?\d)(st|nd|rd|th)?(\.|\-|\/| )((2\d|1\d|0?\d)|(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))(\.|\-|\/| )(\d{4}|\'\d{2}|\d{2}))( ?)|((\d{4}|\'\d{2}|\d{2})(\.|\-|\/| )((2\d|1\d|0?\d)|(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))(\.|\-|\/| )(3[01]|2\d|1\d|0?\d)(st|nd|rd|th)?( ?))|((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))(\.|\-|\/| )(3[01]|2\d|1\d|0?\d)(st|nd|rd|th)?(\.|\-|\/| )((?:(?:\d{4})|(?:\d{2})(?![\\d])|(?:\')(?:\d{2})(?![\\d])))( ?))

Have tested this a few times now using these two platforms and a set of dates contained in the attached file and i'm reasonably confident this is capturing all the formats identified so far and returning the date. https://www.debuggex.com/ and http://regexstorm.net/tester RegEx Test Dates.txt

Mind you the date is in different formats (delimiters and order of month, year etc) Haven't yet tested this back in c#. Expecting i'll probably need to add something to make it work there.

ribbons commented 7 years ago

Looks like you've been pretty thorough :smile:. Definitely will benefit from the repeated parts being split out into string constants as it looks a bit scary all in one go!

nbl1268 commented 7 years ago

Got a draft version in my fork that has those repeated parts split out; still some more testing but its looking like I've got it working (at least matching the date from the tests).

Comparing the regex match result to the stripdate and checking for similar and different dates is next. Similar date is currently within the same month, do you want to keep it that way? or have a user setting for this?

ribbons commented 7 years ago

Got a draft version in my fork that has those repeated parts split out; still some more testing but its looking like I've got it working (at least matching the date from the tests).

Okay, great - as soon as you'd like me to review your changes feel free to raise a PR (you can just keep amending it until we are both happy).

Comparing the regex match result to the stripdate and checking for similar and different dates is next. Similar date is currently within the same month, do you want to keep it that way? or have a user setting for this?

No, that was just for ease of implementation. I'd guess somewhere between 3 and 6 days would be best, might need a bit of experimentation. I don't think we need to add a user preference.

nbl1268 commented 7 years ago

Quick update - Been working through testing the various date formats are correctly detected by the regex and that the regex match date value is also shown as correct date (the test date) and thus can be used to test for similar date (+/-6 days).

The last of the options i am testing is 'yy/m/d' and 'yy/mm/d' (lazy versions of 'yyyy/mm/dd') Using test date of 2009-09-01 '09/9/1 test' gets through regex and to DateTime as 09/09/2001 '09/09/1 test' gets through regex and to DateTime as 09/09/2001

Using test date of 2009-09-13 '09/9/13 test' gets through regex and to DateTime as 09/09/2013 '09/09/13 test' gets through regex and to DateTime as 09/09/2013

Any thoughts on how best to handle these lazy date formats (or too not handle them) welcomed.

ribbons commented 7 years ago

Any thoughts on how best to handle these lazy date formats (or too not handle them) welcomed.

I wouldn't worry about these unduly - as your example shows it isn't possible to infer the correct date format 100% of the time. My gut feeling is that Y-M-D formats should generally have four digit year values to prevent this kind of ambiguity, but I dare say that varies somewhat by locale.

nbl1268 commented 7 years ago

okay, have raised PR #217 for you to have a look over the changes.

ribbons commented 6 years ago

Implemented in 8a4e36968c16a27efc9b26d733b4a7fcfec292d5.