jamesmeneghello / pynab

Newznab-compliant Usenet Indexer written in Python, using PostgreSQL/MySQL-like.

Categorization, issues with group-based regexes #58

Closed shpd closed 10 years ago

shpd commented 10 years ago

Reproduced on the latest development commit (c863199213e43ae58be302dc1608088800f94655):

No posts get categorized as Movie; everything goes directly to "TV". Two examples:

Beverly.Hills.Cop.1984.1080p.BluRay.DTS-HD.x264-BARC0DE from alt.binaries.movies
Machined.Reborn.2009.1080p.BluRay.x264-RUSTED from alt.binaries.warez

They get their category assigned at https://github.com/Murodese/pynab/blob/development/pynab/releases.py#L233 which calls into categories.py. I can reproduce the error by hand in an interactive session:

>>> import pynab.categories
>>> bname1
'Beverly.Hills.Cop.1984.1080p.BluRay.DTS-HD.x264-BARC0DE'
>>> gname
'alt.binaries.movies'
>>> bname2
'Machined.Reborn.2009.1080p.BluRay.x264-RUSTED'
>>> gname2
'alt.binaries.warez'
>>> pynab.categories.determine_category(bname1, gname)
2014-03-20 12:27:23,186 - INFO - category: [Beverly.Hills.Cop.1984.1080p.BluRay.DTS-HD.x264-BARC0DE]: TV > HD (5040)
5040
>>> pynab.categories.determine_category(bname2, gname2)
2014-03-20 12:27:27,690 - INFO - category: [Machined.Reborn.2009.1080p.BluRay.x264-RUSTED]: TV > HD (5040)
5040
>>> pynab.categories.determine_category(bname1, "")
2014-03-20 12:27:58,857 - INFO - category: [Beverly.Hills.Cop.1984.1080p.BluRay.DTS-HD.x264-BARC0DE]: Movies > BluRay (2050)
2050
>>> pynab.categories.determine_category(bname2, "")
2014-03-20 12:28:04,944 - INFO - category: [Machined.Reborn.2009.1080p.BluRay.x264-RUSTED]: Movies > BluRay (2050)
2050

As you can see, if I do not pass the group name, it falls back to the generic (non-group-specific) regexes and assigns the correct category. I will see if I can come up with a fix. In the meantime, I have another question: how do I fix the releases that were already categorized wrongly?

shpd commented 10 years ago

So I definitely nailed it down to the change 18 days ago which is not yet present in master. Using categories.py from master, it is classified correctly.

EDIT: The problematic statement is this new regex around here (https://github.com/Murodese/pynab/blob/development/pynab/categories.py#L281):

(regex.compile('', regex.I), [CAT_TV_FOREIGN, CAT_TV_SPORT, CAT_TV_DOCU, CAT_TV_HD, CAT_TV_SD, CAT_TV_ANIME, CAT_TV_OTHER])

The original logic (as far as I can tell) is to first search for TV releases, which need to contain more stuff like season and episode numbers to be useful and categorizable. If this is not present, just fail and try with one of the later categories (i.e. movies), which usually do not contain any metadata in their title besides format/encoding.

If we add a "blank" regex to the TV series parent, this will catch all movies. I'd make a pull request if I had a useful idea, but so far I only suggest commenting out this blank regex. Possible other solution: Leave it in, but remove the CAT_PARENT_TV from some newsgroups that predominantly contain movies.
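To illustrate why a blank regex swallows everything, here is a minimal sketch of first-match-wins rule ordering. It is not pynab's actual code: the rule lists, the `first_match` helper and the two non-blank patterns are made up for illustration, and it uses the standard `re` module rather than `regex`. Only the category IDs 5040 and 2050 come from the log output above.

```python
import re

# Category IDs as they appear in the log output above.
CAT_TV_HD = 5040
CAT_MOVIES_BLURAY = 2050

# Hypothetical, simplified rule list: (pattern, category), tried in order.
RULES_WITH_BLANK = [
    (re.compile(r"S\d{1,2}E\d{1,2}", re.I), CAT_TV_HD),  # season/episode marker
    (re.compile("", re.I), CAT_TV_HD),                   # blank regex: matches any string
    (re.compile(r"bluray", re.I), CAT_MOVIES_BLURAY),    # never reached while the blank rule exists
]

# The same rules with the blank pattern removed.
RULES_WITHOUT_BLANK = [rule for rule in RULES_WITH_BLANK if rule[0].pattern != ""]


def first_match(name, rules):
    """Return the category of the first rule whose pattern is found in name."""
    for pattern, category in rules:
        if pattern.search(name):
            return category
    return None


movie = "Beverly.Hills.Cop.1984.1080p.BluRay.DTS-HD.x264-BARC0DE"
print(first_match(movie, RULES_WITH_BLANK))     # the blank rule wins, so this lands in TV
print(first_match(movie, RULES_WITHOUT_BLANK))  # falls through to the movie rule
```

An empty pattern always produces a (zero-length) match, so any rule placed after it in the same list can never fire.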

So, now I only need a way to re-run the categorization. Suggestions?

jamesmeneghello commented 10 years ago

Can't even remember why I added that. I'll remove it.

You can recategorise stuff by setting them to category = null and then running scripts\process_uncategorised.py. From the mongo shell (or robomongo), something like:

db.releases.update({}, {$set:{category:null}},{multi:true});

Note that this will uncategorise all releases, so that they can all be re-categorised. You might just want to run it on a few categories or on releases from a certain group, etc. If you have a lot of releases, this'll take ages as well.
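If you only want to reset part of the collection, the filter and update documents could be built along these lines. This is a rough sketch, not something shipped with pynab: `build_uncategorise_request` is a hypothetical helper, and the field names `category.parent._id` and `group.name` are assumed to match the release documents in your database.

```python
def build_uncategorise_request(parent_category_id=None, exclude_groups=None):
    """Build (filter, update) documents that reset the category on a subset of releases.

    With no arguments the filter is empty, which matches every release,
    just like the shell command above.
    """
    query = {}
    if parent_category_id is not None:
        # Assumed field layout: the parent category id is stored on the release.
        query["category.parent._id"] = parent_category_id
    if exclude_groups:
        # Skip releases that were posted to these groups.
        query["group.name"] = {"$nin": list(exclude_groups)}
    update = {"$set": {"category": None}}
    return query, update


query, update = build_uncategorise_request(
    parent_category_id=5000,
    exclude_groups=["alt.binaries.teevee"],
)
print(query)
print(update)
# With PyMongo 3 or later, this pair could then be passed to something like
# db.releases.update_many(query, update); older PyMongo versions used
# db.releases.update(query, update, multi=True) instead.
```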

shpd commented 10 years ago

Haha, I couldn't wait and didn't think about your solution, so now I basically reinvented process_uncategorized.py. If anyone else has trouble with that: I made a few assumptions about the wrongly categorized releases:

  1. They are all in TV
  2. They were already postprocessed
  3. They didn't come from any of the 'real' TV groups

So I used the following Mongo query and operated only on those releases:

query = {
    "category.parent._id": 5000,
    "group.name": {'$nin': ["alt.binaries.teevee", "alt.binaries.tvseries", "alt.binaries.tv"]},
    "tvrage.possible": False
}
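For anyone doing the same, re-running the categoriser over the releases matched by that query could look roughly like this. It is only a sketch of the idea, not the script I actually ran: `recategorise` is a hypothetical helper, the `name` field and the nested `category._id` layout are assumptions about the release documents, and `categorise` stands in for a callable such as `pynab.categories.determine_category(name, group_name)`.

```python
def recategorise(releases, categorise):
    """Yield (release_id, new_category_id) for every release whose category would change.

    releases   -- iterable of release documents (plain dicts)
    categorise -- callable taking (release name, group name) and returning a category id
    """
    for release in releases:
        new_id = categorise(release["name"], release["group"]["name"])
        # A release may have no category yet (None), so guard the lookup.
        old_id = (release.get("category") or {}).get("_id")
        if new_id != old_id:
            yield release["_id"], new_id


# Stand-in categoriser for demonstration only; the real one lives in pynab.categories.
def fake_categorise(name, group_name):
    return 2050 if "BluRay" in name else 5040


sample = [
    {"_id": 1, "name": "Machined.Reborn.2009.1080p.BluRay.x264-RUSTED",
     "group": {"name": "alt.binaries.warez"}, "category": {"_id": 5040}},
    {"_id": 2, "name": "Some.Show.S01E01.720p.HDTV.x264",
     "group": {"name": "alt.binaries.warez"}, "category": {"_id": 5040}},
]
for release_id, category_id in recategorise(sample, fake_categorise):
    print(release_id, category_id)
```

Yielding only the changes keeps the database writes separate, so the result can be inspected before anything is updated.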

jamesmeneghello commented 10 years ago

Maybe give me a little time as well - I'm going to go over a bunch of the regex and make some improvements.

jamesmeneghello commented 10 years ago

Ah, cool. I just uncategorised everything and am re-categorising it to test it. Anime, in particular, should match a lot better with some new regex. The movie regex certainly needs improvement, too - movie sd/hd will match pretty much anything that it comes against.

jamesmeneghello commented 10 years ago

Ok, reworked a bunch of broken grouping regex and fixed some ordering (and updated a few regex), so things should match better now. Have another go.

jamesmeneghello commented 10 years ago

Actually nope, it's still dying horribly on processing a.b.multimedia releases.

jamesmeneghello commented 10 years ago

There we go, was a problem with the process script. Fixed in 6f7695f25c0687afa68e2a0b5893b3405f149f76.