Closed by shpd 10 years ago
So I definitely nailed it down to the change 18 days ago, which is not yet present in master. Using categories.py from master, it is classified correctly.
EDIT: The problematic statement is this new regex around here (https://github.com/Murodese/pynab/blob/development/pynab/categories.py#L281):
(regex.compile('', regex.I), [CAT_TV_FOREIGN, CAT_TV_SPORT, CAT_TV_DOCU, CAT_TV_HD, CAT_TV_SD, CAT_TV_ANIME, CAT_TV_OTHER])
The original logic (as far as I can tell) is to first search for TV releases, which need to contain more stuff like season and episode numbers to be useful and categorizable. If this is not present, just fail and try with one of the later categories (i.e. movies), which usually do not contain any metadata in their title besides format/encoding.
If we add a "blank" regex to the TV series parent, this will catch all movies. I'd make a pull request if I had a useful idea, but so far I only suggest commenting out this blank regex.
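To illustrate why a blank pattern swallows everything, here is a minimal sketch of ordered rule matching with fall-through. It uses the standard-library `re` module rather than the `regex` package, and the rule lists, patterns, category labels and helper names are all hypothetical simplifications, not pynab's actual code (the real blank rule maps to a list of categories, not a single one).

```python
import re

# Hypothetical category labels; the real constants live in pynab/categories.py.
CAT_TV_HD = "TV/HD"
CAT_TV_OTHER = "TV/Other"
CAT_MOVIE_HD = "Movies/HD"

# Ordered (pattern, category) rules; the first pattern that matches wins.
tv_rules = [
    (re.compile(r"s\d{1,2}e\d{1,2}", re.I), CAT_TV_HD),
]
movie_rules = [
    (re.compile(r"(720p|1080p)", re.I), CAT_MOVIE_HD),
]


def first_match(name, rules):
    """Return the category of the first rule whose pattern is found in name."""
    for pattern, category in rules:
        if pattern.search(name):
            return category
    return None


def categorise(name, parents):
    """Try each parent's rules in order and fall through when none match."""
    for rules in parents:
        category = first_match(name, rules)
        if category is not None:
            return category
    return None


movie = "Some.Movie.2013.1080p.BluRay.x264"

# Without a blank pattern the TV rules fail and the movie rules get a turn.
before = categorise(movie, [tv_rules, movie_rules])

# An empty pattern matches every string, so the TV parent never fails.
tv_rules_with_blank = tv_rules + [(re.compile("", re.I), CAT_TV_OTHER)]
after = categorise(movie, [tv_rules_with_blank, movie_rules])
```

In this sketch `before` ends up as the movie category, while `after` ends up as the TV catch-all, because the empty pattern matches before the movie rules are ever consulted.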
Possible other solution: Leave it in, but remove the CAT_PARENT_TV from some newsgroups that predominantly contain movies.
So, now I only need a way to re-run the categorization. Suggestions?
Can't even remember why I added that. I'll remove it.
You can recategorise stuff by setting them to category = null and then running scripts\process_uncategorised.py. From the mongo shell (or robomongo), something like:
db.releases.update({}, {$set:{category:null}},{multi:true});
Note that this will uncategorise all releases, so that they can all be re-categorised. You might just want to run it on a few categories or on releases from a certain group, etc. If you have a lot of releases, this'll take ages as well.
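If you only want to reset a subset, the filter document can be narrowed instead of left empty. The following is a rough sketch in Python; `build_reset` is a hypothetical helper, and the field names `group.name` and `category.parent._id` are assumptions about the release schema taken from the query posted elsewhere in this thread.

```python
def build_reset(group_names=None, parent_ids=None):
    """Build a (filter, update) pair that clears the category on a subset."""
    query = {}
    if group_names:
        # Only releases posted to one of these newsgroups.
        query["group.name"] = {"$in": list(group_names)}
    if parent_ids:
        # Only releases whose current parent category is one of these ids.
        query["category.parent._id"] = {"$in": list(parent_ids)}
    update = {"$set": {"category": None}}
    return query, update


query, update = build_reset(
    group_names=["alt.binaries.multimedia"],
    parent_ids=[5000],
)
# With a recent pymongo and a live database this could be applied as:
#     db.releases.update_many(query, update)
```

Calling `build_reset()` with no arguments yields an empty filter, which is equivalent to the reset-everything command above.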
Haha, I couldn't wait and didn't think about your solution, so now I basically reinvented process_uncategorized.py. If anyone else has trouble with that:
I made a few assumptions about the wrongly categorized releases, so I used the following Mongo query and operated only on those releases:
query = {
    "category.parent._id": 5000,
    "group.name": {'$nin': ["alt.binaries.teevee", "alt.binaries.tvseries", "alt.binaries.tv"]},
    "tvrage.possible": False
}
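For anyone wanting to do the same, a loop over the releases matched by this query could look roughly like the sketch below. This is not the script actually used in this thread: `recategorise` and its `categorise` callback are hypothetical, the `name` field is an assumed part of the release schema, and `collection` is expected to behave like a pymongo collection (`find` and `update_one`).

```python
def recategorise(collection, query, categorise):
    """Re-run categorise on every release matching query.

    Returns the number of releases whose category changed.
    """
    changed = 0
    for release in collection.find(query):
        # "name" is an assumed field; "group.name" appears in the query.
        new_category = categorise(release["name"], release["group"]["name"])
        if new_category != release.get("category"):
            collection.update_one(
                {"_id": release["_id"]},
                {"$set": {"category": new_category}},
            )
            changed += 1
    return changed
```

Passing the categoriser in as a callback keeps the loop independent of how categories.py is called, so it can be tried out against a handful of sample documents first.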
Maybe give me a little time as well - I'm going to go over a bunch of the regex and make some improvements.
Ah, cool. I just uncategorised everything and am re-categorising it to test it. Anime, in particular, should match a lot better with some new regex. The movie regex certainly needs improvement, too - movie sd/hd will match pretty much anything that it comes against.
Ok, reworked a bunch of broken grouping regex and fixed some ordering (and updated a few regex), so things should match better now. Have another go.
Actually nope, it's still dying horribly on processing a.b.multimedia releases.
There we go, was a problem with the process script. Fixed in 6f7695f25c0687afa68e2a0b5893b3405f149f76.
Reproduced on the latest development commit (c863199213e43ae58be302dc1608088800f94655):
No posts get categorized as Movie; everything goes directly to "TV". Two examples:
They get their category assigned at https://github.com/Murodese/pynab/blob/development/pynab/releases.py#L233 which calls into categories.py. I can reproduce the error by hand in an interactive session:
As you can see, if I do not pass the group name, it uses an unspecific regex and assigns the correct category. I will see if I can come up with a fix. In the meantime, I have another question on how to fix these wrongly categorized ones: