kevinlekiller / Newznab-Blacklist

Blacklist for Newznab.

block non-english stuff #6

Open thezoggy opened 11 years ago

thezoggy commented 11 years ago

if these are supposed to be case variations of each other, there are a few differences between the strings, like spelling, or entries that are in one but not the others..

(100000, 'alt.binaries.', 'danish|deutsch|dutch|dksubs|flemish|french|hebrew|german|ita-eng|korsub|norwegian|serbian|spanish|spanisch|swedish|swesub|turkish|nl.?sub|.ita.|.japanese.', 1,1,0, 'Blocks non-english language releases.'),
(100001, 'alt.binaries.', 'Danish|Deutsch|Dutch|DKsubs|Flemish|French|Hebrew|German|KorSub|Norwegian|Serbian|Spanish|Spanisch|Swedish|SweSUB|Turkish|.Japanese.', 1,1,0, 'Blocks non-english language releases.'),
(100002, 'alt.binaries.*', 'DANiSH|DEUTSCH|DUTCH|DKSUBS|FLEMISH|FRENCH|HEBREW|GERMAN|KORSUB|NORWEGIAN|SERBIAN|SPANISH|SPANiSH|SWEDISH|SWEDiSH|SWESUB|TURKSIH|.GER|.JAPENESE.', 1,1,0, 'Blocks non-english language releases.'),

kevinlekiller commented 11 years ago

Yes, it's because I went on binsearch/nzbindex and looked up each one individually, looking for case sensitivity.

If you do find any that you would like added, please let me know.

soehest commented 11 years ago

Are there really any flemish releases? ;-)

kevinlekiller commented 11 years ago

100's of pages according to binsearch/nzbindex, as surprised as you haha!

soehest commented 11 years ago

lol i did not see that coming. I am surprised indeed :-)

thezoggy commented 11 years ago

TURKSIH looks to be misspelled.. unable to find any releases when searching for it on a rawsearch site.

thezoggy commented 11 years ago

here are my updated non-english entries.. left 'DE' out of the abbreviation list just to protect against false positives

('alt.binaries.*', '[-.](danish|deutsch|dutch|dksubs|flemish|french|hebrew|japenese|japanese|german|ita-eng|korsub|norwegian|serbian|spanish|spanisch|swedish|swesub|turkish|DKsubs|nl\\.?sub)[-.]', 1, 1, 0, 'Blocks non-english language releases'),
('alt.binaries.*', '[-.](BL|CZ|ES|FR|GER|ITA|KOR|NL|PL|SE)[-.]', 1, 1, 0, 'Block non-english abbreviations'),

thezoggy commented 11 years ago

someone on irc asked me about all your 'various' blocking.. looking specifically at:

(100010, 'alt.binaries.*', 'defa|knochen|giro|irls\\\\hybris|snoballkrigen!atkgalleria|realco|mp4sux|cytsunee|nzbroyalty', 1, 1, 0, 'Blocking various.'),

kevinlekiller commented 11 years ago

Added all but knoc[- ]?one, not enough time to test right now.

thezoggy commented 11 years ago

the recent changes you made still need work. 'defa' is one i would have removed.. realco and mp4sux were good to keep.

thezoggy commented 11 years ago

thought about the blacklist stuff tonight..

non-english content (alt.binaries.*)

NovaRip = looks to only do ITA releases. I see them as 'NovaRip' and 'Nov aRip' and 'Nova Rip'. Their releases sometimes are tagged with ITA but sometimes as ITA-ENG. Also, they usually have BDMux or DLMux as well.

Misfits.4x01.Ossessione.ITA.720p.BDMux.x264-NovaRip [1/1] - "Misfits.4x01.Ossessione.ITA.720p.BDMux.x264-NovaRip.nzb" yEnc (1/1)
Criminal.Minds.8x02.L.Accordo.ITA-ENG.1080p.DLMux.DD5.1.h264-NovaRip [1/1] - "Criminal.Minds.8x02.L.Accordo.ITA-ENG.1080p.DLMux.DD5.1.h264-NovaRip. nzb" yEnc (1/1)
Last.Resort.1x13.Bersaglio.Colorado.ITA-ENG.720p.DLMux.DD5.1.h264-Nova Rip [1/1] - "Last.Resort.1x13.Bersaglio.Colorado.ITA-ENG.720p.DLMux.DD5.1.h264-Nov aRip.nzb" yEnc (1/1)
Once.Upon.A.Time.2x04.Il.Coccodrillo.ITA-ENG.720p.DLMux.DD5.1.h264-Nov aRip [1/1] - "Once.Upon.A.Time.2x04.Il.Coccodrillo.ITA-ENG.720p.DLMux.DD5.1.h264-No vaRip.nzb" yEnc (1/1)
Bones.7x10.Una.Vita.Di.Umiliazioni.ITA.BDMux.x264-NovaRip [1/1] - "Bones.7x10.Una.Vita.Di.Umiliazioni.ITA.BDMux.x264-NovaRip.nzb" yEnc (1/1)

Now looking into ITA.. doing regex [-.]ITA[-.] looks to catch all the scenarios (which is already handled under a different blacklist):

john.alvarado - [1/7] - "La.Rivoluzione.di.Utena.dvd10.DVDRip.DivX.ITA-F2L.rar" yEnc (1/45)
Last.Resort.1x13.Bersaglio.Colorado.ITA-ENG.720p.DLMux.DD5.1.h264-Nova Rip [1/1] - "Last.Resort.1x13.Bersaglio.Colorado.ITA-ENG.720p.DLMux.DD5.1.h264-Nov aRip.nzb" yEnc (1/1)
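
A quick way to check that claim is to run the pattern over subjects quoted in this comment. The sketch below uses Python's `re` purely for illustration (the blacklist itself runs inside newznab, not Python) and assumes case-insensitive matching; the last subject is made up.

```python
import re

# Pattern quoted in the comment above; IGNORECASE is an assumption here.
ITA = re.compile(r"[-.]ITA[-.]", re.IGNORECASE)

subjects = [
    "La.Rivoluzione.di.Utena.dvd10.DVDRip.DivX.ITA-F2L.rar",
    "Last.Resort.1x13.Bersaglio.Colorado.ITA-ENG.720p.DLMux.DD5.1.h264-Nova Rip",
    "Misfits.4x01.Ossessione.ITA.720p.BDMux.x264-NovaRip",
]
hits = [bool(ITA.search(s)) for s in subjects]

# A space-delimited tag is not covered, because the class contains no space.
space_delimited = bool(ITA.search("Some Show ITA 720p"))  # made-up subject

print(hits, space_delimited)
```

All three quoted subjects match; a space-delimited `ITA` would need the separator class widened to `[-. ]`.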

thezoggy commented 11 years ago

some sample data,..

Seinpost Den Haag S01E06 NLSUBBED DUTCH - RealCo [00/34] - "Seinpost Den Haag S01E06 NLSUBBED DUTCH - RealCo.nzb" yEnc (1/1)
Tournee Generale S03E01 FLEMISH 720p HDTV - RealCo [00/43] - "Tournee Generale S03E01 FLEMISH 720p HDTV - RealCo.nzb" yEnc (1/1)
Community.S02E20.Fuereinander.geschaffen.GERMAN.DUBBED.DL.1080p.WebHD.x264-TVP [06/26] - "tvp-community-s02e20-1080p.nfo" yEnc (1/1)
Goede Tijden Slechte Tijden - S23E115 (08-02-2013) - RealCo [00/26] - "Goede Tijden Slechte Tijden - S23E115 (08-02-2013) - RealCo.nzb" yEnc (1/1)
[foreign]-[ Planet.E.Wintertraum.aus.Schneekanonen.German.DOKU.WS.HDTVRiP.XviD-UTOPiA ] [01/24] - "Planet.E.Wintertraum.aus.Schneekanonen.German.DOKU.WS.HDTVRiP.XviD-UTOPiA.par2" yEnc (1/1)
Israeli.Movie.Sof.Ha.Olam.Smola.2004.DVDRip-IL.XviD-DownRev [01/68] - "Sof.Ha.Olam.Smola.2004.DVDRip-IL.XviD-DownRev.par2" yEnc (1/1)
Little.Mrs.Pepperpot.Complete.PDTV.HebDub.XviD-Sweet-Star [47/47] - "Little.Mrs.Pepperpot.E50.PDTV.HebDub.XviD-Sweet-Star.avi" yEnc (1/203)

thezoggy commented 11 years ago

So this brings us to blacklist overlap... is it more efficient to have two restrictive blacklists to catch all the possible variants, or to rely on one overzealous regex that tries to catch more things and then add a whitelist to counter it?
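
The trade-off can be sketched in a few lines. Everything here is illustrative: the patterns are shortened, the whitelist entries and the two English subjects are hypothetical, and Python's `re` stands in for whatever engine the indexer uses.

```python
import re

# Option A: two narrow blacklists, each tied to one specific signal.
NARROW = [
    re.compile(r"[-.]ITA[-.]", re.IGNORECASE),   # language tag
    re.compile(r"Nov[ a]+Rip", re.IGNORECASE),   # group tag
]

# Option B: one broad blacklist plus a whitelist that rescues false positives.
BROAD = re.compile(r"ita|german|french", re.IGNORECASE)
WHITELIST = re.compile(r"capital|french\.connection", re.IGNORECASE)  # hypothetical


def blocked_narrow(subject: str) -> bool:
    """Block only when one of the narrow patterns matches."""
    return any(p.search(subject) for p in NARROW)


def blocked_broad(subject: str) -> bool:
    """Block on the broad pattern unless the whitelist also matches."""
    return bool(BROAD.search(subject)) and not WHITELIST.search(subject)


foreign = "Bones.7x10.Una.Vita.Di.Umiliazioni.ITA.BDMux.x264-NovaRip"
english_a = "The.French.Connection.1971.1080p.BluRay.x264-GRP"   # made up
english_b = "Capital.City.S01E01.720p.HDTV.x264-GRP"             # made up

for s in (foreign, english_a, english_b):
    print(blocked_narrow(s), blocked_broad(s), s)
```

With these inputs both approaches block the NovaRip release and let the two English titles through, but the broad pattern only gets there because the whitelist rescues `Capital` and `French.Connection`. Every new false positive means another whitelist entry.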

kevinlekiller commented 11 years ago

Can't quote, but re: "thought about the blacklist stuff tonight.."

I think it's a good idea to separate everything instead of keeping it all generic like it is currently.

thezoggy commented 11 years ago

so to catch all the novarip variants we can do: Nov[ a]+Rip
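
A small check of that pattern against the spellings seen in the samples, written in Python for illustration only and assuming case-insensitive matching:

```python
import re

NOVARIP = re.compile(r"Nov[ a]+Rip", re.IGNORECASE)

# The three spellings reported for the group tag.
variants = ["NovaRip", "Nov aRip", "Nova Rip"]
all_caught = all(NOVARIP.search(v) is not None for v in variants)

# The class allows repeats, so the pattern is looser than those three spellings.
also_matches = NOVARIP.search("Nov  aaRip") is not None

# A break before the "v" (as in the "No vaRip.nzb" file name) is not covered.
missed = NOVARIP.search("No vaRip") is None

print(all_caught, also_matches, missed)
```

The `No vaRip` form only shows up in a quoted file name, so whether it matters depends on which part of the subject the blacklist is applied to.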

thezoggy commented 11 years ago

from nn trunk, tv non-english:

(seizoen|staffel|danish|flemish|(\.| |\b|\-)(HU|NZ)|dutch|Deutsch|nl\.?subbed|nl\.?sub|\.NL|\.ITA|norwegian|swedish|swesub|french|german|spanish)[\.\- \b]
\.des\.(?!moines)|Chinese\.Subbed|vostfr|Hebrew\.Dubbed|\.HEB\.|Nordic|Hebdub|NLSubs|NL\-Subs|NLSub|Deutsch| der |German | NL |staffel|videomann
(danish|flemish|nlvlaams|dutch|nl\.?sub|swedish|swesub|icelandic|finnish|french|truefrench[\.\- ](?:.dtv|dvd|br|bluray|720p|1080p|LD|dvdrip|internal|r5|bdrip|sub|cd\d|dts|dvdr)|german|nl\.?subbed|deutsch|espanol|SLOSiNH|VOSTFR|norwegian|[\.\- ]pl|pldub|norsub|[\.\- ]ITA)[\.\- ]
(french|german)$

from nn trunk, movie non-english:

(\.des\.|danish|flemish|dutch|(\.| |\b|\-)(HU|FINA)|Deutsch|nl\.?subbed|nl\.?sub|\.NL|\.ITA|norwegian|swedish|swesub|french|german|spanish)[\.\- |\b]
Chinese\.Subbed|vostfr|Hebrew\.Dubbed|\.Heb\.|Hebdub|NLSubs|NL\-Subs|NLSub|Deutsch| der |German| NL |turkish
(danish|flemish|nlvlaams|dutch|nl\.?sub|swedish|swesub|icelandic|finnish|french|truefrench[\.\- ](?:dvd|br|bluray|720p|1080p|LD|dvdrip|internal|r5|bdrip|sub|cd\d|dts|dvdr)|german|nl\.?subbed|deutsch|espanol|SLOSiNH|VOSTFR|norwegian|[\.\- ]pl|pldub|norsub|[\.\- ]ITA)[\.\- ]

so yeah i think first step is to mimic the foreign detection that nn knows.. then improve upon that
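
As a rough baseline, the first tv pattern quoted above can be run against a few of the sample subjects from earlier in the thread. This is a sketch in Python, not nn code, and it assumes the pattern is applied case-insensitively. Note that `\b` inside a character class means a backspace character, not a word boundary.

```python
import re

# First "tv non-english" pattern, copied verbatim from the comment above.
NN_TV = re.compile(
    r"(seizoen|staffel|danish|flemish|(\.| |\b|\-)(HU|NZ)|dutch|Deutsch"
    r"|nl\.?subbed|nl\.?sub|\.NL|\.ITA|norwegian|swedish|swesub|french"
    r"|german|spanish)[\.\- \b]",
    re.IGNORECASE,  # assumption: case-insensitive, like the blacklist
)

flemish = "Tournee Generale S03E01 FLEMISH 720p HDTV - RealCo"
german = "Community.S02E20.Fuereinander.geschaffen.GERMAN.DUBBED.DL.1080p.WebHD.x264-TVP"
untagged = "Goede Tijden Slechte Tijden - S23E115 (08-02-2013) - RealCo"

for s in (flemish, german, untagged):
    print(bool(NN_TV.search(s)), s)
```

The first two are caught by their language tag. The third carries no language word at all, which is the kind of release that only a group-specific entry (here RealCo) would catch.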

thezoggy commented 11 years ago

the blacklist regexes are case insensitive, since the nn code already applies /i. fyi a much nicer version of that blacklist test is actually part of nn+ in misc/testing (test_blacklist.php)
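
In practice that means the separate lowercase / Capitalised / UPPERCASE rows collapse into one. A minimal illustration in Python, where `re.IGNORECASE` stands in for /i and the alternation is a shortened subset of the list:

```python
import re

# One lowercase alternation, matched case-insensitively.
LANG = re.compile(r"danish|deutsch|swedish|turkish", re.IGNORECASE)

spellings = ["danish", "Danish", "DANiSH", "SWEDiSH", "SweDish"]
covered = all(LANG.search(s) is not None for s in spellings)

# Case folding does not help with typos: the misspelled form stays unmatched.
typo_covered = LANG.search("TURKSIH") is not None

print(covered, typo_covered)
```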

thezoggy commented 11 years ago

trying out on that predb dump with:

[ -.](de|es|fr|ger|ita|ko|kor|nl|pl|se)[ -.]((19|20)\d\d|(480|720|1080)(i|p)|(bd|dvd.?|sat|vhs)?rip?|(bd|dl)mux|( -.)?(dub|sub)(ed|bed)?|complete|convert|(d|h|p|s)d?tv|dirfix|docu|dual|dvbs|dvdscr|eng|(h|x).?2?64|int(ernal)?|pal|proper|repack)
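
One thing worth checking in that pattern: inside a character class, `[ -.]` is read as a range from space to dot, not as the three literals space, hyphen and dot. Likewise `( -.)?` is a group (space, hyphen, any character), which may have been meant as a class. The sketch below shows both effects in Python; the same range rule applies in PCRE.

```python
import re

AS_WRITTEN = re.compile(r"[ -.]")   # range from " " (0x20) to "." (0x2E)
LITERALS = re.compile(r"[-. ]")     # hyphen first, so it is a plain literal

# Printable ASCII characters accepted by the range but not by the literal class.
extra = [
    chr(c)
    for c in range(0x20, 0x7F)
    if AS_WRITTEN.fullmatch(chr(c)) and not LITERALS.fullmatch(chr(c))
]
print("".join(extra))

# "( -.)?" needs exactly space, hyphen, any char, or nothing at all.
group_hit = re.fullmatch(r"( -.)?(dub|sub)", " -xdub") is not None
group_miss = re.fullmatch(r"( -.)?(dub|sub)", "-dub") is None
print(group_hit, group_miss)
```

The range still covers space, hyphen and dot, so real separators match. It simply accepts a dozen extra punctuation characters as well.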

false positive (de.dub), but since this gets blacklisted via the actual lang one (french.hdtv) i guess it's safe to ignore.

misses,

overall that thing is doing pretty damn well. so just need to add xbox360 and then the music stuff? then we could just nuke PL-PROPHET|PL.HappyNY|PL-PPTCLASSiCS to catch pretty much everything else we missed (in another regex with all the foreign specific groups).

thezoggy commented 11 years ago

things the actual lang (first) regex missed (fixed by pull request):

need to be handled on their own..

thezoggy commented 11 years ago

ok submitted pull update with some of my changes

thezoggy commented 11 years ago

so looks like we need to also catch 'vost',

then false positives:

for the false positives it looks like we can look for E##. by doing a look-behind and not matching in that scenario. (?:e\d\d.)
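
As written, `(?:e\d\d.)` is a non-capturing group rather than a look-behind; a negative look-behind uses the `(?<!...)` form. Here is a minimal sketch of the idea in Python, using a deliberately shortened stand-in for the abbreviation regex and made-up subjects:

```python
import re

# Shortened stand-ins for the language and tag lists (not the full pattern).
LANGS = r"(de|es|fr|ger|ita|nl|pl|se)"
TAGS = r"(complete|proper|repack|(?:480|720|1080)[ip])"

UNGUARDED = re.compile(r"[-. ]" + LANGS + r"[-. ]" + TAGS, re.IGNORECASE)

# (?<!e\d\d) is a negative look-behind: the separator must not directly
# follow an episode marker such as E02.
GUARDED = re.compile(r"(?<!e\d\d)[-. ]" + LANGS + r"[-. ]" + TAGS, re.IGNORECASE)

episode_title = "Some.Show.S01E02.De.Complete.Story.720p"   # made up
foreign = "Some.Movie.2012.GER.720p.x264"                    # made up
foreign_episode = "Some.Show.S01E02.GER.720p"                # made up

for s in (episode_title, foreign, foreign_episode):
    print(bool(UNGUARDED.search(s)), bool(GUARDED.search(s)), s)
```

The guard removes the episode-title false positive, but it also lets through a foreign release whose language tag sits right after the episode number, so it trades one kind of miss for another.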

thezoggy commented 11 years ago

note to self:

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
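
This matters for group tags that contain `.` or `-`. A short illustration in Python, where `re.escape` plays the role of escaping by hand; the tags are the ones named earlier in the thread and the subjects are made up:

```python
import re

tags = ["PL-PROPHET", "PL.HappyNY", "PL-PPTCLASSiCS"]

# Escaped: "." and "-" are treated as literal characters.
GROUPS = re.compile("|".join(re.escape(t) for t in tags), re.IGNORECASE)

# Unescaped: "." matches any character.
UNESCAPED = re.compile("|".join(tags), re.IGNORECASE)

real = "Film.2012.PL.HappyNY"      # made up
lookalike = "Film.2012.PLxHappyNY"  # made up

print(bool(GROUPS.search(real)), bool(GROUPS.search(lookalike)))
print(bool(UNESCAPED.search(lookalike)))
```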

thezoggy commented 11 years ago

because we only match a language when it is followed by a specific tag/codec/set/etc, false positives are reduced.. it's similar to how nn does it in the trunk

kevinlekiller commented 11 years ago

@nivong, Bloodline Der Killer German 2011 AC3 DVDRiP XviD-XF

It sees german, but that is not enough; it also needs something else next to german, like a year (2011). Compare the old blacklist, which only looks for the word itself, e.g. serbian: A.Serbian.Film.2010.DVDRip.XviD-BDMF - [01/68] - asf-bdmf-sample.avi
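
The difference between the two approaches can be sketched with the two titles above. Both patterns are shortened illustrations written in Python, not the actual blacklist entries:

```python
import re

# Old style: the language word alone is enough.
OLD = re.compile(r"german|serbian", re.IGNORECASE)

# New style: the language word must be followed by a year, resolution or tag.
NEW = re.compile(
    r"[-. ](german|serbian)[-. ]"
    r"((?:19|20)\d\d|(?:480|720|1080)[ip]|dubbed|doku)",
    re.IGNORECASE,
)

german_release = "Bloodline Der Killer German 2011 AC3 DVDRiP XviD-XF"
english_film = "A.Serbian.Film.2010.DVDRip.XviD-BDMF"

for s in (german_release, english_film):
    print(bool(OLD.search(s)), bool(NEW.search(s)), s)
```

The old pattern blocks both titles. The new one blocks only the German release, because `Serbian` is followed by `Film` rather than by a year or tag.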

thezoggy commented 11 years ago

found some issues with the current regex, working on being able to properly test it within nn. stay tuned

thezoggy commented 11 years ago

got the test_blacklist fixed.. thanks l2g! still need to test some things before moving onto the other regexes.