NZBGet still does not correctly handle some mangled WtFnZb nzbs

andrew-kennedy commented 1 year ago

I have an example that it fails to rename correctly, leading to the infamous abc.xyz filenames upon completion. The filenames look like this:

[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E01.The.Cinderella.Thing.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[1/8] - "" yEnc 2101113560 (1/2932)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E02.Dont.Call.It.a.Kidnapping.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[2/8] - "" yEnc 2241590477 (1/3128)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E03.Lambs.in.the.Dark.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[3/8] - "" yEnc 2112810717 (1/2948)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E04.He.Bought.a.Hat.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[4/8] - "" yEnc 1978951252 (1/2761)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E05.The.Dogcatcher.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[5/8] - "" yEnc 2320989951 (1/3238)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E06.Some.Lusty.Tornado.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[6/8] - "" yEnc 2307729324 (1/3220)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E07.Keep.Your.Enemies.Closer.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[7/8] - "" yEnc 1771669125 (1/2472)
[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E08.The.James.Bond.Clause.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[8/8] - "" yEnc 1848750401 (1/2580)

paul-chambers commented 1 year ago

Could you please provide the NZB file that produces the abc.xyz filenames?

andrew-kennedy commented 1 year ago

Can't upload it to github but here's a link to it: https://gofile.io/d/1HTwqh

paul-chambers commented 1 year ago

Thanks, Andrew. As it happens, I'm in the midst of reworking that area of code. When I first implemented it, I'd only encountered a couple of variations of 'mangled' filenames, but I have encountered several more since then. I'm still testing this new implementation, but the good news is that it already parses this variant correctly.

Interestingly, seven of the eight files described by this NZB use one format, but [2/8] uses:

[N3wZ] \6aZWVk237607\::[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E02.Dont.Call.It.a.Kidnapping.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[2/8] - "" yEnc 2241590477 (1/3128)

I don't think I've seen an NZB file that uses more than one variant of the 'subject' field before.
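
To make the two variants concrete, here is a minimal Python sketch. It is not NZBGet's parser: the regular expression and the `extract_filename` helper are assumptions made up for this illustration. The pattern is deliberately unanchored so that any prefix, such as the `[N3wZ] \6aZWVk237607\::` seen on [2/8], is skipped.

```python
import re

# Hypothetical pattern for illustration only; NZBGet's real parser differs.
# The greedy ".+" lets the filename itself contain '[' and ']'.
WTFNZB_RE = re.compile(
    r'\[PRiVATE\]-\[WtFnZb\]-'
    r'\[(?P<filename>.+)\]-\[(?P<index>\d+)/(?P<count>\d+)\]'
    r'\s+-\s+""\s+yEnc'
)


def extract_filename(subject):
    """Return the embedded filename, or None if the subject does not match."""
    m = WTFNZB_RE.search(subject)
    return m.group('filename') if m else None


# Subject without a prefix (file 1/8 from this issue).
plain = ('[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E01.The.Cinderella.Thing.'
         '1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-playWEB.mkv]-[1/8] - "" yEnc '
         '2101113560 (1/2932)')

# Subject with the extra prefix (file 2/8 from this issue).
prefixed = (r'[N3wZ] \6aZWVk237607\::[PRiVATE]-[WtFnZb]-[The.Diplomat.S01E02.'
            r'Dont.Call.It.a.Kidnapping.1080p.NF.WEB-DL.DDP5.1.Atmos.H.264-'
            r'playWEB.mkv]-[2/8] - "" yEnc 2241590477 (1/3128)')

print(extract_filename(plain))
print(extract_filename(prefixed))
```

Using `search` instead of `match` is the only thing that makes the prefixed variant work here; whether other prefixes exist in the wild is not something this sketch can answer.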

prikhi commented 1 year ago

I hit a similar issue with the attached NZB (zipped for GitHub uploads):

SpellForce.3.Reforced.v161554.339115-GOG-xpost.nzb.zip

paul-chambers commented 1 year ago

I'm still working on this. I have limited time available, and there's an astonishing number of formatting variations in use, seemingly all undocumented (at least I've not found any documentation). I'm trying to devise a 'clean' way to handle them all with code that's not littered with special cases, since those tend to be fragile and labor-intensive to maintain.

MattPark commented 7 months ago

Here's another one, a mangled season pack -- American.Dad.S01.1080p.DSNP.WEBRip.DD.5.1.x265-EDGE2020-xpost.nzb.zip

I'm using this script:

#!/usr/bin/env python3
### NZBGET SCAN SCRIPT

# Extract filenames from subjects containing [PRiVATE]-[WtFnZb]
#
# This extension extracts obfuscated filenames from .nzb files
# created by WtFnZb.
#
# Supported subject formats:
#
# - [PRiVATE]-[WtFnZb]-[filename]-[1/5] - "" yEnc 0 (1/1)
#
# - [PRiVATE]-[WtFnZb]-[5]-[1/filename] - "" yEnc
#
#
# NOTE: Requires Python 3 and lxml (sudo apt install python3-lxml)
#

### NZBGET SCAN SCRIPT

import sys
import os
import re

# Exit codes used by NZBGet
POSTPROCESS_SUCCESS = 93
POSTPROCESS_NONE = 95
POSTPROCESS_ERROR = 94

try:
    from lxml import etree
except ImportError:
    print(u'[ERROR] Python lxml required. Please install with "sudo apt install python3-lxml" or "pip install lxml".')
    sys.exit(POSTPROCESS_ERROR)

patterns = (
    re.compile(r'^(?P<prefix>.*\[PRiVATE\]-\[WtFnZb\]-)'
               r'\[(?P<total>\d+)\]-\[(?P<segment>\d+)\/(?P<filename>.{3,}?)\]'
               r'\s+-\s+""\s+yEnc\s+',
               re.MULTILINE | re.UNICODE),
    re.compile(r'^(?P<prefix>.*\[PRiVATE\]-\[WtFnZb\]-)'
               r'\[(?P<filename>.{3,}?)\]-\[(?P<segment>\d+)/(?P<total>\d+)\]'
               r'\s+-\s+""\s+yEnc\s+',
               re.MULTILINE | re.UNICODE),
    re.compile(r'^(?P<prefix>.*\[PRiVATE\]-\[WtFnZb\]-)'
               r'\[(?P<filename>.{3,}?)\]-\[(?P<segment>\d+)_(?P<total>\d+)\]'
               r'\s+-\s+""\s+yEnc\s+',
               re.MULTILINE | re.UNICODE))

nzb_dir = os.getenv('NZBNP_DIRECTORY')
nzb_filename = os.getenv('NZBNP_FILENAME')
nzb_name = os.getenv('NZBNP_NZBNAME')
nzb_file_naming = os.getenv('NZBOP_FILENAMING')

if nzb_dir is None or nzb_filename is None or nzb_name is None:
    print('Please run as NZBGet plugin')
    sys.exit(POSTPROCESS_ERROR)

if nzb_file_naming is not None and nzb_file_naming.lower() != 'nzb':
    print(u'[ERROR] NZBGet setting FileNaming (under Download Queue) '
          u'must be set to "Nzb" for this extension to work correctly, exiting.')
    sys.exit(POSTPROCESS_ERROR)

if not os.path.exists(nzb_dir):
    print('[ERROR] NZB directory doesn\'t exist, exiting')
    sys.exit(POSTPROCESS_ERROR)

if not nzb_filename.lower().endswith('.nzb'):
    print(u'[ERROR] {} is not a .nzb file.'.format(nzb_filename))
    sys.exit(POSTPROCESS_ERROR)

nzb = os.path.join(nzb_dir, nzb_filename)
if not os.path.exists(nzb):
    print('[ERROR] {nzb} doesn\'t exist, exiting'.format(nzb=nzb))
    sys.exit(POSTPROCESS_ERROR)

with open(nzb, mode='rb') as infile:
    tree = etree.parse(infile)

changed = False
file_count = 0
totals = set()
filenames = set()

# iter() replaces the deprecated getiterator()
for f in tree.iter('{http://www.newzbin.com/DTD/2003/nzb}file'):
    subject = f.get('subject')
    if subject is None:
        print(u'[DETAIL] No subject in <file>, skipping')
        continue
    file_count += 1
    result = [re.match(pattern, subject) for pattern in patterns]
    matched = [m for m in result if m is not None]
    if len(matched) == 0:
        print(u'[INFO] No pattern matching subject, exiting.')
        sys.exit(POSTPROCESS_NONE)
    elif len(matched) > 1:
        print(u'[ERROR] Multiple patterns matched, exiting.')
        sys.exit(POSTPROCESS_ERROR)
    else:
        match = matched[0].groupdict()

    if match['filename'].lower().endswith('.par2'):
        print(u'[INFO] par2 exists, exiting')
        sys.exit(POSTPROCESS_NONE)

    if int(match['segment']) > int(match['total']):
        print(u'[DETAIL] Segment index is greater than total, skipping')
        continue

    # NZBGet subject parsing changes when duplicate filenames are present
    # prefix duplicates to avoid that
    if match['filename'] in filenames:
        match['filename'] = u'{}.{}'.format(file_count, match['filename'])

    filenames.add(match['filename'])

    s = u'WtFnZb "{filename}" yEnc ({segment}/{total})'.format(
        filename = match['filename'],
        segment = match['segment'],
        total = match['total'])

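    # Drop non-ASCII characters for logging, then decode so Python 3 prints text rather than a b'...' repr
    print(u'[INFO] New subject {subject}'.format(subject=s.encode('ascii', 'ignore').decode('ascii')))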
    f.set('subject', s)
    changed = True
    totals.add(int(match['total']))

if not changed:
    print(u'[WARNING] No subject changed, exiting.')
    sys.exit(POSTPROCESS_NONE)

if len(totals) != 1:
    print(u'[WARNING] Mixed values for number of total segments, exiting.')
    sys.exit(POSTPROCESS_NONE)

if totals.pop() != file_count:
    print(u'[WARNING] Listed segment count does not match <file> count, exiting.')
    sys.exit(POSTPROCESS_NONE)

org = u'{}.wtfnzb.original.processed'.format(nzb)
exists_counter = 0
while os.path.exists(org):
    exists_counter += 1
    org = u'{}.{}.wtfnzb.original.processed'.format(nzb, exists_counter)

print(u'[INFO] Preserving original nzb as {}'.format(org))
os.rename(nzb, org)

print(u'[INFO] Writing {}'.format(nzb))
with open(nzb, mode='wb') as outfile:
    outfile.write(etree.tostring(tree,
        xml_declaration=True,
        encoding=tree.docinfo.encoding,
        doctype=tree.docinfo.doctype))

sys.exit(POSTPROCESS_SUCCESS)
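
The core of the script is the subject rewrite. The following is a stripped-down sketch of just that step, using the standard library's `xml.etree.ElementTree` instead of lxml and a made-up two-file NZB. The sample XML, the filenames in it, and the variable names are assumptions for illustration; the duplicate handling and the sanity checks of the full script are left out.

```python
import re
import xml.etree.ElementTree as ET

NZB_NS = '{http://www.newzbin.com/DTD/2003/nzb}'

# Made-up two-file NZB, trimmed to the one attribute this sketch needs.
SAMPLE_NZB = """<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">
  <file subject='[PRiVATE]-[WtFnZb]-[episode.one.mkv]-[1/2] - "" yEnc 100 (1/1)'/>
  <file subject='[PRiVATE]-[WtFnZb]-[episode.two.mkv]-[2/2] - "" yEnc 100 (1/1)'/>
</nzb>"""

# Same shape as the script's second pattern (filename first, then index/total).
PATTERN = re.compile(
    r'^.*\[PRiVATE\]-\[WtFnZb\]-'
    r'\[(?P<filename>.{3,}?)\]-\[(?P<segment>\d+)/(?P<total>\d+)\]'
    r'\s+-\s+""\s+yEnc\s+')

root = ET.fromstring(SAMPLE_NZB)
for f in root.iter(NZB_NS + 'file'):
    m = PATTERN.match(f.get('subject', ''))
    if m:
        # Rewrite into the quoted-filename form the script emits.
        f.set('subject',
              'WtFnZb "{filename}" yEnc ({segment}/{total})'.format(**m.groupdict()))

subjects = [f.get('subject') for f in root.iter(NZB_NS + 'file')]
print(subjects)
```

The rewritten subjects carry the filename in quotation marks, which is the form the script relies on NZBGet picking up when FileNaming is set to "Nzb".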

I think I have it installed correctly, though I don't see a dedicated area for scan scripts any more, so maybe I'm doing it wrong: [screenshot]

I do notice that it finds the correct filenames, but still names them abc.xyz: [screenshot]

The .nzb file follows the expected XML structure and contains subject lines that appear to match one of the patterns the script is designed to process. For instance, the subject line:

[PRiVATE]-[WtFnZb]-[American.Dad.S01E01.Pilot.1080p.DSNP.WEBRip.DDP.5.1.H.265.-EDGE2020.mkv]-[1/7] - "" yEnc 225628476 (1/315)

seems to fit the pattern:

\[PRiVATE\]-\[WtFnZb\]-\[(?P<filename>.{3,}?)\]-\[(?P<segment>\d+)/(?P<total>\d+)\] - "" yEnc

This indicates that, at least for this subject line, the script should be able to extract the filename (American.Dad.S01E01.Pilot.1080p.DSNP.WEBRip.DDP.5.1.H.265.-EDGE2020.mkv) and other details correctly.
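
This can be checked directly. The snippet below copies the second pattern from the script and applies it to the subject line quoted here; only the variable names `pattern`, `subject`, and `fields` are introduced for the example.

```python
import re

# Second pattern from the script, copied as-is.
pattern = re.compile(r'^(?P<prefix>.*\[PRiVATE\]-\[WtFnZb\]-)'
                     r'\[(?P<filename>.{3,}?)\]-\[(?P<segment>\d+)/(?P<total>\d+)\]'
                     r'\s+-\s+""\s+yEnc\s+',
                     re.MULTILINE | re.UNICODE)

subject = ('[PRiVATE]-[WtFnZb]-[American.Dad.S01E01.Pilot.1080p.DSNP.WEBRip.'
           'DDP.5.1.H.265.-EDGE2020.mkv]-[1/7] - "" yEnc 225628476 (1/315)')

m = pattern.match(subject)
# Empty dict if the pattern does not match, so the prints below never raise.
fields = m.groupdict() if m else {}

print(fields.get('filename'))
print(fields.get('segment'), fields.get('total'))
```

So the pattern itself is not the problem for this subject; the match succeeds and yields the expected filename with index 1 of 7.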

I did see this message:

WtFnZb-Renamer: NZBGet setting FileNaming (under Download Queue) must be set to "Nzb" for this extension to work correctly, exiting.

But I saw a post going back to this thread, https://github.com/nzbget/nzbget/issues/795, which said that if you take FileNaming off Auto, it will break other stuff. How should I handle this?

MattPark commented 4 months ago

Someone else went a different route and developed a patch for daemon/queue/NzbFile.cpp that fixes it without an extension, but I can't tell whether it works with FileNaming set to Auto.

---
 daemon/queue/NzbFile.cpp | 55 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/daemon/queue/NzbFile.cpp b/daemon/queue/NzbFile.cpp
index 0630fd93..ec201930 100644
--- a/daemon/queue/NzbFile.cpp
+++ b/daemon/queue/NzbFile.cpp
@@ -135,6 +135,61 @@ void NzbFile::ParseSubject(FileInfo* fileInfo, bool TryQuotes)
        }
    }

+    // Mainly for wtfnzb:
+    // - "[PRiVATE]-[WtFnZb]-[par.vol01+02.par2]-[5/7] - "" yEnc 2112160 (1/3)"
+    // - "[PRiVATE]-[WtFnZb]-[par.vol01+02[TaoE].par2]-[5/7] - "" yEnc 2112160 (1/3)"
+    // - "[PRiVATE]-[WtFnZb]-[32]-[5/filename.mkv]"
+    // Note that the filename itself can contain '[' and ']'.
+    if (strcasestr(subject, "wtfnzb]-[")) {
+        // Look for the last '.' on the subject line.
+        char *point = strrchr(subject, '.');
+        if (point) {
+            // Expect that the extension is only a few characters (at most 5) and alphanumeric.
+            char *end = strchr(point, ']');
+            if (end && (end - point) <= 5) {
+                // Look for the first balanced '[' before point.
+                // We can't just use the first '[' since filenames can contain '[' and ']'.
+                char *start = point;
+                int brackets = 0;
+                bool valid = false;
+                while (start > subject) {
+                    char c = *start;
+                    if (c == ']') {
+                        brackets--;
+                    } else if (c == '[') {
+                        brackets++;
+                        if (brackets == 1) {
+                            valid = true;
+                            break;
+                        }
+                    } else if (c == '/') {
+                        // for the 3rd variant.
+                        // TODO: perhaps check if brackets == 0 here.
+                        valid = true;
+                        break;
+                    }
+                    start--;
+                }
+                if (valid) {
+                    // start is pointing to the '/' or '[' preceding the filename.
+                    start++;
+                    int len = (int)(end - start);
+                    if (len >= 6 && len <= 800) {
+                        BString<1024> filename;
+                        filename.Set(start, len);
+                        fileInfo->SetFilename(filename);
+                        m_nzbInfo->AddMessage(Message::mkInfo, BString<1024>("Using WTFNZB filename: %s", *filename));
+                        // Confirm the filename so we don't use the article filename.
+                        // Normally for wtfnzb, the article filename is something like abc.xyz.nfo.
+                        // See 'useFilenameFromArticle' in QueueCoordinator.cpp.
+                        fileInfo->SetFilenameConfirmed(true);
+                        return;
+                    }
+                }
+            }
+        }
+    }
+
    if (TryQuotes)
    {
        // try to use the filename in quatation marks
-- 
2.25.1
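
For readers who want to experiment with the patch's heuristic without rebuilding NZBGet, here is a line-by-line Python transliteration of the C++ logic above. It is a sketch for experimentation only: the function name `wtfnzb_filename` is made up, and nothing here is part of NZBGet. The three test subjects are the ones listed in the patch's own comment.

```python
def wtfnzb_filename(subject):
    """Mirror of the patch's heuristic; returns the filename or None."""
    if 'wtfnzb]-[' not in subject.lower():
        return None
    # Last '.' on the subject line is assumed to start the extension.
    point = subject.rfind('.')
    if point < 0:
        return None
    # First ']' after the dot; the distance check counts the dot itself.
    end = subject.find(']', point)
    if end < 0 or end - point > 5:
        return None
    # Walk backwards to the first unbalanced '[' (or a '/' for the 3rd variant).
    depth = 0
    start = point
    found = False
    while start > 0:
        c = subject[start]
        if c == ']':
            depth -= 1
        elif c == '[':
            depth += 1
            if depth == 1:
                found = True
                break
        elif c == '/':
            found = True
            break
        start -= 1
    if not found:
        return None
    start += 1  # skip the '[' or '/' that precedes the filename
    if 6 <= end - start <= 800:
        return subject[start:end]
    return None


print(wtfnzb_filename('[PRiVATE]-[WtFnZb]-[par.vol01+02.par2]-[5/7] - "" yEnc 2112160 (1/3)'))
print(wtfnzb_filename('[PRiVATE]-[WtFnZb]-[par.vol01+02[TaoE].par2]-[5/7] - "" yEnc 2112160 (1/3)'))
print(wtfnzb_filename('[PRiVATE]-[WtFnZb]-[32]-[5/filename.mkv]'))
```

Two consequences of the heuristic are worth noting: because `end - point` includes the dot, extensions longer than four characters are rejected, and a subject whose last '.' falls outside the bracketed filename is not handled.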

PlaceboPRS commented 4 months ago

Not sure if I should bump this one or start a new one, but I recently reinstalled Windows and, as a result, installed the new versions of both NZBGet and the VideoSort script. Since then I've been getting loads more obfuscated downloads. I've tried to resolve it on my own but just can't get the settings right. It seems to affect random episodes of things; this is the current one: NCIS.S21E05.The.Plan.1080p.AMZN.WEB-DL.DDP5.1.H.264-NTb.zip

Currently installed: NZBGet 23.1-testing-20240325, VideoSort 10.1

nzbgetcom / nzbget

NZBGet still does not correctly handle some mangled WtFnZb nzbs #105