Ren'Py extraction error - GitHub Issues

daiaji commented 5 months ago

 python run.py
---------------------------------
{'file': 'txt', 'workpath': '/home/daiaji/renpy/test', 'engineName': 'RenPy', 'outputFormat': 3, 'outputPartMode': 0, 'nameList': '', 'regDic': None, 'outputFormatExtra': -1, 'encode': 'UTF-8-SIG', 'print': [False, True, True, True, True], 'splitParaSep': '\\r\\n', 'maxCountPerLine': 512, 'cutoff': False, 'cutoffCopy': True, 'noInput': False, 'splitAuto': False, 'ignoreSameLineCount': False, 'ignoreNotMaxCount': False, 'fixedMaxPerLine': False, 'binEncodeValid': False, 'pureText': False, 'tunnelJis': False, 'subsJis': False, 'transReplace': True, 'preReplace': False, 'skipIgnoreCtrl': False, 'skipIgnoreUnfinish': False, 'ignoreEmptyFile': True}
>>> Line 1 :  '\n'
>>> Line 2 :  '# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:6\n'
>>> Line 3 :  'translate chinese nathan_bumpinto_5c11f213:\n'
>>> Line 4 :  '\n'
>>> Line 5 :  '    # "You\'re in the middle of class when a loud, blaring alarm pierces your ears."\n'
捕获的文本: "You're in the middle of class when a loud, blaring alarm pierces your ears."
start: 7
end: 82
---------------------------提取或导入时发生错误---------------------------
Traceback (most recent call last):
  File "/home/daiaji/repo/SExtractor/main/thread.py", line 18, in run
    self.window.extractFileThread()
  File "/home/daiaji/repo/SExtractor/main/mainWindow.py", line 199, in extractFileThread
    mainExtractTxt(args)
  File "/home/daiaji/repo/SExtractor/src/main_extract_txt.py", line 70, in mainExtractTxt
    mainExtract(args, parse)
  File "/home/daiaji/repo/SExtractor/src/main_extract.py", line 717, in mainExtract
    parseImp()
  File "/home/daiaji/repo/SExtractor/src/main_extract_txt.py", line 56, in parse
    var.parseImp(var.content, var.listCtrl, dealOnce)
  File "/home/daiaji/repo/SExtractor/extract_RenPy.py", line 77, in parseImp
    if 'unfinish' in listCtrl[-1]:
                     ~~~~~~~~^^^^
IndexError: list index out of range
--------------------------------------------------------------------------
异常中断文件名: nathan2
运行时间：0.004 秒

I reduced the sample to a minimal one:

# TODO: Translation updated at 2024-06-04 18:44

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:6
translate chinese nathan_bumpinto_5c11f213:

    # "You're in the middle of class when a loud, blaring alarm pierces your ears."
    ""

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:7
translate chinese nathan_bumpinto_07d44a03:

    # "There had been an announcement earlier today that the school would be holding a fire alarm test, but they had failed to mention {i}when{/i} it would occur."
    ""

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:10
translate chinese nathan_bumpinto_29df4d07:

    # "Mr. Harley sighs from the front of the room and holds up his hands in order to settle your unruly class."
    ""

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:11
translate chinese nathan_bumpinto_49ff7602:

    # "Mr. Harley" "Alright, alright. Calm down. Form a neat, orderly line and we'll be in and out in no time."
    "Mr. Harley" ""

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:14
translate chinese nathan_bumpinto_725b9509:

    # m "That's what she said."
    m ""

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:15
translate chinese nathan_bumpinto_4f17f86e:

    # "You hear Matt snickering to himself behind you as you all line up to walk into the hall."
    ""

# game/fullstartASE/Chaddley McDiggle/nathan2.rpy:19
translate chinese nathan_bumpinto_b2c95802:

    # V "If there was a {i}real{/i} fire, you can be certain I wouldn't be waiting in some line. Literally, the window is {i}right there.{/i}"
    V ""

Name extraction seems to be the problem.

It would probably be better to filter out non-data lines with r'^ ', and to match names with r'^ ".+?"'.
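The IndexError in the traceback comes from evaluating listCtrl[-1] while listCtrl is still empty, apparently because the first captured line has no earlier entry to look back at. A minimal sketch of a guard, assuming listCtrl is a list of dicts as the 'unfinish' check suggests (the helper name is hypothetical and not part of SExtractor):

```python
def last_ctrl_unfinished(list_ctrl):
    """Return True only when a previous entry exists and carries 'unfinish'."""
    # An empty list short-circuits here instead of raising IndexError.
    return bool(list_ctrl) and 'unfinish' in list_ctrl[-1]
```

With such a guard, the check on line 77 of extract_RenPy.py would simply be false for the first captured line instead of aborting the run.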

satan53x commented 4 months ago

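Logically it should take multiple extraction passes, but how to do that is still a bit of a headache. Explicitly excluding the list ["Armors.json", "Weapons.json", "Items.json", "Skills.json"] in the rules feels rather absurd.

I don't quite understand what you need. If ["Armors.json", "Weapons.json", "Items.json", "Skills.json"] should be excluded because they don't need code extraction, just put them in a separate folder instead of keeping them with the Map files. In SE you can pick a config (default, 1, 2 and so on), so different regexes can point at different folders.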

satan53x commented 4 months ago

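In my earlier case the MapXXX.json files contained both code dialogue and UI text, which is why I used nested folders. That is a bit more tedious and has to be done serially. For example: put all MapXXX.json files in D:\work, extract the code entries with config 1 and translate them; the newly generated json goes to D:\work\new. (Regenerating does not delete files in the new folder; it only overwrites the files for which translations were read in that run.) Then put Items.json into this new folder, extract it together with the Map files there, and use config 2 on D:\work\new to extract description and similar keys for translation.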

daiaji commented 4 months ago

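Your approach is reasonable overall, especially in terms of flexibility and maintainability. A few points can be improved and clarified.

Suggested improvements

Make the hierarchy and key references explicit:
- Referring to array elements as files.name.0 is unclear and easy to misread. Consider a more explicit key name or path reference.
Hierarchy of the exclusion rule:
- You mentioned that exclude_files only applies to name: ["*.json"]. To achieve that, put exclude_files at the same level as name so that it only affects that entry.
Improve the file structure and logic:
- A clearer configuration hierarchy and clearer key references make the code that processes the configuration more intuitive.

Revised configuration file example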

- Engine_RPGMV:
    type: json
    files:
      - name: ["Armors.json", "Weapons.json", "Items.json", "Skills.json"]
        sample:
          - 10_search: ^<root\s+>(.+?)$
          - extractKey: [description, name]
      - name: ["Enemies.json", "MapInfos.json", "Classes.json"]
        sample:
          - 10_search: ^<root\s+>(.+?)$
          - extractKey: [name]
      - name: ["*.json"]
        exclude_files: [0, 1]  # exclude the first two file configs by index
        sample:
          - 00_skip: ^<.+?>$
          - 01_skip: ^<(?!code).+$
          - 10_search: ^<code102>([\S\s]+)$
          - 15_search: ^<code401>\s?(?P<name>[\S\s]+)：$
          - 16_search: ^<code401>(?P<unfinish>[\S\s]+)$
          - 20_search: ^<.+?>([^ -\[\]-~][\S\s]+)$
          - extractKey: [name, description, nickname]
- Engine_RPGVX:
    type: json
    files:
      - name: ["*.json"]
        sample:
          - 00_skip: ^<.+?>$
          - 01_skip: ^<(?!code).+$
          - 10_search: ^<code102>([\S\s]+)$
          - 15_search: ^<code401>\\C\[\d\]\[(?P<name>.+?)\]\\C\[\d\]$
          - 16_search: ^<code401>(?P<unfinish>[\S\s]+)$
          - 20_search: ^<.+?>([^ -\[\]-~][\S\s]+)$
          - extractKey: [none]

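Python code example

import yaml

def excluded_names(files, exclude_indices):
    # Collect the file names claimed by the file configs referenced by index.
    names = []
    for idx in exclude_indices:
        if 0 <= idx < len(files):
            names.extend(files[idx].get('name', []))
    return names

def process_file(file_config, excluded):
    # File-processing logic goes here; skip any file whose name is in `excluded`.
    pass

def main():
    with open('config.yaml', 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)

    for engine in config:
        for engine_name, engine_config in engine.items():
            files = engine_config.get('files', [])
            for file_config in files:
                # exclude_files holds indices of sibling configs, not of this one.
                excluded = excluded_names(files, file_config.get('exclude_files', []))
                process_file(file_config, excluded)

if __name__ == "__main__":
    main()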

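Advantages of this approach

Clarity: excluding file configs by index avoids complicated key references.
Flexibility: the exclude_files list can easily be adjusted to exclude different file configs.
Maintainability: a clear configuration hierarchy and comments make the file easier to read and maintain.

Conclusion

With these improvements, your approach becomes clearer and easier to use, further improving the flexibility of the configuration file and the maintainability of the code.

This is the solution an LLM gave me, and it looks fairly sound to me. It beats an implicit exclusion list and is reasonably elegant. What do you think? If it's fine, I'll start working on it.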

satan53x commented 4 months ago

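That alone can't stay compatible with the old = syntax. The issue isn't only reading the ini file: the ini only provides the default rules, and in practice users mostly edit the regexes themselves in the GUI.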

satan53x commented 4 months ago

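Actually it could be handled like the Custom and None options, with its own logic: add a Yaml option in the ini and then read the yaml file (as a string only, without parsing it yet). Right now, between the GUI window (mainWindow) and the extract scripts, the parameters are parsed only once, in extract, by splitting on =; that is the only part that needs to change. (That is, if the input is yaml, parse it as yaml, so the existing behavior is unaffected.)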

daiaji commented 4 months ago

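Logically it should take multiple extraction passes, but how to do that is still a bit of a headache. Explicitly excluding the list ["Armors.json", "Weapons.json", "Items.json", "Skills.json"] in the rules feels rather absurd.

I don't quite understand what you need. If ["Armors.json", "Weapons.json", "Items.json", "Skills.json"] should be excluded because they don't need code extraction, just put them in a separate folder instead of keeping them with the Map files. In SE you can pick a config (default, 1, 2 and so on), so different regexes can point at different folders.

That is exactly what I want to avoid. Different RM files already need different handling, so handling them by file name is enough; the main goal is further automation. It's also a fairly common approach. I believe Translator++ does it this way too, so it should be broadly applicable.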

daiaji commented 4 months ago

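That alone can't stay compatible with the old = syntax. The issue isn't only reading the ini file: the ini only provides the default rules, and in practice users mostly edit the regexes themselves in the GUI.

What does the old syntax look like?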

satan53x commented 4 months ago

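Then the GUI could get a drop-down option: the existing one is INI, and YAML is added. That stays compatible with the old syntax, and the new path only needs to call a new parsing function in main_extract. By the time the parameters reach each engine script they are already fully processed, so nothing needs to change there. File names are also handled in main_extract, outside the individual engine scripts.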

satan53x commented 4 months ago

What does the old syntax look like?

It's just the ini syntax, e.g. 10_search=^(.+?)$

daiaji commented 4 months ago

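What does the old syntax look like?

It's just the ini syntax, e.g. 10_search=^(.+?)$

INI

: a text format for storing key-value data; it only supports a simple single-level structure.

It is a configuration file format used by Windows. Similar formats exist on Unix-like systems, although there is no formal standard.

Syntax

INI text files use the .ini extension.
A semicolon ; starts a single-line comment; some parsers also accept # .
Each line declares one key-value pair, with an equals sign = joining key and value.
- key and value are both strings; some parsers also recognize special values such as true and false.
- Whitespace around key and value is ignored unless they are delimited by double quotes.
Square brackets [ ] declare a section.
- The string inside the brackets is the section name; spaces in it are not ignored.
- Within a section, keys must not repeat.

Example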

[Section 1]
id = 1
tips = Hello World

[MySQL]
runner = root
work_dir = /opt/mysql

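It doesn't really matter, does it? ini just splits on =, while yaml and json split on :. Or is there some special handling of =? Migrating ini to yaml isn't much work either.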

daiaji commented 4 months ago

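To ramble a bit more: from a readability standpoint, the output dictionary could use yaml too. yaml can basically replace json completely, it supports comments, and it reads better. Probably only the projects we depend on use json; every other config could be rewritten as yaml. Besides, reading json elements into yaml through a library needs no explicit conversion. It's just that more projects use json, probably because of node.js.

When this need came up I also considered rewriting the rules as json, but json has no advantage other than being widely used. I figured I shouldn't have to be tormented by {} and , when hand-writing rules, so I dropped the idea. Also, json doesn't support comments.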

satan53x commented 4 months ago

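Or is there some special handling of =? Migrating ini to yaml isn't much work either.

No, it's for compatibility with the existing syntax. Users generally don't use the default regexes as they are; when using SE they save their modified regexes in a notepad for later reference. (When needed they paste them straight into the GUI and never edit the ini file by hand.) So the ini syntax can't be dropped. Instead, add yaml parsing alongside it so the two coexist, and parse with whichever one is selected.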
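The coexistence described above can be sketched as a single dispatch point. This is only an illustration under assumptions: parse_rules and its fmt argument are hypothetical names, the comment handling for the ini branch is guessed, and the yaml branch relies on PyYAML being installed.

```python
def parse_rules(text, fmt="ini"):
    """Parse rule text pasted from the GUI into a dict of key -> value."""
    if fmt == "yaml":
        import yaml  # PyYAML, imported lazily so the ini path needs no extra dependency
        return yaml.safe_load(text) or {}
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' or ';' are not rules.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        # Split on the first '=' only: a regex may itself contain '=', e.g. (?=...)
        key, value = line.split("=", 1)
        rules[key.strip()] = value.strip()
    return rules
```

Because both branches return the same dict shape, the engine scripts downstream would not need to know which syntax the user chose.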

daiaji commented 4 months ago

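So the ini syntax can't be dropped. Instead, add yaml parsing alongside it so the two coexist, and parse with whichever one is selected.

That's the more troublesome part. But at least the default rules should be migrated to yaml, and ini can be deprecated later. That's a reasonable compromise.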

satan53x commented 4 months ago

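The current flow is: the GUI window (mainWindow) parses engine.ini/reg.ini -> part of the loaded data is used as parameters and part is reassembled into a string (the regex part) -> the user can edit this string in the window, i.e. customize it -> the string is sent to main_extract, which parses it as ini -> the parsed parameters are sent to the individual engine script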

daiaji commented 4 months ago

OK, I'll go take a look first.

daiaji commented 3 months ago

@satan53x Do extract_RPGMV and extract_JSON use contentSeparate? I searched, and neither of them seems to use it.

satan53x commented 3 months ago

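Do extract_RPGMV and extract_JSON use contentSeparate?

No. Both parse the file directly as json. contentSeparate is generally only used when files are read in bin (binary) mode. txt mode splits on line breaks via readlines(), so it doesn't use contentSeparate either.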

daiaji commented 3 months ago

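Do extract_RPGMV and extract_JSON use contentSeparate?

No. Both parse the file directly as json. contentSeparate is generally only used when files are read in bin (binary) mode. txt mode splits on line breaks via readlines(), so it doesn't use contentSeparate either.

So readFileData isn't used anywhere either? Leftover from old code, I take it.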

satan53x commented 3 months ago

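You mean the one in the config? That one does seem to be unused; it was later replaced by checking directly for the module function readFileDataImp.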

daiaji commented 3 months ago

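You mean the one in the config? That one does seem to be unused; it was later replaced by checking directly for the module function readFileDataImp.

The ParseVar class sets ignoreDecodeError = False. Does var.ignoreDecodeError = ExVar.ignoreDecodeError in initParseVar still have any effect? And what about ignoreDecodeError="1" in reg.ini? Shouldn't ignoreDecodeError be a boolean? Is the ignoreDecodeError in the config unused as well?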

satan53x commented 3 months ago

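Does var.ignoreDecodeError = ExVar.ignoreDecodeError in initParseVar still have any effect?

Yes, it does. var.ignoreDecodeError is the local variable that is ultimately used, and ExVar.ignoreDecodeError is the global one. ignoreDecodeError ends up being tested with if, so 0 is equivalent to False and anything else is equivalent to True.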
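One caveat with a plain if test: in Python the string "0" is truthy, so a value that arrives from an ini file as text only behaves like a flag if it is converted first. Whether SExtractor converts the value before the test is not shown in this thread; the helper below is a hypothetical sketch of such a conversion, not the project's code.

```python
def as_flag(value):
    """Interpret a config value as a boolean flag."""
    if isinstance(value, str):
        # Plain truthiness would treat the string "0" as True, so compare the text.
        return value.strip().lower() not in ("", "0", "false")
    return bool(value)
```

With this, as_flag("1") and as_flag(True) are both true, while as_flag("0"), as_flag(0) and as_flag(False) are all false.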

daiaji commented 3 months ago

This is the configuration file I finally settled on.

Engine_RPGMV:
  file: json
  postfix: .json
  contentSeparate: ""
  regDic: "2"
  files:
    - name_patterns: ["Armors.json", "Weapons.json", "Items.json", "Skills.json"]
      sample:
        10_search: ^<root\s+>(.+?)$
        extractKey: [description, name]
    - name_patterns: ["Enemies.json", "MapInfos.json", "Classes.json"]
      sample:
        10_search: ^<root\s+>(.+?)$
        extractKey: [name]
    - name_patterns: ["*"]
      sample:
        00_skip: "^<.+?>$"
        01_skip: "^<(?!code)"
        10_search: '^<code102>([\S\s]+)$'
        15_search: '^<code401> ?(?P<name>[\S\s]+)：$'
        16_search: '^<code401>(?P<unfinish>[\S\s]+)$'
        20_search: '^<.+?>([^ -\[\]-~][\S\s]+)$'
        extractKey: [name, description, nickname]
Engine_RPGVX:
  file: json
  postfix: .json
  contentSeparate: ''
  regDic: '2'
  sample:
    00_skip: '^<.+?>$'
    01_skip: '^<(?!code)'
    10_search: '^<code102>([\S\s]+)$'
    15_search: '^<code401>\\C\[\d\]\[(?P<name>.+?)\]\\C\[\d\]$'
    16_search: '^<code401>(?P<unfinish>[\S\s]+)$'
    20_search: '^<.+?>([^ -\[\]-~][\S\s]+)$'
    extractKey: [none]
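    #First use the tools in the tools/RPGMakerVX folder to convert rvdata to json
    #Key:name,description,nickname,note,message1,message2,message3,message4
    #It is recommended to extract the Map dialogue text separately from the other json files, using different regexes for easier handling

The ini-to-yaml migration is done. https://github.com/daiaji/SExtractor/commit/c1e2a75132d16e16bf22d940e0bd0f450bb9b104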

Below are the complete steps for implementing what you need, including the functions to add, the functions to modify, and a summary:

Steps:

Parse the files configuration:
- In the selectEngine and selectReg functions, read self.engineConfig[group].get('files') or self.regConfig[regName].get('files') and build the files_config list.
- Parse the YAML string with yaml.load, then process the parsed data with parse_files_data.
Modify the extractFile function:
- In extractFile, convert self.sampleBrowser.toPlainText() into an object and parse the files configuration with parse_files_data.
- Create one mainExtract thread per entry in files_data (that is, one per name_patterns key).
- For each thread's args dict, set name_patterns, exclude_patterns and regDic, using the sample field of the corresponding files_data entry as regDic.
Modify the getFiles function:
- Turn getFiles into get_files and add two parameters: name_patterns and exclude_patterns.
- Match file name patterns with fnmatch.fnmatch and filter files by name_patterns and exclude_patterns.
Modify the mainExtract function:
- In mainExtract, enumerate files with get_files, passing args["name_patterns"], args["exclude_patterns"] and var.Postfix.

New functions:

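def parse_files_data(self, config):
    """
    Parse the file configuration and build a list of file data entries.

    Args:
        config: File configuration, either a dict or a list.

    Returns:
        A list of file data entries; each holds the file name patterns,
        the sample configuration and the list of excluded name patterns.
    """

    def create_file_data(name_patterns, sample, exclude_patterns):
        """
        Build a single file data entry.

        Args:
            name_patterns: List of file name patterns.
            sample: Sample configuration.
            exclude_patterns: Collection of excluded file name patterns.

        Returns:
            A dict describing one file data entry.
        """
        return {
            "name_patterns": name_patterns,
            "exclude_patterns": list(exclude_patterns),
            "sample": sample,
        }

    file_data_list = []

    if isinstance(config, dict):
        file_data_list.append(create_file_data(["*"], config, set()))
    elif isinstance(config, list):
        all_name_patterns = []
        wildcard_configs = []
        for file_config in config:
            name_patterns = file_config.get("name_patterns", ["*"])
            exclude_indices = file_config.get("exclude_patterns", [])
            sample = file_config.get("sample", {})

            # Resolve index-based exclusions into the patterns of the referenced entries.
            exclude_patterns = set()
            for idx in exclude_indices:
                if 0 <= idx < len(config):
                    exclude_patterns.update(config[idx].get("name_patterns", []))

            file_data = create_file_data(name_patterns, sample, exclude_patterns)
            file_data_list.append(file_data)

            if name_patterns == ["*"]:
                wildcard_configs.append(file_data)
            else:
                # Only explicit patterns are collected; adding "*" here would make
                # the catch-all entry exclude every file, including its own.
                all_name_patterns.extend(name_patterns)

        for wildcard_config in wildcard_configs:
            # exclude_patterns is a list at this point, so append instead of update().
            for pattern in all_name_patterns:
                if pattern not in wildcard_config["exclude_patterns"]:
                    wildcard_config["exclude_patterns"].append(pattern)

    return file_data_list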

import fnmatch
from pathlib import Path

def get_files(dirpath, name_patterns, exclude_patterns, postfix=""):
    """
    Enumerate the files in the working directory and filter them by
    name_patterns and exclude_patterns.

    Args:
        dirpath (str): Path of the target directory.
        name_patterns (list): File name patterns to match (wildcards, e.g. '*.json').
        exclude_patterns (list): File name patterns to exclude (wildcards).
        postfix (str): File suffix to match (empty by default).

    Returns:
        list: Names of the files that satisfy all conditions.
    """
    try:
        directory = Path(dirpath)
        if not directory.is_dir():
            raise FileNotFoundError(f"{dirpath} is not a valid directory.")

        all_files = [file_path.name for file_path in directory.iterdir() if file_path.is_file()]
    except (FileNotFoundError, PermissionError) as e:
        print(f"Error accessing directory: {e}")
        return []

    matching_files = [
        file_name for file_name in all_files
        if any(fnmatch.fnmatch(file_name, pattern) for pattern in name_patterns)
        and not any(fnmatch.fnmatch(file_name, exclude) for exclude in exclude_patterns)
        and file_name.endswith(postfix)
    ]

    return matching_files
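To see how the include and exclude lists interact without touching the file system, the same filter can be applied to a plain list of names. This is a simplified sketch: filter_names is a hypothetical helper, and it uses fnmatch.fnmatchcase rather than fnmatch.fnmatch so that matching is case-sensitive on every platform.

```python
import fnmatch

def filter_names(names, name_patterns, exclude_patterns):
    """Keep names that match any include pattern and no exclude pattern."""
    return [
        name for name in names
        if any(fnmatch.fnmatchcase(name, p) for p in name_patterns)
        and not any(fnmatch.fnmatchcase(name, p) for p in exclude_patterns)
    ]

names = ["Armors.json", "Items.json", "Map001.json"]
# The catch-all entry sees only the files not claimed by the explicit entries.
catch_all = filter_names(names, ["*"], ["Armors.json", "Items.json"])
```

Here catch_all contains only Map001.json, which is the behavior the wildcard entry in the configuration is meant to have.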

Modified functions:

# mainWindow.py
class MainWindow(QMainWindow, Ui_MainWindow):
    # ... other code ...

    def selectEngine(self, index):
        # ... other code ...

        # Show the sample
        engineName = self.engineNameBox.currentText()
        group = 'Engine_' + engineName
        value = self.engineConfig[group].get('files')
        if value is None:
            value = self.engineConfig[group].get('sample')
        if value:
            self.sampleBrowser.setText(self.yaml_to_string(value))
        else:
            self.sampleBrowser.setText('')

        # ... other code ...

    def selectReg(self, index):
        # ... other code ...

        # Show the sample
        regName = self.regNameBox.currentText()
        value = self.regConfig[regName].get('files')
        if value is None:
            value = self.regConfig[regName].get('sample')
        if value:
            self.sampleBrowser.setText(self.yaml_to_string(value))
        else:
            self.sampleBrowser.setText('')

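        # ... other code ...

    def extractFile(self):
        # ... other code ...

        # Parse the files configuration (PyYAML requires an explicit Loader)
        files_data = self.parse_files_data(
            yaml.load(self.sampleBrowser.toPlainText(), Loader=yaml.SafeLoader))

        # Create one thread per file group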
        for i, file_data in enumerate(files_data):
            args = {
                'file':fileType,
                'workpath':self.mainDirEdit.text(),
                'engineName':engineName,
                'outputFormat':self.outputFileBox.currentIndex(),
                'outputPartMode':self.outputPartBox.currentIndex(),
                'nameList':self.nameListEdit.text(),
                "name_patterns": file_data["name_patterns"],
                "exclude_patterns": file_data["exclude_patterns"],
                "regDic": file_data["sample"],
                'outputFormatExtra':self.outputFileExtraBox.currentIndex() - 1,
                'encode': self.txtEncodeBox.currentText(),
                'print': self.getExtractPrintSetting(),
                'splitParaSep': self.splitSepEdit.text(),
                'maxCountPerLine': int(self.splitMaxEdit.text()),
            }
            self.thread = extractThread()
            self.thread.window = self
            self.thread.args = args
            self.thread.finished.connect(self.handleThreadFinished)
            self.thread.start()

        # ... other code ...

    def parse_files_data(self, config):
        # ... body of parse_files_data ...
        ...

# ... other code ...

def mainExtract(args, parseImp, initDone=None):
    if len(args) < 4:
        printError("main_extract参数错误", args)
        return
    #showMessage("开始处理...")
    path = args['workpath']
    var.workpath = path
    if initArgs(args) != 0: return
    if initDone: initDone()
    #print(path)
    var.partMode = 0
    var.outputDir = 'ctrl'
    var.inputDir = 'ctrl'
    #print('---------------------------------')
    if os.path.isdir(path):
        #print(var.workpath)
        createFolder()
        var.curIO = var.io
        readFormat() # read in the translations
        # Changed: call get_files instead of getFiles
        files = get_files(
            var.workpath,
            args["name_patterns"],  # passed as name_patterns
            args["exclude_patterns"],  # passed as exclude_patterns
            var.Postfix,  # passed as postfix
        )
        fmt = var.curIO.outputFormat
        if fmt in [2, 6, 7]: # list-type formats
            needReverse = True
            files = list(reversed(files))
        else:
            needReverse = False
        for i, name in enumerate(files):
            showProgress(i, len(files))
            var.filename = name
            printDebug('读取文件:', var.filename)
            parseImp()
            keepAllOrig(needReverse)
            #break # for testing
        showProgress(100)
        printInfo('读取文件数:', var.inputCount)
        writeFormat()
        printInfo('新建文件数:', var.outputCount)
        var.curIO = var.ioExtra
        writeFormat()
        writeCutoffDic()
    else:
        printError('未找到主目录')
    extractDone()

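Summary:

With the changes above, you have implemented processing of multiple file groups, with each pass handling only the files whose names match specific patterns. Your approach is sound, and the code is clear and easy to follow.

Other suggestions:

Consider moving parse_files_data into a separate module to organize the code better.
Consider adding error handling, for example notifying the user and stopping when the files configuration fails to parse.
Consider adding logging to make the program's runtime state easier to trace.

I hope this explanation helps you understand your approach and implement what you need!

I put together a partial implementation of batched extraction by file name with an LLM. Does the JSON merge still need changes? As it stands, every extraction thread that gets created will overwrite the dictionary file, right? Then there's the progress bar for multiple files. I haven't started on ini compatibility yet; some code probably still needs to be abstracted.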

satan53x commented 3 months ago

extractKey: [description, name]

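You don't need to change the parameter format. Keep using a single string as before, the same way the right-hand side of = used to be a single string that was then parsed; that way the parsing functions don't need to change.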

extractKey: 'description,name'

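Parameters like extraData, for example, are engine-specific: the engine script parses the string itself. Some end up as a list, some as a number, and some stay a single string. The extractKey parameter is currently used by the RPGMV engine, and RPGMV.py splits the string into a list. If it were a list from the start, you would have to change RPGMV.py too, and then possibly every engine that reads list parameters, such as CSV.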
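If list values in yaml were ever wanted anyway, one way to avoid touching every engine script would be a small normalizer at the point where the parameter is read. This is a hypothetical sketch, not how RPGMV.py is written today; split_keys is an invented name.

```python
def split_keys(value):
    """Accept 'description,name' or ['description', 'name'] and return a list."""
    if isinstance(value, str):
        # Old style: a single comma-separated string, as on the right of '=' in ini.
        return [key.strip() for key in value.split(",") if key.strip()]
    # New style: already a sequence, e.g. parsed from a yaml list.
    return [str(key) for key in value]
```

Both spellings of extractKey then produce the same list, so the string form suggested above stays valid.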

satan53x commented 3 months ago

Does the JSON merge still need changes?

I don't think the JSON merge needs changes.

  files:
    - name_patterns: ["Armors.json", "Weapons.json", "Items.json", "Skills.json"]
      sample:
        10_search: ^<root\s+>(.+?)$
        extractKey: [description, name]

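This name_patterns could go inside sample, which keeps the format more uniform and removes one level of nesting. Right now there are two shapes, files-file-sample and sample; they could become samples-sample and sample, with name_patterns as just another parameter under sample. (Rename it to something like fileMatch or filenameMatch so it isn't confused with the name of the text.) You're currently still passing name_patterns in through args; it would be simpler to store it in var as a parameter.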

  samples:
    - 10_search: ^<root\s+>(.+?)$
      extractKey: 'description,name'
      filenameMatch: '^(Armors|Weapons|Items|Skills)'

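And since you're already using wildcards, you might as well use regexes directly. Otherwise users have to tell wildcards and regexes apart, which isn't consistent.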
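A minimal sketch of how a flat samples list with filenameMatch could be consumed, assuming the first matching entry wins and an entry without filenameMatch acts as the catch-all (both are assumptions; pick_sample is a hypothetical name):

```python
import re

def pick_sample(filename, samples):
    """Return the first sample whose filenameMatch regex matches the file name."""
    for sample in samples:
        # A missing filenameMatch becomes the empty pattern, which matches anything.
        pattern = sample.get("filenameMatch", "")
        if re.search(pattern, filename):
            return sample
    return None

samples = [
    {"10_search": r"^<root\s+>(.+?)$", "extractKey": "description,name",
     "filenameMatch": "^(Armors|Weapons|Items|Skills)"},
    {"10_search": r"^<code102>([\S\s]+)$"},  # no filenameMatch: catch-all
]
```

Ordering the explicit entries before the catch-all gives the exclusion effect without a separate exclude list.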

daiaji commented 3 months ago

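Does the JSON merge still need changes?

I don't think the JSON merge needs changes.
  files:
    - name_patterns: ["Armors.json", "Weapons.json", "Items.json", "Skills.json"]
      sample:
        10_search: ^<root\s+>(.+?)$
        extractKey: [description, name]
This name_patterns could go inside sample, which keeps the format more uniform and removes one level of nesting. Right now there are two shapes, files-file-sample and sample; they could become samples-sample and sample, with name_patterns as just another parameter under sample. (Rename it to something like fileMatch or filenameMatch so it isn't confused with the name of the text.) You're currently still passing name_patterns in through args; it would be simpler to store it in var as a parameter.
  samples:
    - 10_search: ^<root\s+>(.+?)$
      extractKey: 'description,name'
      filenameMatch: '^(Armors|Weapons|Items|Skills)'
And since you're already using wildcards, you might as well use regexes directly. Otherwise users have to tell wildcards and regexes apart, which isn't consistent.

Reducing the nesting is a good idea.

And since you're already using wildcards, you might as well use regexes directly. Otherwise users have to tell wildcards and regexes apart, which isn't consistent.

At the time I felt wildcard matching on file names would be enough. Halfway through I realized that since I was using wildcards anyway, I might as well use regexes. But it was only a test and it ran, so I left it alone.

Setting aside that choosing threading.local left the global variables half-crippled, and that data races show up in places I can't pin down, the code does work functionally. Unfortunately, the data races are real.

With things in this state, I might as well go straight to multiprocessing. Then I can still use global variables. My current thinking is to move the pile of variable initialization in mainExtract into the MainWindow class and then copy the initialized objects into the child processes. That reuses more of the old code, and multiprocessing.Manager also provides process-safe sharing of objects between processes. It's much more convenient than threading.local. multiprocessing should also make better use of a modern CPU's multiple cores.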

satan53x commented 3 months ago

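My current thinking is to move the pile of variable initialization in mainExtract into the MainWindow class and then copy the initialized objects into the child processes. multiprocessing should also make better use of a modern CPU's multiple cores.

I don't think that's necessary. The variables in main_extract are all function-local, and the data all ends up stored in ExVar. Parallelism isn't needed either: extraction usually takes only a few seconds, and getting it under a second brings no real benefit. I basically don't care about runtime efficiency in this tool, since most of the time is spent on the AI translation side anyway; speed here doesn't matter.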

daiaji commented 3 months ago

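My current thinking is to move the pile of variable initialization in mainExtract into the MainWindow class and then copy the initialized objects into the child processes. multiprocessing should also make better use of a modern CPU's multiple cores.

I don't think that's necessary. The variables in main_extract are all function-local, and the data all ends up stored in ExVar. Parallelism isn't needed either: extraction usually takes only a few seconds, and getting it under a second brings no real benefit. I basically don't care about runtime efficiency in this tool, since most of the time is spent on the AI translation side anyway; speed here doesn't matter.

The main data race happens in writeFormat (really, all the global variables race). Right now it dumps those variables to disk. I considered simply running the passes one after another, but writeFormat overwrites the dictionary, so rather than appending to the dictionary file on disk it's better to accumulate into a variable and dump it to disk once. Running sequentially actually seems like more work; running concurrently, if thread safety is handled properly, means not having to worry about all that. It's just that isolating threads was very painful with multithreading, whereas with multiple processes the stronger isolation makes it less painful and fewer things need to change.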
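The "accumulate into a variable, then dump once" idea can be sketched independently of whether the passes run in threads, processes or one after another. The functions below are hypothetical and do not mirror writeFormat; extract_group stands in for whatever produces one group's dictionary.

```python
import json

def extract_all(groups, extract_group):
    """Run every group's extraction and merge the results in memory."""
    merged = {}
    for group in groups:
        # Each pass contributes its entries; nothing is written to disk yet.
        merged.update(extract_group(group))
    return merged

def dump_once(merged, path):
    """Write the merged dictionary to disk a single time."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
```

Since the file is written only after all groups have been merged, a later group can no longer overwrite the output of an earlier one.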

daiaji commented 3 months ago

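Progress update. My takeaway: avoid threading whenever you can. Thread isolation with threading has too many restrictions, while the separate memory space that multiprocessing provides is far more pleasant. A child process also inherits the parent's memory when it is created, global variables included, so it's enough to initialize the variables in the parent. multiprocessing also makes better use of the CPU. Apart from IO-bound work, there really seems to be no case where threading beats multiprocessing; wrestling with data races is something else.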
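One caveat about inheriting the parent's globals: that holds for the "fork" start method. Under "spawn", which is the default on Windows and macOS, the child starts a fresh interpreter and re-imports the main module, so values assigned at runtime in the parent are not carried over. A small sketch for checking which case applies (the helper name is hypothetical):

```python
import multiprocessing as mp

def child_inherits_globals(method=None):
    """True when children start as a fork of the parent and so see its globals."""
    # With no argument, ask multiprocessing which start method is in effect.
    method = method or mp.get_start_method()
    return method == "fork"
```

When this returns False, state has to be passed to the child explicitly, for example as arguments or through a multiprocessing.Manager.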

Dictionary extraction looks fine. Once I finish translating, I'll try writing the text back.

daiaji commented 3 months ago

Skills.json is rather odd. I used this custom rule to extract text from Skills.json.

10_search=^(.+)$
extractKey=description

The description of id 325 can't be extracted.

    {
        "id": 325,
        "animationId": 0,
        "damage": {
            "critical": false,
            "elementId": 0,
            "formula": "0",
            "type": 0,
            "variance": 20
        },
        "description": "On the next day, after reading a book, starting Willpower is restored by 1\nfor every 5 \\c[2]Rose Relationship (\\V[303])\\c[0].",
        "effects": [],
        "hitType": 0,
        "iconIndex": 233,
        "message1": "",
        "message2": "",
        "mpCost": 0,
        "name": "Rose Bond",
        "note": "",
        "occasion": 3,
        "repeats": 1,
        "requiredWtypeId1": 0,
        "requiredWtypeId2": 0,
        "scope": 1,
        "speed": 0,
        "stypeId": 3,
        "successRate": 100,
        "tpCost": 0,
        "tpGain": 0,
        "messageType": 1
    }

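The regex shouldn't be restricting anything, right? I thought my changes caused it, but the original version seems to behave the same. Only these texts get extracted. What's going on? transDic.output.json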

satan53x commented 3 months ago

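The regex is wrong. By default (without regex flags) . does not match a newline, and your text contains \n; you generally need [\s\S]. JSON differs from TXT here: a \n shown in json is a single character, whereas a \n shown in txt is actually \\n.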
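A short check with Python's re module illustrates this. The sample string is an abbreviated stand-in for the description above; json.loads turns the two-character \n escape into a real newline.

```python
import json
import re

# After JSON decoding, the escape sequence is one newline character.
text = json.loads('"Willpower is restored by 1\\nfor every 5 Rose"')

dot_match = re.search(r"^(.+)$", text)                # '.' stops at the newline
any_match = re.search(r"^([\s\S]+)$", text)           # [\s\S] also matches newlines
dotall_match = re.search(r"^(.+)$", text, re.DOTALL)  # a flag makes '.' match them too
```

The first search finds nothing because .+ cannot cross the newline and $ does not match in the middle of the string, while the other two capture the whole description.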

daiaji commented 3 months ago

Map005.json How do I match In order to kill Richard, I first have to find him. But I have no leads at all, where do I even start...?

10_search=([\S\s]+)
extractKey=description,name,parameters

It doesn't seem to work...

satan53x commented 3 months ago

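Whether a code is extracted has to be configured, and the default command table doesn't include "code": 357. You need to add it yourself to the EVENT_COMMAND_CODES variable in the extract_RPGMV.py script.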

daiaji commented 3 months ago

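Whether a code is extracted has to be configured, and the default command table doesn't include "code": 357. You need to add it yourself to the EVENT_COMMAND_CODES variable in the extract_RPGMV.py script.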

OK！

daiaji commented 3 months ago

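Whether a code is extracted has to be configured, and the default command table doesn't include "code": 357. You need to add it yourself to the EVENT_COMMAND_CODES variable in the extract_RPGMV.py script.

I modified EVENT_COMMAND_CODES. The rule I used is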

10_search: '^([\S\s]+)$'
extractKey: [description, name, parameters, title]

The output contains <paramet>. Is that expected?

"<code657>questID = 19": "<code657>questID = 19",
"<paramet>Confer with Desmond": "<paramet>Confer with Desmond",
"<code657>description = Confer with Desmond": "<code657>description = Confer with Desmond",
"<code657>questID = 21": "<code657>questID = 21",
"<code657>objectiveID = 4": "<code657>objectiveID = 4",
"<paramet>Obtain the newly produced form from the merchant.": "<paramet>Obtain the newly produced form from the merchant.",
"<code657>description = Obtain the newly produced form from the merch…": "<code657>description = Obtain the newly produced form from the merch…",
"<code657>questID = 25": "<code657>questID = 25",
"<paramet>Return to the Alchemist.": "<paramet>Return to the Alchemist.",
"<code657>description = Return to the Alchemist.": "<code657>description = Return to the Alchemist.",

satan53x commented 3 months ago

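The output contains <paramet>. Is that expected?

That should be fine. The part inside <> is the parent node's name, normalized to a fixed length; paramet is simply parameters abbreviated. (Only the code tags get special handling: the text after them is actually a sibling of its parent node.)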

 "<paramet>Obtain the newly produced form from the merchant.": "<paramet>Obtain the newly produced form from the merchant.",

The key of the Obtain string itself should be description; it's the parent node that is named parameters.
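As an illustration of the fixed-length tag, here is a hypothetical sketch. The width of 7 is an assumption inferred from <paramet> and <code657> in the output above, and the space padding is inferred from the ^<root\s+> pattern in the rules; neither is confirmed by the SExtractor source in this thread.

```python
TAG_WIDTH = 7  # assumed width, inferred from "<paramet>" and "<code657>"

def make_tag(parent_name, width=TAG_WIDTH):
    """Truncate or pad the parent node name to a fixed width inside <>."""
    return "<" + parent_name[:width].ljust(width) + ">"
```

Under these assumptions make_tag("parameters") gives <paramet>, and make_tag("root") gives <root followed by three spaces and >, which the ^<root\s+> pattern would match.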

daiaji commented 3 months ago

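Does the JSON merge still need changes?

I don't think the JSON merge needs changes.
  files:
    - name_patterns: ["Armors.json", "Weapons.json", "Items.json", "Skills.json"]
      sample:
        10_search: ^<root\s+>(.+?)$
        extractKey: [description, name]
This name_patterns could go inside sample, which keeps the format more uniform and removes one level of nesting. Right now there are two shapes, files-file-sample and sample; they could become samples-sample and sample, with name_patterns as just another parameter under sample. (Rename it to something like fileMatch or filenameMatch so it isn't confused with the name of the text.) You're currently still passing name_patterns in through args; it would be simpler to store it in var as a parameter.
  samples:
    - 10_search: ^<root\s+>(.+?)$
      extractKey: 'description,name'
      filenameMatch: '^(Armors|Weapons|Items|Skills)'
And since you're already using wildcards, you might as well use regexes directly. Otherwise users have to tell wildcards and regexes apart, which isn't consistent.

With filenameMatch supporting regexes, exclude_patterns should no longer be needed, right? File-name handling shouldn't get that complex. At most, add special default handling for .*.

gpt4o: Keep it. Matching file names with overly complex regexes will only make things more painful.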
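To make the trade-off concrete: without exclude_patterns, a catch-all entry has to encode its exclusions inside its own filenameMatch, typically with a negative lookahead. The pattern below is a hypothetical example of what users would then need to write and maintain.

```python
import re

# Matches any .json file except the four database files named explicitly.
catch_all = re.compile(r"^(?!(Armors|Weapons|Items|Skills)\.json$).*\.json$")

matches_map = catch_all.search("Map001.json") is not None
matches_items = catch_all.search("Items.json") is not None
```

It works (matches_map is true and matches_items is false), but every new explicit entry means editing this lookahead as well, which is the kind of complexity the remark above warns about.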

daiaji commented 3 months ago

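The output contains <paramet>. Is that expected?

That should be fine. The part inside <> is the parent node's name, normalized to a fixed length; paramet is simply parameters abbreviated. (Only the code tags get special handling: the text after them is actually a sibling of its parent node.)
 "<paramet>Obtain the newly produced form from the merchant.": "<paramet>Obtain the newly produced form from the merchant.",
The key of the Obtain string itself should be description; it's the parent node that is named parameters.

Can the displayName key not be extracted either? Map028.json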

satan53x commented 3 months ago

10_search=^<root\s+>(.+?)$
extractKey=displayName

It does get extracted, doesn't it?

[
  {
    "message": "クロス村 ~宿屋1F酒場~"
  }
]

daiaji commented 3 months ago

10_search=^<root\s+>(.+?)$
extractKey=displayName

It does get extracted, doesn't it?

[
  {
    "message": "クロス村 ~宿屋1F酒場~"
  }
]

I must have been blind. I looked again and it was indeed extracted. 🙃

satan53x / SExtractor

Ren'Py extraction error. #87
