hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
https://pypi.org/project/opuscleaner/

run.py fails with exported json when unpacking filenames #65

Closed XapaJIaMnu closed 1 year ago

XapaJIaMnu commented 1 year ago

Example generated json:

{
    "version": 1,
    "files": [
        [
            "ELRC-3056-wikipedia_health-v1.en-zh.en.gz",
            "ELRC-3056-wikipedia_health-v1.en-zh.zh.gz"
        ]
    ],
    "filters": [
        {
            "id": 6,
            "filter": "remove_empty_lines",
            "language": null,
            "parameters": {}
        },
        {
            "id": 9,
            "filter": "segment_chinese",
            "language": "zh",
            "parameters": {}
        },
        {
            "id": 11,
            "filter": "alpha_ratio",
            "language": null,
            "parameters": {
                "LANG1": "en",
                "LANG2": "zh",
                "SRCWORDRAT": 0.4,
                "TRGWORDRAT": 0.4,
                "SRCALPHARAT": 0.5,
                "TRGALPHARAT": "0.2",
                "DEBUG": false
            }
        },
        {
            "id": 13,
            "filter": "desegment_chinese",
            "language": "zh",
            "parameters": {}
        }
    ]
}

When doing ./run.py filters.yaml -b data/train-parts/ I get:

Traceback (most recent call last):
  File "./run.py", line 543, in <module>
    main(sys.argv[1:])
  File "./run.py", line 472, in main
    languages: List[str] = args.languages if args.input else [filename.rsplit('.', 2)[1] for filename in pipeline_config['files']]
  File "./run.py", line 472, in <listcomp>
    languages: List[str] = args.languages if args.input else [filename.rsplit('.', 2)[1] for filename in pipeline_config['files']]
AttributeError: 'list' object has no attribute 'rsplit'

The issue is that the filenames are packed in a nested list (a list containing one list of filenames), but run.py only unpacks one level, so each `filename` is itself a list: [['ELRC-3056-wikipedia_health-v1.en-zh.en.gz', 'ELRC-3056-wikipedia_health-v1.en-zh.zh.gz']]

That's fairly easy to fix, but I'm not sure what the desired behavior is, hence opening the bug report.
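
For illustration, here is a minimal sketch of one possible workaround on the run.py side, assuming we want to tolerate both the flat and the nested form. The helper name `extract_languages` is hypothetical and not part of OpusCleaner; it only mirrors the `rsplit('.', 2)[1]` logic from the traceback.

```python
from typing import List, Sequence, Union


def extract_languages(files: Sequence[Union[str, Sequence[str]]]) -> List[str]:
    """Hypothetical helper: return the language code of each filename.

    Accepts either a flat list of filenames or a list nested one level deep.
    """
    flat: List[str] = []
    for entry in files:
        if isinstance(entry, str):
            flat.append(entry)   # already a filename
        else:
            flat.extend(entry)   # unpack one level of nesting
    # 'name.en-zh.en.gz'.rsplit('.', 2) gives ['name.en-zh', 'en', 'gz']
    return [filename.rsplit('.', 2)[1] for filename in flat]


nested = [[
    "ELRC-3056-wikipedia_health-v1.en-zh.en.gz",
    "ELRC-3056-wikipedia_health-v1.en-zh.zh.gz",
]]
print(extract_languages(nested))  # ['en', 'zh']
```

Whether run.py should accept the nested form at all, or reject it with a clearer error, is exactly the open question.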

jelmervdl commented 1 year ago

How did you generate that json? The schema is described in the FilterPipeline class, and that says List[str]. I don't understand how you'd get a list in a list for files:

class FilterPipeline(BaseModel):
    version: Literal[1]
    files: List[str]
    filters: List[FilterStep]
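
To check whether an exported file matches that `List[str]` expectation without importing OpusCleaner, something like the following stdlib-only sketch could be used. The function `check_files_field` and the shortened filenames are made up for the example; this is not how OpusCleaner itself validates pipelines.

```python
import json
from typing import Any


def check_files_field(pipeline: Any) -> None:
    """Hypothetical check: raise ValueError unless 'files' is a flat list of strings."""
    files = pipeline.get("files") if isinstance(pipeline, dict) else None
    if not isinstance(files, list):
        raise ValueError("'files' must be a list")
    for index, entry in enumerate(files):
        if not isinstance(entry, str):
            raise ValueError(
                f"files[{index}] should be a str, got {type(entry).__name__}"
            )


# Same shape as the exported json in this report, with shortened names.
exported = json.loads(
    '{"version": 1, "files": [["a.en.gz", "a.zh.gz"]], "filters": []}'
)
try:
    check_files_field(exported)
except ValueError as err:
    print(err)  # files[0] should be a str, got list
```
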

The only place files is populated is this bit:

def make_pipeline(name, filters=[]):
    columns = list_datasets(DATA_PATH)[name]
    return FilterPipeline(
        version=1,
        files=[file.name
            for _, file in
            sorted(columns.items(), key=lambda pair: pair[0])
        ],
        filters=filters
    )

XapaJIaMnu commented 1 year ago

I just exported it via export json on the gui

jelmervdl commented 1 year ago

Oooh you're right! That's a UI bug. Should be fixed now.