hltcoe / patapsco

Cross language information retrieval pipeline

Closes #3 updates to new topic format #8

Closed by cash 2 years ago

cash commented 2 years ago

First breaking commit as we work toward the 1.0.0 release

eugene-yang commented 2 years ago

It looks like when we use English/original queries, patapsco checks whether English is listed in the topic's languages_with_qrels. I don't think that is the intended behavior: the language of the query is not necessarily one of the languages that have qrels.

With a topic like this,

{"topic_id": "5", "languages_with_qrels": ["zho"], "topics": [{"lang": "eng", "source": "original", ... 

Currently, patapsco with the setting below will not pick up topic 5, even though this is the essential config for PSQ.

    "topics": {
        "input": {
            "format": "json",
            "lang": "eng",
            "source": "original",
            "encoding": "utf8",
            "path": "../data/dev.topics.v1-0.jsonl"
        },
        "fields": "title"
    },

If the topic is changed to this, it passes the check:

{"topic_id": "5", "languages_with_qrels": ["zho", "eng"], "topics": [{"lang": "eng", "source": "original", ...

eugene-yang commented 2 years ago

It looks like when a partial run is used for retrieval, patapsco checks whether the language matches the one in .lang. But I don't think the .lang file is stored when a run finishes.

cash commented 2 years ago

@eugene-yang I'll check on the qrels. It should depend on the documents and not the topics.
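To make the difference concrete, here is a minimal sketch of the two checks being discussed. It is not patapsco's actual code: the function names `keep_topic_by_query_lang` and `keep_topic_by_doc_lang` are hypothetical, and the topic record is the one from the example above with the fields elided in this thread simply left out.

```python
import json

# Topic record from the example above; fields elided in the thread are omitted.
TOPIC_LINE = (
    '{"topic_id": "5", "languages_with_qrels": ["zho"], '
    '"topics": [{"lang": "eng", "source": "original"}]}'
)


def keep_topic_by_query_lang(topic, query_lang):
    """Hypothetical version of the reported behavior: the query language must have qrels."""
    return query_lang in topic["languages_with_qrels"]


def keep_topic_by_doc_lang(topic, doc_lang):
    """Hypothetical version of the suggested behavior: the document language must have qrels."""
    return doc_lang in topic["languages_with_qrels"]


topic = json.loads(TOPIC_LINE)

# English queries over Chinese documents (the PSQ setup described above).
print(keep_topic_by_query_lang(topic, "eng"))  # topic 5 is dropped
print(keep_topic_by_doc_lang(topic, "zho"))    # topic 5 is kept
```

With the check keyed on the document language, the topic file does not need "eng" added to languages_with_qrels for the English queries to be used.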

The .lang file is written to the lucene directory.
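For anyone checking this by hand, a rough sketch of the comparison follows. The only thing taken from this thread is that a .lang file lives in the lucene directory. That it holds a single language code as plain text is an assumption, and `read_index_lang` and `lang_matches` are hypothetical helpers, not patapsco functions.

```python
import tempfile
from pathlib import Path


def read_index_lang(index_dir):
    """Return the language stored in <index_dir>/.lang, or None if the file is missing.

    Assumes the file contains one language code as plain text (not confirmed here).
    """
    lang_file = Path(index_dir) / ".lang"
    if not lang_file.exists():
        return None
    return lang_file.read_text(encoding="utf8").strip()


def lang_matches(index_dir, expected_lang):
    """True only if a .lang file exists and its content equals expected_lang."""
    return read_index_lang(index_dir) == expected_lang


# Demo against a throwaway directory standing in for the lucene directory.
with tempfile.TemporaryDirectory() as index_dir:
    (Path(index_dir) / ".lang").write_text("zho\n", encoding="utf8")
    print(lang_matches(index_dir, "zho"))
    print(lang_matches(index_dir, "eng"))
```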

eugene-yang commented 2 years ago

Oops got it. :)

cash commented 2 years ago

@eugene-yang I hacked the master branch to turn off the qrel check. This is going to require a larger refactor to fix correctly.

eugene-yang commented 2 years ago

I used filter_lang to bypass that in the demo notebook, but I don't think that is the right way to do things.

cash commented 2 years ago

I think that is the best way for now, but I broke that when I removed the filtering. Let me know if you need that put back in master.