hltcoe / patapsco

Cross language information retrieval pipeline

Closes #10 Adds a public API to Patapsco #18

cash closed 2 years ago

cash commented 2 years ago

@eugene-yang The first commit lets users define a reranker outside of Patapsco and use it in an experiment:

import copy
import logging
import random

import patapsco

LOGGER = logging.getLogger(__name__)

class CashReranker(patapsco.Reranker):
    def process(self, results):
        print("Cash reranker!")
        new_results = copy.deepcopy(results.results)
        random.shuffle(new_results)
        return patapsco.Results(results.query, results.doc_lang, 'CashReranker', new_results)

patapsco.RerankFactory.register('cash', CashReranker)
runner = patapsco.Runner("cash.yml")
runner.run()

Logging does not work as expected because the logger's namespace is not under patapsco. I probably need to add a utility method to get a logger from Patapsco.

This also doesn't support customized text processing yet.
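
The namespace issue can be reproduced with the standard library alone. This is a minimal sketch, assuming Patapsco attaches its handler and level to the "patapsco" logger (an assumption, not confirmed in this thread); the names "my_script" and "patapsco.cash" are only illustrative.

```python
import io
import logging

# Stand-in for the library's setup: a handler and level configured on the
# "patapsco" logger only (an assumption about how Patapsco sets up logging).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
library_logger = logging.getLogger("patapsco")
library_logger.setLevel(logging.INFO)
library_logger.addHandler(handler)
library_logger.propagate = False  # keep the demo output out of root handlers

# A logger outside the namespace, like logging.getLogger(__name__) in a script.
# It inherits the root logger's level and never reaches the handler above.
outside = logging.getLogger("my_script")
outside.info("dropped message")

# A child logger inside the namespace inherits the level and the handler.
inside = logging.getLogger("patapsco.cash")
inside.info("kept message")

captured = stream.getvalue()
print(captured, end="")
```

Only the message from the logger under the "patapsco" namespace reaches the handler, which is why a helper that hands out loggers from inside that namespace would help.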

cash commented 2 years ago

@eugene-yang Here is the updated code that works with this branch:

import copy
import random

import patapsco

config = {
    "run": {
        "name": "Cash's reranker"
    },
    "documents": {
        "input": {
            "format": "json",
            "lang": "eng",
            "encoding": "utf8",
            "path": "samples/data/eng_mini_docs.jsonl",
        },
        "process": {
            "normalize": {
                "lowercase": True,
            },
            "tokenize": "whitespace",
            "stem": "porter"
        },
        "comment": "Mini English dataset",
    },
    "database": {
        "name": "sqlite"
    },
    "index": {
        "name": "lucene"
    },
    "topics": {
        "input": {
            "format": "json",
            "lang": "eng",
            "source": "original",
            "encoding": "utf8",
            "path": "samples/data/eng_mini_topics.jsonl"
        },
        "fields": "title"
    },
    "queries": {
        "process": {
            "normalize": {
                "lowercase": True,
            },
            "tokenize": "whitespace",
            "stem": "porter"
        }
    },
    "retrieve": {
        "name": "bm25",
        "number": 5
    },
    "rerank": {
        "name": "cash"
    },
    "score": {
        "input": {
            "path": "samples/data/eng_mini_qrels"
        }
    }
}

class CashReranker(patapsco.Reranker):
    LOGGER = patapsco.get_logger("cash")

    def process(self, results):
        self.LOGGER.info("Cash reranker!")
        new_results = copy.deepcopy(results.results)
        random.shuffle(new_results)
        return patapsco.Results(results.query, results.doc_lang, 'CashReranker', new_results)

patapsco.RerankFactory.register('cash', CashReranker)
runner = patapsco.Runner(config)
runner.run()
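
For readers unfamiliar with the registration step, here is a rough sketch of how a name-to-class registry of this kind can work. The classes below are hypothetical stand-ins written for illustration, not Patapsco's implementation; in particular the `create` method and the `ReverseReranker` class are assumptions.

```python
class Reranker:
    """Hypothetical base class standing in for patapsco.Reranker."""

    def process(self, results):
        raise NotImplementedError


class RerankFactory:
    """Hypothetical name-to-class registry (not Patapsco's actual code)."""

    _registry = {}

    @classmethod
    def register(cls, name, reranker_class):
        # Map the name used in the config to the class that implements it.
        cls._registry[name] = reranker_class

    @classmethod
    def create(cls, name):
        # Look up the class by its configured name and instantiate it.
        if name not in cls._registry:
            raise ValueError(f"Unknown reranker: {name}")
        return cls._registry[name]()


class ReverseReranker(Reranker):
    """Toy reranker that reverses the order of the results it is given."""

    def process(self, results):
        return list(reversed(results))


RerankFactory.register("reverse", ReverseReranker)
reranker = RerankFactory.create("reverse")
reranked = reranker.process(["d1", "d2", "d3"])
```

With a registry like this, the "name" value under "rerank" in the config only has to match the string passed to `register`.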

cash commented 2 years ago

I'm going to merge this in. We can continue building out the public API this fall.

eugene-yang commented 2 years ago

Looks like the logger is preventing Patapsco from running multiple times in one Python session.

In run.py, the initialization always adds new handlers to the logger. Each run adds another StreamHandler, and after the first run the second handler in the list is a FileHandler.
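
The handler accumulation described above can be shown with the standard library. This is a sketch of the general pattern and of one common guard, not the code in run.py and not necessarily how it was later fixed; the function and logger names are hypothetical.

```python
import logging


def configure_naive(name):
    """Adds a new StreamHandler on every call (the problematic pattern)."""
    logger = logging.getLogger(name)
    logger.addHandler(logging.StreamHandler())
    return logger


def configure_guarded(name):
    """Adds a StreamHandler only if the logger has no handlers yet."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    return logger


# Simulate two runs in the same Python session.
naive = configure_naive("demo.naive")
configure_naive("demo.naive")
guarded = configure_guarded("demo.guarded")
configure_guarded("demo.guarded")

naive_count = len(naive.handlers)
guarded_count = len(guarded.handlers)
print(naive_count, guarded_count)
```

Because `logging.getLogger` returns the same logger object for a given name within a session, handlers added during one run are still attached when the next run starts.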

cash commented 2 years ago

@eugene-yang okay, looking into that.

cash commented 2 years ago

@eugene-yang fixed this in master using #21. Let me know if you run into any other issues when using it as a library.