tira-io / tira

The source code for the TIRA Shared Task Platform
https://www.tira.io/
MIT License

Maintenance Job with Access to ir_datasets #356

Open mam10eks opened 1 year ago

mam10eks commented 1 year ago

The goal of this ticket is to allow jobs with access to ir_datasets.

mam10eks commented 1 year ago

Once this ticket is done, we can implement the following:

I think I would now also favor such multi-step jobs for all re-ranking approaches in information retrieval experiments.

My line of thinking is the following:

    For each IR benchmark, we add two datasets in TIRA:
        Full-Rank: the software has access to the complete corpus
        Re-Rank: a "pseudo dataset" where the software has access to the to-be-re-ranked query-document pairs

For the re-ranking scenario, one must also select which run to re-rank (out of all of one's own runs and all public runs).
If there is an official re-ranking run (e.g., for MS MARCO), we should define it as the default.
For all non-default runs, I would suggest the following:

    Users can select any public run on the dataset as the to-be-re-ranked run
    TIRA automatically wraps this run into an "ir-datasets-loader" job that becomes the previous stage of the re-ranking job
        I.e., if the run was never used before, the "ir-datasets-loader" job transforms it into the "standard" format of to-be-re-ranked query-document pairs
        If the run was used before, the standard multi-step mechanism described above kicks in, so the loader job is not executed again
    The re-ranking job itself then uses the to-be-re-ranked query-document pairs as "pseudo-input" (i.e., directly merged with the original input)
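
To make the proposed flow concrete, here is a minimal sketch of the run selection and the cached loader stage. Everything in it is an assumption for illustration, not TIRA's actual implementation: the function names (`select_run`, `parse_trec_run`, `rerank_input_for`), the file-based cache, and the JSON Lines output are hypothetical, and the real "ir-datasets-loader" output format may differ. The only external convention used is the common six-column TREC run format (`qid Q0 docid rank score tag`).

```python
import hashlib
import json
from pathlib import Path


def select_run(visible_run_ids, requested=None, default=None):
    """Pick the run to re-rank: the user's explicit choice, else the dataset default.

    `visible_run_ids` stands for the user's own runs plus all public runs (hypothetical).
    """
    chosen = requested if requested is not None else default
    if chosen is None or chosen not in visible_run_ids:
        raise ValueError(f"run {chosen!r} is not available for re-ranking")
    return chosen


def parse_trec_run(lines):
    """Turn TREC-style run lines into query-document pairs, skipping malformed lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        qid, _, docid, rank, score, _ = parts
        pairs.append({"qid": qid, "docid": docid, "rank": int(rank), "score": float(score)})
    return pairs


def rerank_input_for(run_id, run_lines, cache_dir):
    """Return (path, was_cached) for the re-rank input derived from `run_id`.

    The transformation runs only the first time a run is used; later jobs
    that build on the same run reuse the stored output (assumed file cache).
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the run id so that it is always a safe file name.
    key = hashlib.sha256(run_id.encode("utf-8")).hexdigest()[:16]
    out = cache / f"{key}.jsonl"
    if out.exists():
        return out, True
    with out.open("w", encoding="utf-8") as f:
        for pair in parse_trec_run(run_lines):
            f.write(json.dumps(pair) + "\n")
    return out, False
```

In this sketch, a second re-ranker that builds on the same run gets `was_cached == True` and never triggers the transformation again, which is the behaviour the multi-step mechanism is meant to provide.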

This has the advantage that we can provide a set of "default" runs to be re-ranked with appropriate documentation (e.g., BM25, or the judgment pool with a corresponding warning that it usually tends to overestimate effectiveness), while all other runs (even very costly ones such as mono/duoT5) can be used directly without adaptation. We have full flexibility per dataset and can show this transparently in the leaderboards. I think this provides a big benefit, because users can build upon even very costly systems in a fast and cost-efficient way. We should combine this with running some of the costly but important pipelines on all datasets (and make one or two A100 GPUs temporarily available in TIRA so that we can even run the largest versions?).

github-actions[bot] commented 10 months ago

This issue has been marked stale because it has been open 60 days with no activity.