terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Artifact API #436

Open seanmacavaney opened 5 months ago

seanmacavaney commented 5 months ago

WIP

Example:

import pyterrier as pt ; pt.init()
index = pt.Artifact.from_url('hf:macavaney/msmarco-passage.terrier')
# TerrierIndex('/Users/sean/.pyterrier/artifacts/7ead118630437940852142386f67ab62123a6ce372bb4b8cf12b06a76c8ccc25' <from 'https://huggingface.co/datasets/macavaney/msmarco-passage.terrier/resolve/main/artifact.tar.lz4'>)
index.bm25() # -> TerrierRetrieve

# maintain support for centralized from_dataset
index = pt.Artifact.from_dataset('msmarco_document', 'terrier_stemmed') # maps to hf:macavaney/pyterrier-from-dataset@msmarco_document.terrier_stemmed

cmacdonald commented 5 months ago

Thanks Sean, this is an interesting concept.

TerrierIndex is a bit of a complicated decision, as you know.

Could we have a branch of pyterrier_pisa using this functionality so we can road-test the API? And a guide on how to add a new artifact?

seanmacavaney commented 5 months ago

The new pyterrier-quality repo has an example.

The key bits are:

Some things I'm still considering:

seanmacavaney commented 5 months ago

artifact branch now on pyterrier-pisa: https://github.com/terrierteam/pyterrier_pisa/tree/artifact

seanmacavaney commented 5 months ago

And on pyterrier-dr: https://github.com/terrierteam/pyterrier_dr/tree/artifact

cmacdonald commented 5 months ago

I'm not sure what the entry_point stuff is for. Can you explain it simply? Is that just a code discovery mechanism for Python?

What use case does an Artefact address? Is it so that, even if I don't know the class I need in order to get an index or something, I can still load a factory object? But if I don't know its class, I don't know what (factory) methods it supports.

seanmacavaney commented 5 months ago

Entry points act as a registry of all the artifacts installed. Since they're metadata about the package itself, they do not involve loading any modules at runtime to establish what's registered.
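
A minimal sketch of how such a registry can be read using only the standard library. The group name 'pyterrier.artifact' and the helper name list_registered are assumptions for illustration, not necessarily what PyTerrier uses. The point it shows is that the lookup reads installed package metadata and never imports the modules the entry points refer to.

```python
from importlib.metadata import entry_points


def list_registered(group: str) -> dict:
    """Map entry point names to their 'module:attr' targets.

    Only package metadata is read here; the target modules are not
    imported until someone calls .load() on an entry point.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selection API.
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9 return a plain dict of group -> entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    # 'pyterrier.artifact' is a hypothetical group name for this sketch.
    for name, target in list_registered("pyterrier.artifact").items():
        print(name, "->", target)
```

If no installed package declares the group, the result is simply an empty dict.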

Here's a short document on the use case for artifacts: https://gist.github.com/seanmacavaney/ceac1b5eacaac4b072caa69986089ff4

Beyond the use cases outlined in the document, as you mention, it also simplifies the loading of artifacts. This is similar to how AutoModel simplifies loading models in huggingface. Oftentimes you're already specifying the name of what you're loading in the ID itself, so it's annoying and redundant to write it out again. For instance:

import pyterrier as pt
pt.init()
from pyterrier_pisa import PisaIndex
index = PisaIndex.from_hf('pyterrier/msmarco-passage.pisa')
# vs
import pyterrier as pt
pt.init()
index = pt.Artifact.from_hf('pyterrier/msmarco-passage.pisa')

The metadata file provided by the artifact specification also gives a hint about what package you need to install to load the artifact. So in the above example, if pyterrier-pisa isn't installed, it could give an error message saying that you need to install this package to load the index. (This isn't implemented yet, but the metadata is there.)
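
As a rough sketch of what that check might look like: the 'package_hint' key and the helper name missing_package_message are assumptions made for this example, not the actual metadata schema, and the conversion from distribution name to module name is a simplification.

```python
import importlib.util
from typing import Optional


def missing_package_message(meta: dict) -> Optional[str]:
    """Return an install hint if the package named in the metadata is absent.

    `meta` stands in for the parsed artifact metadata file; the
    'package_hint' key is a hypothetical field name.
    """
    hint = meta.get("package_hint")
    if hint is None:
        return None
    # Simplifying assumption: the importable module name is the
    # distribution name with dashes replaced by underscores.
    module_name = hint.replace("-", "_")
    if importlib.util.find_spec(module_name) is None:
        return (
            f"This artifact needs the '{hint}' package. "
            f"Try: pip install {hint}"
        )
    return None


if __name__ == "__main__":
    print(missing_package_message({"package_hint": "pyterrier-pisa"}))
```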

But if I don't know its class, I don't know what (factory) methods it supports.

This is also true with huggingface's AutoModel. You can always do help(index) to get documentation once you have an instance of an object, but when you only have an identifier, it might be challenging to find the right artifact class to load it.

seanmacavaney commented 2 months ago

A prototype of the artifact API is in pyterrier-alpha. Integrated with extension packages:

Still to integrate:

mam10eks commented 1 month ago

This is indeed a very cool concept. It would also be cool if we could load artifact-like results, such as runs from TIRA. Maybe these could also be prefixed, similar to irds:... or the hf:... example above?

seanmacavaney commented 1 month ago

Sounds reasonable! The idea would be that it would detect if it was loading a run file (or similar) and return it as a dataframe? We can experiment with this a bit on the implementation in pyterrier-alpha.

mam10eks commented 1 month ago

Yes, I think this style of magic (we likely should introduce a verbose flag :)) would be quite cool. Automatically detecting that an output is a run file should be no problem: in TIRA, approaches are always expected to produce a run.txt, which could easily be captured in combination with the scoped prefix.

seanmacavaney commented 1 month ago

For results, it might make more sense to have a special URI-style format for loading with pt.io.read_results? E.g., pt.io.read_results('tirex:<task>/<team>/<approach>/<dataset>')?
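
To make the proposed format concrete, here is a small parsing sketch. Nothing here exists in PyTerrier: the function name parse_results_uri and the ResultsRef type are made up for illustration, and the placeholder task, team, and approach names in the usage line are hypothetical. Because the dataset comes last, splitting at most three times lets the dataset component keep any / characters it contains.

```python
from typing import NamedTuple


class ResultsRef(NamedTuple):
    """Components of a 'tirex:<task>/<team>/<approach>/<dataset>' string."""
    task: str
    team: str
    approach: str
    dataset: str


def parse_results_uri(uri: str, scheme: str = "tirex") -> ResultsRef:
    """Split a URI-style results reference into its four components."""
    prefix, sep, rest = uri.partition(":")
    if not sep or prefix != scheme:
        raise ValueError(f"expected a '{scheme}:' prefix, got {uri!r}")
    # Split only on the first three slashes so the trailing dataset
    # component may itself contain '/'.
    parts = rest.split("/", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"expected {scheme}:<task>/<team>/<approach>/<dataset>, got {uri!r}"
        )
    return ResultsRef(*parts)


if __name__ == "__main__":
    # All component values below are placeholders.
    print(parse_results_uri("tirex:some-task/some-team/some-approach/some/dataset"))
```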

mam10eks commented 1 month ago

Yes, sounds very good.

I think it would also be cool if one could directly pass a dataset ID from ir_datasets, i.e., the mapping from dataset ID to TIRA task would be done internally. The dataset ID might contain / characters, but I think this would be no problem, as it would imply a hierarchical structure of the task, which I think is a valid viewpoint.

E.g., if we have the irds ID clueweb12/touche-2020-task-2, a call would look like:

We could maybe also think about listing of results. E.g., if I call something like:

it could print out all public approaches by the team and then fail, or if I call:

It could print all public approaches and then fail.

mam10eks commented 1 month ago

I will start playing around a bit in pyterrier-alpha.

mam10eks commented 1 month ago

Cool, I have a first rough prototype (it did not require any change in the pyterrier-alpha codebase, and only minor additions to the tira client) so that this test case works:

https://github.com/mam10eks/pyterrier-alpha/blob/main/tests/test_artifacts_from_tira.py

In principle (plus/minus potentially changed design decisions and documentation and more unit tests), this is it :)

seanmacavaney commented 1 week ago

As a heads up -- I've replaced this branch with a version taken from alpha

The artifact-old branch records the state before the force push.

cmacdonald commented 1 week ago

I think the current commit omits the TerrierIndex artefact.

seanmacavaney commented 1 week ago

Good catch! I had forgotten that it was done in the old one.

seanmacavaney commented 1 week ago

Thanks! From what I can tell, the correct return type annotation for context managers (described here) is Generator[X, None, None]. (It's a bummer, because the annotation makes it look like a generator rather than a context manager, which is how it's actually used :/.)
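
For illustration, a minimal standalone example of that annotation on a function decorated with contextlib.contextmanager. The function scratch_dir is made up for this sketch and is not taken from the branch.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Generator


@contextmanager
def scratch_dir() -> Generator[str, None, None]:
    """Yield the path of a temporary directory, removed again on exit.

    The annotation describes the undecorated generator function:
    it yields a str, accepts nothing via send(), and returns nothing.
    """
    with tempfile.TemporaryDirectory() as path:
        yield path


if __name__ == "__main__":
    # Callers use it as a context manager, not as a generator.
    with scratch_dir() as path:
        print(os.path.isdir(path))
```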