seanmacavaney opened this issue 5 months ago
Thanks Sean, this is an interesting concept.
TerrierIndex is a bit of a complicated decision, as you know.
Could we have a branch of pyterrier_pisa using this functionality so we can road-test the API? And a guide on how to add a new artifact?
The new pyterrier-quality repo has an example. The key bits are:

Some things I'm still considering:

- `_try_load` stuff. Artifacts need the metadata to be loaded. This will be the primary case, which simplifies the implementation of a new artifact.
- `Artifact.from_hf(dataset_id)`, which just calls `Artifact.from_hf(f'hf:{dataset_id}')`.
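A minimal sketch of that convenience wrapper. Since the delegation target isn't spelled out above, I'm assuming (hypothetically) that the `hf:`-prefixed form is handled by a generic `Artifact.load`; the real method names and class layout may differ:

```python
class Artifact:
    """Illustrative stub of the artifact base class (not the real API)."""

    def __init__(self, url: str):
        self.url = url

    @classmethod
    def load(cls, url: str) -> 'Artifact':
        # Hypothetical generic loader that dispatches on the URL scheme
        # (e.g. 'hf:' for Hugging Face); stubbed out here for illustration.
        return cls(url)

    @classmethod
    def from_hf(cls, dataset_id: str) -> 'Artifact':
        # Convenience wrapper: prefix the scheme and delegate.
        return cls.load(f'hf:{dataset_id}')
```

The point of the wrapper is purely ergonomic: users who know they're loading from Hugging Face don't need to spell out the scheme themselves.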
The `artifact` branch is now on pyterrier-pisa: https://github.com/terrierteam/pyterrier_pisa/tree/artifact
And on pyterrier-dr: https://github.com/terrierteam/pyterrier_dr/tree/artifact
I'm not sure what the `entry_point` stuff is for. Can you explain it simply? Is it just a code discovery mechanism for Python?
What use case does an Artefact address? Is it so that if I don't know the class that I'm looking for to get an index or something, I can still load a factory object? But if I don't know its class, I don't know what (factory) methods it supports.
Entry points act as a registry of all the artifacts installed. Since they're metadata about the package itself, they do not involve loading any modules at runtime to establish what's registered.
Here's a short document on the use case for artifacts: https://gist.github.com/seanmacavaney/ceac1b5eacaac4b072caa69986089ff4
Beyond the use cases outlined in the document, as you mention, it also simplifies the loading of artifacts. This is similar to how `AutoModel` simplifies loading models in Hugging Face. Oftentimes you're already specifying the name of what you're loading in the ID itself, so it's annoying and redundant to write it out again. For instance:
```python
import pyterrier as pt
pt.init()
from pyterrier_pisa import PisaIndex
index = PisaIndex.from_hf('pyterrier/msmarco-passage.pisa')

# vs

import pyterrier as pt
pt.init()
index = pt.Artifact.from_hf('pyterrier/msmarco-passage.pisa')
```
The metadata file provided by the artifact specification also gives a hint about which package you need to install to load the artifact. So in the above example, if pyterrier-pisa isn't installed, it could give an error message saying that you need to install this package to load the index. (This isn't implemented yet, but the metadata is there.)
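A hypothetical sketch of that not-yet-implemented error path. The metadata schema (a `type` key in `metadata.json`) and the type-to-package mapping below are assumptions for illustration, not the actual specification:

```python
import json
import pathlib
from typing import Optional

# Hypothetical mapping from an artifact 'type' recorded in metadata.json
# to the pip package that can load it.
PACKAGE_HINTS = {
    'pisa_index': 'pyterrier-pisa',
    'dense_index': 'pyterrier-dr',
}

def package_hint(artifact_dir: str) -> Optional[str]:
    """Return the pip package needed to load this artifact, based on
    its metadata file, or None if it cannot be determined."""
    meta_path = pathlib.Path(artifact_dir) / 'metadata.json'
    if not meta_path.exists():
        return None
    meta = json.loads(meta_path.read_text())
    return PACKAGE_HINTS.get(meta.get('type'))
```

The loader could then raise an error like "install `pyterrier-pisa` to load this artifact" instead of a bare import failure.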
> But if I don't know its class, I don't know what (factory) methods it supports.
This is also true with Hugging Face's `AutoModel`. You can always do `help(index)` to get documentation once you have an instance of an object, but when you only have an identifier, it might be challenging to find the right artifact class to load it.
A prototype of the artifact API is in pyterrier-alpha. Integrated with extension packages:

Still to integrate:

- QualCache
- CorpusGraph
This is indeed a very cool concept. It would maybe also be cool if we could load artifact results such as runs from TIRA? They could maybe also be prefixed similarly to the `irds:...` or `hf:...` examples from above?
Sounds reasonable! The idea would be that it would detect if it was loading a run file (or similar) and return it as a dataframe? We can experiment with this a bit on the implementation in pyterrier-alpha.
Yes, I think this style of magic (we likely should introduce a verbose flag :)) would be quite cool. Automatically detecting that an output is a run file should be no problem, as in TIRA they are always expected to produce a run.txt, which could be easily captured in combination with the scoped prefix.
For results, it might make more sense to have a special URI-style format for loading with `pt.io.read_results`? E.g., `pt.io.read_results('tirex:<task>/<team>/<approach>/<dataset>')`?
Yes, sounds very good.
I think a cool way could also be if one could directly pass a dataset id from ir_datasets, i.e., that the mapping from dataset id to TIRA task is done internally. The dataset id might contain `/` characters, but I think this would be no problem, as it would imply a hierarchical structure of the task, which is, I think, a valid viewpoint.
E.g., if we have the irds id clueweb12/touche-2020-task-2, a call would look like:
`pt.io.read_results('tirex:clueweb12/touche-2020-task-2/<team>/<approach>')`
We could maybe also think about listing of results. E.g., if I call something like:
`pt.io.read_results('tirex:clueweb12/touche-2020-task-2/<team>')`
it could print out all public approaches by the team and then fail; or if I call:
`pt.io.read_results('tirex:clueweb12/touche-2020-task-2/')`
it could print all public approaches and then fail.
I would start to play around a bit in pyterrier-alpha.
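A rough sketch of how that `tirex:` URI could be parsed. The component order follows the examples above; for illustration I assume a two-segment ir_datasets id (the real implementation would presumably resolve dataset ids against ir_datasets/TIRA rather than counting segments), and the function name is made up:

```python
def parse_tirex(uri: str) -> dict:
    """Parse a hypothetical 'tirex:<dataset-id>/<team>/<approach>' URI.

    Assumes (for this sketch only) that dataset ids have exactly two
    '/'-separated segments, e.g. 'clueweb12/touche-2020-task-2'.
    """
    prefix = 'tirex:'
    if not uri.startswith(prefix):
        raise ValueError(f'not a tirex URI: {uri!r}')
    parts = [p for p in uri[len(prefix):].split('/') if p]
    if len(parts) < 4:
        # Incomplete path: the caller would list the public approaches
        # matching this prefix and then fail, as suggested above.
        return {'complete': False, 'prefix': parts}
    return {
        'complete': True,
        'dataset': '/'.join(parts[:2]),
        'team': parts[2],
        'approach': parts[3],
    }
```

The listing-then-fail behaviour maps naturally onto the incomplete case: when `complete` is false, print the matches and raise.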
Cool, I have a first rough prototype (it did not require any changes to the pyterrier-alpha codebase, and only minor additions to the tira client) so that this test case works:
https://github.com/mam10eks/pyterrier-alpha/blob/main/tests/test_artifacts_from_tira.py
In principle (plus/minus potentially changed design decisions, documentation, and more unit tests), this is it :)
As a heads up -- I've replaced this branch with a version taken from alpha.
The artifact-old branch records the state before the force push.
I think the current commit omits the TerrierIndex artefact.
Good catch! I had forgotten that it was done in the old one.
Thanks! From what I can tell, it looks like the correct return type annotation for the context managers (described here) is `Generator[X, None, None]`. (It's a bummer because the annotation makes it look like it's used as a generator rather than how it's actually used, as a context manager :/.)
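For concreteness, the annotation describes the generator function that `@contextmanager` wraps: it yields `X`, receives nothing via `send` (the first `None`), and returns nothing (the second `None`). A small self-contained example (the function name is illustrative):

```python
from contextlib import contextmanager
from typing import Generator

@contextmanager
def artifact_dir(path: str) -> Generator[str, None, None]:
    # The Generator[str, None, None] annotation applies to this
    # underlying generator function, not to the context manager object
    # that @contextmanager builds from it.
    yield path

with artifact_dir('/tmp/example') as p:
    print(p)  # the yielded value is bound by the 'with' statement
```

`typing.Iterator[str]` is also accepted by type checkers here, since a generator that never sends or returns is a plain iterator, which reads a little less misleadingly.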
WIP
Example: