Open mam10eks opened 2 months ago
For a first test, see https://github.com/mam10eks/pyterrier-alpha/blob/main/tests/test_artifacts_from_tira.py
The idea: a PyTerrier artifact such as tira://clueweb12/touche-2020-task-2/fschlatt/sparse-cross-encoder-4-512 is resolved via its name, i.e., we point it to https://data.tira.io/clueweb12/touche-2020-task-2/fschlatt/sparse-cross-encoder-4-512 and likewise for other artifacts.
@TheMrSheldon, @potthast: One thing that just came to my mind: when we implement things like this (pt being pyterrier):
pt.io.read_results('tira:clueweb12/touche-2020-task-2/<team>/<approach>')
pt.io.read_results('tira:clueweb12/touche-2020-task-2/<team>')
pt.io.read_results('tira:clueweb12/touche-2020-task-2/')
we could render the corresponding pages, i.e.,
https://data.tira.io/clueweb12/touche-2020-task-2/<team>/<approach>
https://data.tira.io/clueweb12/touche-2020-task-2/<team>
https://data.tira.io/clueweb12/touche-2020-task-2/
where all the endpoints can be browsed directly, i.e., https://data.tira.io/clueweb12/touche-2020-task-2/ would show all teams with all approaches, etc. This would be especially helpful when something does not exist, as the error message shown to users could point to the next fallback. I.e., if pt.io.read_results('tira:clueweb12/touche-2020-task-2/') does not exist, the error message could point users to the next higher level for browsing, i.e., https://data.tira.io/clueweb12/
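To make the mapping and the fallback idea concrete, here is a minimal sketch in plain Python. It is not part of the tira client or of PyTerrier; the function names (tira_id_to_url, fallback_urls, not_found_message) are made up for illustration, and it assumes that both the tira: and the tira:// prefix should map onto the same path below https://data.tira.io.

```python
from typing import List

# Base URL as mentioned in this issue; everything else below is hypothetical.
BASE_URL = "https://data.tira.io"


def _segments(identifier: str) -> List[str]:
    """Split 'tira:<path>' or 'tira://<path>' into its non-empty path segments."""
    if not identifier.startswith("tira:"):
        raise ValueError(f"Not a tira identifier: {identifier!r}")
    return [s for s in identifier[len("tira:"):].split("/") if s]


def tira_id_to_url(identifier: str) -> str:
    """Map a tira identifier to the page that would describe it."""
    return "/".join([BASE_URL, *_segments(identifier)])


def fallback_urls(identifier: str) -> List[str]:
    """Browsable parent pages, from the next higher level up to the root."""
    segments = _segments(identifier)
    return [
        "/".join([BASE_URL, *segments[:i]]) + "/"
        for i in range(len(segments) - 1, -1, -1)
    ]


def not_found_message(identifier: str) -> str:
    """Error text that points users to the next higher level for browsing."""
    # There is always at least the root page to fall back to.
    parent = fallback_urls(identifier)[0]
    return f"{identifier} does not exist. Browse what is available at {parent}"


if __name__ == "__main__":
    print(tira_id_to_url("tira:clueweb12/touche-2020-task-2/<team>/<approach>"))
    print(not_found_message("tira:clueweb12/touche-2020-task-2/"))
```

With something like this, a failed lookup for tira:clueweb12/touche-2020-task-2/ would point to https://data.tira.io/clueweb12/, and the same helper could back the pages rendered during the static build.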
I think this could be a first use case for the new V1 REST API. I.e., I think it would make sense to move the metadata on which approaches have been archived to Zenodo into the tira database (currently I keep this in the code of the Python client to avoid dependencies on the live system, but data.tira.io will be statically hosted, so this should be no problem). Once we have this in the tira database, we can make the endpoint that reports what is archived where public (the information is public anyway) and traverse this endpoint during the build of the static https://data.tira.io
What do you think?
This would also allow things like pt.io.read_results('tira:clueweb12/touche-2020-task-2/<team>/<approach>', verbose=True) to point to https://data.tira.io, which would be very cool. E.g., when you show this to someone, the tira:... identifier is a bit of a magic string, but as soon as someone wants to know more, we add the verbose=True flag and get very good explainability by default. Especially because we would have things like "what is in this artifact" by default, as we already store all the metadata on what is contained in a run (I mean our browser that shows "your run output contains files x, y, z").
I think this would combine very well.
On that note: if we also use data.tira.io to integrate visualizations (which I would be a big fan of), e.g., via DiffIR, we should link to ChatNoir for the full texts. E.g., https://chatnoir-webcontent.web.webis.de/?index=cw22&uuid=CWLafZMrWbCnXKvqA7IKZg
I think this would be a good idea because we already have random document access in ChatNoir, especially for large corpora like the ClueWebs. Hence it would allow us to reduce the size of the static part that we host in data.tira.io, and we would not have to maintain it twice.
I would like this URL to be more like https://static.chatnoir.eu/?index=cw22&uuid=CWLafZMrWbCnXKvqA7IKZg or similar.
Indeed, we can change the URL under which ChatNoir provides the random document access. For the proof of concept, I think we could stick with the existing URL for the moment.
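As a rough sketch of how the full-text links could be built without hard-coding the host, the snippet below assembles the URL from the index and the document UUID. The helper name chatnoir_document_url is hypothetical; the default base is the existing URL quoted above, and the base parameter only exists so that a different host could be swapped in later.

```python
from urllib.parse import urlencode

# Existing random-document-access URL quoted in this issue.
CHATNOIR_WEBCONTENT = "https://chatnoir-webcontent.web.webis.de/"


def chatnoir_document_url(index: str, uuid: str, base: str = CHATNOIR_WEBCONTENT) -> str:
    """Build a link to the full text of one document in ChatNoir.

    Hypothetical helper: 'index' and 'uuid' are the two query parameters
    visible in the example URL; 'base' can be replaced if the host changes.
    """
    return f"{base}?{urlencode({'index': index, 'uuid': uuid})}"


if __name__ == "__main__":
    print(chatnoir_document_url("cw22", "CWLafZMrWbCnXKvqA7IKZg"))
```

Keeping the host in one place means the visualizations on data.tira.io would not need to change if the URL is renamed after the proof of concept.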
This is highly related to this issue: https://github.com/tira-io/tira/issues/594
The goal here would be to be compatible with this potentially upcoming pull request for PyTerrier: https://github.com/terrier-org/pyterrier/pull/436