osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Collection format #24

Closed amallia closed 5 years ago

amallia commented 5 years ago

According to the following, the only information passed to the indexer is the collection name and the path where the collection can be found. I am wondering whether it makes sense to also pass the format of the collection.

python run.py prepare \
    --repo rclancy/anserini-test --tag latest \
    --collections [name]=[path] [name]=[path] ...

Also, which collections are used? I can see from https://github.com/osirrc2019/jig/blob/master/init.sh that they are going to be core17, core18, robust04. Will they actually be named this way?

ryan-clancy commented 5 years ago

What kind of formatting information were you thinking of? The collections are distributed in a standard format. Are you referring to whether we pass you the archive file vs. the extracted archive?

amallia commented 5 years ago

No, I was thinking that if we use standard formats like warc, trectext or trecweb, we don't really need to rely on the collection name to tell the indexer which parser to use. This makes everything a bit more abstract and solid. If the format is not provided, we have to implement a mapping between a collection name and its format, e.g. core18->trectext. This looks very fragile, since if the name gets changed from core18 to wapo, everybody has to change their Docker images.
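To make the fragility concrete, here is a minimal sketch of the two approaches. Every name in it (`NAME_TO_FORMAT`, `format_from_name`, `PARSERS`, `parse_collection`) is hypothetical and made up for illustration, not taken from the jig or any image, and the parsers are placeholders that only return a description string.

```python
# Hypothetical illustration only; none of these names come from the jig.

# Fragile: each image maintains its own table from collection name to format.
NAME_TO_FORMAT = {"core18": "trectext"}


def format_from_name(name):
    # Raises KeyError as soon as a collection is renamed (e.g. core18 -> wapo).
    return NAME_TO_FORMAT[name]


# Sturdier: parsers are keyed on the format, so the name is only a label.
# The lambdas are placeholders standing in for real parsers.
PARSERS = {
    "trectext": lambda path: f"parse {path} as trectext",
    "trecweb": lambda path: f"parse {path} as trecweb",
    "warc": lambda path: f"parse {path} as warc",
}


def parse_collection(path, fmt):
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported collection format: {fmt}")
    return parser(path)
```

With the format passed explicitly, renaming core18 to wapo changes nothing for the image: `parse_collection("/data/wapo", "trectext")` still picks the trectext parser, whereas `format_from_name("wapo")` fails.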

ryan-clancy commented 5 years ago

Sounds good to me... I'll add it this week.

ryan-clancy commented 5 years ago

What is the best way to support multiple collections for this? Each may have a different format. We could do something along these lines:

We modify the --collections parameter for the prepare command from --collections [name]=[path] ... to --collections [name]=[path]=[format] ....
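As a rough sketch of what parsing the proposed `[name]=[path]=[format]` triple could look like (this is not the code that ended up in the jig; `CollectionSpec` and `parse_collection_arg` are hypothetical names, and the sample path is made up):

```python
from typing import NamedTuple


class CollectionSpec(NamedTuple):
    name: str
    path: str
    fmt: str


def parse_collection_arg(arg: str) -> CollectionSpec:
    """Split one 'name=path=format' argument into its three parts."""
    # Take the name from the left and the format from the right, so a path
    # that itself contains '=' is still kept in one piece.
    name, _, rest = arg.partition("=")
    path, _, fmt = rest.rpartition("=")
    if not (name and path and fmt):
        raise ValueError(f"expected [name]=[path]=[format], got: {arg!r}")
    return CollectionSpec(name, path, fmt)


if __name__ == "__main__":
    for raw in ["core18=/data/wapo=trectext", "robust04=/data/disk45=trectext"]:
        print(parse_collection_arg(raw))
```

An argument in the old `[name]=[path]` form is rejected with a `ValueError` rather than silently misread, which would make the change of syntax visible to anyone still using the two-part form.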

As for how we pass this to the image being run, there are a couple of different ways to do it:

The goal would be to have the least friction for a developer creating an image - thoughts? Alternatives?

amallia commented 5 years ago

I think the former, which is name=format..., is better as it is less redundant.

Sorry for brevity.

ryan-clancy commented 5 years ago

Fixed in https://github.com/osirrc2019/jig/pull/41