Closed amallia closed 5 years ago
What kind of formatting information were you thinking of? The collections are distributed in a standard format. Are you referring to whether we pass you the archive file vs. the extracted archive?
No, I was thinking that if we use standard formats like warc, trectext, or trecweb, we don't really need to rely on the collection name to inform the indexer which parser to use. This makes everything a bit more abstract and solid.
If the format is not provided, we have to implement a mapping between a collection name and its format, e.g. core18 -> trectext. This looks very fragile: if the name gets changed from core18 to wapo, everybody has to change their docker images.
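To make the point concrete, here is a minimal sketch of an indexer choosing its parser from the format string rather than from the collection name. The parser functions and the `select_parser` helper are hypothetical stand-ins, not part of the jig or of any particular image.

```python
# Hypothetical parser stubs; a real image would plug in its own readers.
def parse_warc(path):
    return ("warc", path)


def parse_trectext(path):
    return ("trectext", path)


def parse_trecweb(path):
    return ("trecweb", path)


# Keyed by format, so renaming a collection (core18 -> wapo) changes nothing here.
PARSERS = {
    "warc": parse_warc,
    "trectext": parse_trectext,
    "trecweb": parse_trecweb,
}


def select_parser(fmt):
    """Return the parser for a format string, ignoring case."""
    try:
        return PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported collection format: {fmt!r}") from None
```

With this shape, an image only needs to know which formats it supports; the collection name becomes a label and nothing more.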
Sounds good to me... I'll add it this week.
What is the best way to support multiple collections for this? Each may have a different format. We could do something along these lines:
We modify the --collections parameter for the prepare command from --collections [name]=[path] ... to --collections [name]=[path]=[format] ....
As for how we pass this to the image being run, there are a couple of different ways to do it:
- --collections [name]=[format] ... instead of --collections [name]
- --format with a mapping from name to format
The goal would be to have the least friction for a developer creating an image - thoughts? Alternatives?
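As a rough illustration of the proposal above, the sketch below splits a [name]=[path]=[format] value into its parts and derives the [name]=[format] form that could be handed to the image. `CollectionSpec`, `parse_collection_spec`, and `to_image_args` are hypothetical names, the paths are made up, and the sketch assumes that neither the name nor the path contains an `=` character.

```python
from typing import List, NamedTuple


class CollectionSpec(NamedTuple):
    name: str
    path: str
    fmt: str


def parse_collection_spec(spec: str) -> CollectionSpec:
    """Split one '[name]=[path]=[format]' value into its three fields."""
    parts = spec.split("=")
    # Assumption: '=' never appears inside a name or path.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected [name]=[path]=[format], got {spec!r}")
    return CollectionSpec(*parts)


def to_image_args(specs: List[CollectionSpec]) -> List[str]:
    """Build the '[name]=[format]' values passed on to the image."""
    return [f"{s.name}={s.fmt}" for s in specs]


specs = [
    parse_collection_spec("robust04=/data/robust04=trectext"),
    parse_collection_spec("core18=/data/wapo=trectext"),
]
image_args = to_image_args(specs)
```

Keeping the path out of what the image sees matches the first option in the list: the image receives only the name and the format.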
I think the former, name=format, is better as it is less redundant.
Sorry for brevity.
According to the following the only information passed to the indexer is the collection name and the path where the collection can be found. I am wondering if it makes sense to pass the format of the collection.
Also, what are the collections used? I can see from https://github.com/osirrc2019/jig/blob/master/init.sh that they are going to be core17, core18, and robust04. Will they actually be named this way?