hltcoe / patapsco

Cross language information retrieval pipeline

Better integration with ir_datasets #32

Open eugene-yang opened 2 years ago

eugene-yang commented 2 years ago

I just realized that we should have a more robust integration with ir_datasets. The current implementation hardcodes the fields from which we extract text in ir_datasets document and query objects.

For the document part, I think we can add a field attribute to DocumentsInputConfig to encode that information. This would also help when people pass in their own jsonl document files instead of just HC4.

For the topic part, we currently only support title/description/narrative as input, but some collections are more complicated. The current ir_datasets integration hardcodes the text field as the query, which is far too naive for most cases. I could parse the title/description/narrative format when it exists, but there are topics beyond these two structures. For example, TREC Spanish has two descriptions per language for each topic, and the TREC Fair Ranking track has something more complicated.

To support arbitrary fields, I believe we need to move away from using a dataclass for the Topic object. @cash, do you think that's a reasonable change?
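To make the proposal concrete, here is a minimal sketch of a topic container that keeps arbitrary string fields in a dict instead of fixed dataclass attributes. Everything in it is an assumption for illustration: the `Topic` class shown is not the existing Patapsco class, `ExampleQuery` is a stand-in for an ir_datasets query tuple, and its field names are made up rather than taken from any real collection.

```python
from typing import Dict, List, NamedTuple


class Topic:
    """Hypothetical topic holding arbitrary string-valued query fields."""

    def __init__(self, topic_id: str, lang: str, fields: Dict[str, str]):
        self.id = topic_id
        self.lang = lang
        self.fields = dict(fields)

    def query_text(self, names: List[str], sep: str = " ") -> str:
        """Join the requested fields, failing loudly on unknown names."""
        missing = [name for name in names if name not in self.fields]
        if missing:
            raise KeyError(f"Topic {self.id} has no field(s): {missing}")
        return sep.join(self.fields[name] for name in names)


class ExampleQuery(NamedTuple):
    """Stand-in for a query tuple with non-standard fields (names invented)."""

    query_id: str
    title: str
    description_a: str
    description_b: str


def topic_from_query(query, lang: str) -> Topic:
    """Copy every string field of a namedtuple-style query into a Topic."""
    data = query._asdict()
    topic_id = data.pop("query_id")
    # Keep only string-valued fields, as suggested later in this thread.
    fields = {k: v for k, v in data.items() if isinstance(v, str)}
    return Topic(topic_id, lang, fields)
```

With this shape, a collection with two descriptions per topic needs no new class; the config would only have to name which fields to use.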

And thanks, @seanmacavaney, for pointing this out :)

seanmacavaney commented 2 years ago

I think you’re probably pretty safe only supporting str-typed query fields — e.g. the TREC Fair Ranking query’s “keywords” are for a pretty specialised use case and not likely to be used directly as a query in most settings. text and title/description/narrative queries are by far the most common, and I think others could be handled easily enough just by specifying the desired field name in the yaml, as suggested for documents.

eugene-yang commented 2 years ago

Per our discussion, here are the things I will start working on.

  1. For documents, we add an optional field in the config file to select the text field in the document object returned from ir_datasets. We will also be using + to indicate concatenation of the fields. If the user does not specify this, we fall back to the generic text field. @seanmacavaney is this a reasonable choice?

  2. For topics, we are probably fine just supporting title/description/narrative for now. But unconventional field names appear more often in CLIR collections, so we might want to expand this to support arbitrary query fields (which requires changing the topic class in Patapsco as well).
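The document-side plan in item 1 could be sketched roughly as follows. This is an illustration under assumptions, not the Patapsco implementation: `parse_field_spec`, `extract_text`, and `ExampleDoc` are hypothetical names, and the document is assumed to be a namedtuple-style object like those ir_datasets returns.

```python
from typing import List, NamedTuple, Optional

DEFAULT_FIELD = "text"  # generic fallback described in item 1


def parse_field_spec(spec: Optional[str] = None) -> List[str]:
    """Split a spec such as "title+text" into field names."""
    if not spec:
        return [DEFAULT_FIELD]
    names = [part.strip() for part in spec.split("+")]
    if any(not name for name in names):
        raise ValueError(f"Malformed field spec: {spec!r}")
    return names


def extract_text(doc, spec: Optional[str] = None, sep: str = " ") -> str:
    """Concatenate the selected fields of a namedtuple-style document."""
    return sep.join(getattr(doc, name) for name in parse_field_spec(spec))


class ExampleDoc(NamedTuple):
    """Stand-in for a document tuple; not a real ir_datasets class."""

    doc_id: str
    title: str
    text: str


doc = ExampleDoc("d1", "A title", "Body text.")
print(extract_text(doc, "title+text"))  # A title Body text.
print(extract_text(doc))  # Body text.
```

Joining with a space is an arbitrary choice here; a newline separator may suit longer fields better.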

seanmacavaney commented 2 years ago

> If the user does not specify this, we fall back to the generic text field.

Yeah, this sounds reasonable to me. We have a goal to eventually ensure that all documents contain a "text" field, but for now note that not all documents may have one, and that you may need to raise an error if none exists.

You may also want to add some validation that the user only selected fields that are strings -- integrations with other packages have the same restriction.
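A rough sketch of both checks (the field exists, and it is a string) might look like the following. It assumes the document class is defined with `typing.NamedTuple` so that its annotations can be inspected; `ConfigError`, `validate_fields`, and `ExampleDoc` are hypothetical names rather than part of Patapsco or ir_datasets.

```python
from typing import List, NamedTuple, get_type_hints


class ConfigError(Exception):
    """Raised when the configured fields cannot be used as text."""


def validate_fields(doc_cls, names: List[str]) -> None:
    """Check that each selected field exists on doc_cls and is typed as str."""
    available = getattr(doc_cls, "_fields", ())
    hints = get_type_hints(doc_cls)
    for name in names:
        if name not in available:
            raise ConfigError(
                f"{doc_cls.__name__} has no field {name!r}; "
                f"available fields: {list(available)}"
            )
        if hints.get(name) is not str:
            raise ConfigError(
                f"Field {name!r} of {doc_cls.__name__} is not a string field"
            )


class ExampleDoc(NamedTuple):
    """Stand-in document type with one non-string field and no "text" field."""

    doc_id: str
    title: str
    body: str
    length: int


validate_fields(ExampleDoc, ["title", "body"])  # passes silently
```

Calling `validate_fields(ExampleDoc, ["text"])` raises `ConfigError`, which covers the case where the fallback "text" field does not exist; selecting `length` is rejected because it is not a string.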