hltcoe / patapsco

Cross language information retrieval pipeline

Better `ir_datasets` integration. #37

Open eugene-yang opened 2 years ago

eugene-yang commented 2 years ago
  1. User-specified fields for ir_datasets.Docs object. If no field is provided, fall back to default_text() (a future convention ir_datasets is currently working on). If default_text() is not implemented, fall back to text field.

  2. Support arbitrary fields in TopicProcessor for arbitrary query fields in ir_datasets. This is particularly important for integrating the mt_* and ht_* fields in the HC4 interface in ir_datasets. (citing discussion)

  3. Sample configs for running PSQ and human translated queries. A severely truncated translation table is added to ./samples/data for demo purposes.
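The fallback order in item 1 could be sketched roughly as below. This is only an illustration, not the code in this PR: `extract_doc_text`, `PlainDoc`, and `TitledDoc` are hypothetical names, and the two document classes are stand-ins for the named tuples that ir_datasets yields, so the snippet runs without downloading a dataset.

```python
from typing import NamedTuple, Optional, Sequence


def extract_doc_text(doc, fields: Optional[Sequence[str]] = None) -> str:
    """Pick the text of a document using the fallback order from item 1.

    Hypothetical helper: the name and signature are assumptions.
    """
    # 1. User-specified fields win; join them in the order given.
    if fields:
        return " ".join(str(getattr(doc, name)) for name in fields)
    # 2. Otherwise use default_text() if the document type implements it.
    default_text = getattr(doc, "default_text", None)
    if callable(default_text):
        return default_text()
    # 3. Last resort: the plain `text` field.
    return doc.text


# Stand-in document types (assumptions, not real ir_datasets classes).
class PlainDoc(NamedTuple):
    doc_id: str
    text: str


class TitledDoc(NamedTuple):
    doc_id: str
    title: str
    text: str

    def default_text(self) -> str:
        return f"{self.title} {self.text}"


plain = extract_doc_text(PlainDoc("d1", "body only"))
titled = extract_doc_text(TitledDoc("d2", "A title", "and a body"))
chosen = extract_doc_text(TitledDoc("d2", "A title", "and a body"), fields=["title"])
```

Checking for `default_text` with `getattr` rather than calling it directly keeps the helper usable with document types that do not implement the method.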

close #32

cash commented 2 years ago

Overall this looks good. I still need to test locally. I will probably also cut down the sample config files. It looks like they include every option, and I want the samples to include just the options that are being used.

cash commented 2 years ago

ir-datasets hasn't done a release since HC4 has been added. @eugene-yang do you know if Sean is planning to make a new release soon or should we depend on a git commit?

eugene-yang commented 2 years ago

> ir-datasets hasn't done a release since HC4 has been added. @eugene-yang do you know if Sean is planning to make a new release soon or should we depend on a git commit?

Basing the requirements on a git commit might break after a new version of ir-datasets is released. I think a better solution is to have @seanmacavaney tag the current version on master as a pre-release version so pip can resolve it as a version that is >=0.5.0 and can gracefully transition to later versions.

@seanmacavaney any thoughts on this?

cash commented 2 years ago

You can set pip to pull a particular commit

seanmacavaney commented 2 years ago

I'm happy to do a release of ir_datasets. On it now.

seanmacavaney commented 2 years ago

Done -- ir-datasets==0.5.1 is now on pypi, including hc4, neuclir, etc.

cash commented 2 years ago

@seanmacavaney thanks!

eugene-yang commented 2 years ago

@cash are we able to merge this?

cash commented 2 years ago

@eugene-yang I got stuck trying to download the data - tried multiple times and it never finished. I'll get back to testing this and fixing issues that we've identified.

eugene-yang commented 2 years ago

@cash I updated the download script a couple of weeks ago because the base URL changed for Common Crawl. Let me know if you still have issues downloading HC4.