joaopalotti / trectools

A simple toolkit to process TREC files in Python.
https://pypi.python.org/pypi/trectools
BSD 3-Clause "New" or "Revised" License
163 stars 32 forks source link

Enforce explicit dtypes throughout all DataFrame-based classes #37

Closed lgienapp closed 2 years ago

lgienapp commented 2 years ago

Right now, both TrecRun and TrecQrel have whatever datatypes pandas automatically infers for each column.

https://github.com/joaopalotti/trectools/blob/5c1d56e9cf955f45b5a1780ee6a82744d31e7a79/trectools/trec_run.py#L43

This leads to inconsistencies across different experiments based on the doc_id and query names used: some get inferred as int64 (for example MSMARCO-Passage), some as str (for example ClueWeb12). While this is not a problem when working in the "intended" file-based way of loading and evaluating runs/qrels, it incurs some frustration when accessing the TrecRun.run_data and TrecQrel.qrels_data fields directly in in-memory experiments (related issue: https://github.com/joaopalotti/trectools/issues/13).

Here, the experiment code has to be customized to the inferred pandas-dtype for each collection. For example, when an in-memory run is to be evaluated with qrels loaded from disk, one has to make sure that the custom run.run_data matches the dtypes of the qrels, which may change based on the collection used. In my case, I assumed id fields to be string-based and ran into a lot of merge-on-different-dtypes errors.

Therefore, I would argue towards enforcing a consistent dtype (str) when loading runs/qrels from disk for the doc_id and query columns. This would not affect the file-based workflow in any way but greatly improve the memory-based experience and usability by providing a consistent type mapping for all collections.

joaopalotti commented 2 years ago

Hi @lgienapp, thanks for your message. I must say that I started planning to have as much flexibility as possible for the whole framework. At some stage, I even wanted to have flexibility in the column names (i.e., you could change from docid to _documentname if you wanted). But flexibility comes with lots of parsing errors that are frequently hard to identify.

I liked your suggestion of enforcing str type to both docid and query.

Would you mind creating a PR for that, please? Really appreciated your help!