Enforce explicit dtypes throughout all DataFrame-based classes

Right now, both TrecRun and TrecQrel have whatever datatypes pandas automatically infers for each column.

https://github.com/joaopalotti/trectools/blob/5c1d56e9cf955f45b5a1780ee6a82744d31e7a79/trectools/trec_run.py#L43

This leads to inconsistencies across different experiments based on the doc_id and query names used: some get inferred as int64 (for example MSMARCO-Passage), some as str (for example ClueWeb12). While this is not a problem when working in the "intended" file-based way of loading and evaluating runs/qrels, it incurs some frustration when accessing the TrecRun.run_data and TrecQrel.qrels_data fields directly in in-memory experiments (related issue: https://github.com/joaopalotti/trectools/issues/13).

Here, the experiment code has to be customized to the inferred pandas-dtype for each collection. For example, when an in-memory run is to be evaluated with qrels loaded from disk, one has to make sure that the custom run.run_data matches the dtypes of the qrels, which may change based on the collection used. In my case, I assumed id fields to be string-based and ran into a lot of merge-on-different-dtypes errors.

Therefore, I would argue towards enforcing a consistent dtype (str) when loading runs/qrels from disk for the doc_id and query columns. This would not affect the file-based workflow in any way but greatly improve the memory-based experience and usability by providing a consistent type mapping for all collections.

joaopalotti / trectools

Enforce explicit dtypes throughout all DataFrame-based classes #37