Closed lgienapp closed 2 years ago
Hi @lgienapp, thanks for your message. I must say that I started planning to have as much flexibility as possible for the whole framework. At some stage, I even wanted to have flexibility in the column names (i.e., you could change from docid to _documentname if you wanted). But flexibility comes with lots of parsing errors that are frequently hard to identify.
I liked your suggestion of enforcing str
type to both docid
and query
.
Would you mind creating a PR for that, please? Really appreciated your help!
Right now, both
TrecRun
andTrecQrel
have whatever datatypes pandas automatically infers for each column.https://github.com/joaopalotti/trectools/blob/5c1d56e9cf955f45b5a1780ee6a82744d31e7a79/trectools/trec_run.py#L43
This leads to inconsistencies across different experiments based on the
doc_id
andquery
names used: some get inferred asint64
(for example MSMARCO-Passage), some asstr
(for example ClueWeb12). While this is not a problem when working in the "intended" file-based way of loading and evaluating runs/qrels, it incurs some frustration when accessing theTrecRun.run_data
andTrecQrel.qrels_data
fields directly in in-memory experiments (related issue: https://github.com/joaopalotti/trectools/issues/13).Here, the experiment code has to be customized to the inferred pandas-dtype for each collection. For example, when an in-memory run is to be evaluated with qrels loaded from disk, one has to make sure that the custom
run.run_data
matches the dtypes of the qrels, which may change based on the collection used. In my case, I assumedid
fields to be string-based and ran into a lot of merge-on-different-dtypes errors.Therefore, I would argue towards enforcing a consistent dtype (
str
) when loading runs/qrels from disk for thedoc_id
andquery
columns. This would not affect the file-based workflow in any way but greatly improve the memory-based experience and usability by providing a consistent type mapping for all collections.