Open lukaszett opened 1 year ago
Got your _read_topics_xml PR - looks good.
The current read_topics_trec code is very old and wrapped in Terrier. Taking Terrier out of it would be desirable, i.e. untying Terrier from pt.get_dataset().get_topics().
You mean as in rewriting this https://github.com/terrier-org/pyterrier/blob/696a827a5323273b7b77f5ab8b652e1b9949d2c7/pyterrier/io.py#L311 in Python?
I could do this. Would it be acceptable to you to have custom fields in the returned dataframe?
Yes. Essentially, this is a long-term goal. The underlying Java code is https://github.com/terrier-org/terrier-core/blob/5.x/modules/batch-retrieval/src/main/java/org/terrier/applications/batchquerying/TRECQuery.java
It applies https://github.com/terrier-org/terrier-core/blob/5.x/modules/batch-retrieval/src/main/java/org/terrier/indexing/TRECFullTokenizer.java internally IIRC.
The code was probably first written around 2003. Happy for you to use a GitHub Gist to plan what it would look like first.
@seanmacavaney may know of a pure Python implementation for parsing SGML TREC topics files (or we can refer to ir_datasets)
This is what ir-datasets uses: https://github.com/allenai/ir_datasets/blob/master/ir_datasets/formats/trec.py#L286-L334
Thanks! I'll take a look at it tomorrow. I guess the easiest way would be to just wrap ir-datasets' functions with PT's existing function signature.
Sorry, I only just got around to working on this.
I think I need some clarification on how PT should parse topics: ir-datasets is quite strict about the expected file. You have to have a Querytype (just a namedtuple) ready that exactly matches all fields present in the file. Otherwise parsing the file will fail.
This clashes with PT's current implementation, which accepts more formats (basically you just need an ID and a query field) but ignores custom fields. It is trivial to wrap ir-datasets' parser for PyTerrier and use the default TREC querytype. However, this will probably break backwards compatibility.
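To illustrate why a fixed query type clashes with varied topic files, here is a minimal sketch. It is not ir-datasets' actual code: FixedQuery and build_strict are hypothetical names, and a plain namedtuple merely stands in for the kind of query type described above.

```python
from collections import namedtuple

# Hypothetical stand-in for a fixed query type; the field names are
# chosen for illustration only.
FixedQuery = namedtuple(
    "FixedQuery", ["query_id", "title", "description", "narrative"]
)


def build_strict(fields):
    """Build a FixedQuery, failing if fields do not match it exactly."""
    try:
        # A namedtuple raises TypeError for both missing and unexpected
        # keyword arguments, so any deviation from the schema fails.
        return FixedQuery(**fields)
    except TypeError as err:
        raise ValueError(
            f"topic does not match {FixedQuery._fields}: {err}"
        ) from err
```

With such a type, a topic that carries one extra tag, or lacks one expected tag, cannot be represented at all, which is the strictness discussed above.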
My original use case, which led me to discover the problem with PT's parser, was to parse a set of ~25 topic files from the TREC website that are not yet in IR-datasets. Sadly, TREC topic does not equal TREC topic. There are loads of differing variants of tags, fields, spelling, abbreviations, XML structure... (Now I appreciate the ready-to-use datasets provided by ir-datasets ;-) ) Basically none of the topic files I used matched IR-datasets' TREC querytype.
I spent some time writing code to infer a custom querytype from a given topic file. Writing this code got a bit out of hand, as I was not expecting there to be that many edge cases in topic files, so it's a bit hacky; rewriting IR-datasets' parser would have been simpler. This code is ultra permissive about the file's format. You can basically throw anything that barely looks like a TREC (XML) file at it and it will parse it. However, I'm not sure whether or not PT should be so permissive about the topics' formats.
If we decide that PT should accept many formats, the whole process of inferring querytypes can be cut short by simply modifying IR-datasets' parsing of topics to make up the QueryType on the fly.
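A minimal sketch of what making up the query type on the fly could look like. The names parse_topics_permissive and InferredQuery and the regular expressions are assumptions for illustration, not code from PyTerrier or ir-datasets. The sketch assumes topics are wrapped in <top>...</top>, that tags carry no attributes, and it does not strip prefixes such as "Number:".

```python
import re
from collections import namedtuple

# One topic per <top>...</top> block (assumed wrapper tag).
TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL | re.IGNORECASE)
# A field runs from its opening tag up to the next opening tag or block end.
FIELD_RE = re.compile(r"<(\w+)>(.*?)(?=<\w+>|\Z)", re.DOTALL)
# Closing tags are optional in many topic files, so they are dropped.
CLOSE_TAG_RE = re.compile(r"</\w+>")


def parse_topics_permissive(text):
    """Parse topics and infer the query type from the tags that occur."""
    records = []
    for block in TOPIC_RE.findall(text):
        block = CLOSE_TAG_RE.sub("", block)
        record = {}
        for tag, value in FIELD_RE.findall(block):
            # Collapse internal whitespace and newlines to single spaces.
            record[tag.lower()] = " ".join(value.split())
        records.append(record)

    # Union of all tag names, in first-seen order.
    field_names = list(dict.fromkeys(n for r in records for n in r))
    # rename=True replaces tag names that are not valid identifiers.
    query_type = namedtuple("InferredQuery", field_names, rename=True)
    # Fields absent from a topic are filled with an empty string.
    return [
        query_type(*[r.get(n, "") for n in field_names]) for r in records
    ]
```

Because the field set is the union over all topics, a file mixing closed and unclosed tags, or topics with differing tags, still yields one consistent tuple type.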
Thanks for the update @lukaszett. We appreciate your work here.
Strategically, I wonder if autodetect is just being a bit too lenient - if the code is only used through some kind of dataset library (either Pyterrier or IR-datasets) then we can tell it the format (including tags) to expect.
From a UoG perspective, perhaps @seanmacavaney and I need a discussion about the coexistence of PyTerrier datasets vs IR-datasets long term.
I guess from your perspective @lukaszett, you are looking for a parser for TREC topics that aren't included with PyTerrier datasets or IR-datasets? Then autodetection of the format might be useful.
PS: tagging @isoboroff in a discussion about the variations of TREC topic formats :-)
I am not sure whether the current implementation is a bug or the intended behaviour.
I noticed that with the current implementation of read_topics it is not possible to load any fields other than query and qid from TREC files, no matter what format is being used. All whitelisted tags are just combined into the query field of the resulting dataframe. Take for example the topics from the TREC microblog track in 2015:
Parsing this file with a whitelist of ['title', 'desc', 'narr'] would result in an extremely long query that consists of all tags concatenated.
This does not seem desirable to me, as I want to be able to distinguish the narrative and the description from the title. I've started working on a patch to allow for custom fields when parsing TREC files; however, before investing any more time, I just wanted to make sure that this actually is a bug.
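The difference between the two behaviours can be sketched as follows. This is not PyTerrier's code: to_row_concatenated, to_row_separate and the example topic are made up for illustration (the topic is not taken from the microblog track file), and plain dicts stand in for dataframe rows.

```python
def to_row_concatenated(topic, whitelist):
    """Merge all whitelisted tags into one query string, as described above."""
    query = " ".join(topic[tag] for tag in whitelist if tag in topic)
    return {"qid": topic["num"], "query": query}


def to_row_separate(topic, whitelist, query_tag="title"):
    """Keep one tag as the query and expose the others as their own fields."""
    row = {"qid": topic["num"], "query": topic.get(query_tag, "")}
    for tag in whitelist:
        if tag != query_tag and tag in topic:
            row[tag] = topic[tag]
    return row


# Hypothetical topic, already split into tags.
topic = {
    "num": "T1",
    "title": "example title",
    "desc": "example description",
    "narr": "example narrative",
}
whitelist = ["title", "desc", "narr"]

merged = to_row_concatenated(topic, whitelist)
split = to_row_separate(topic, whitelist)
```

In the second variant each row keeps desc and narr as separate keys, so a list of such rows could be turned into a dataframe with one column per field.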