terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Read fields other than query and qid from trec topics #415

Open lukaszett opened 11 months ago

lukaszett commented 11 months ago

I am not sure whether the current implementation is a bug or the intended behaviour.

I noticed that with the current implementation of read_topics it is not possible to load any other fields than query and qid from trec files no matter what format is being used. All whitelisted tags are just combined into the query field of the resulting dataframe.

Take for example the topics from the TREC microblog track in 2015:

<top>
<num> Number: MB446

<title>
lacrosse tournaments

<desc> Description:
Return announcements of and commentary regarding lacrosse tournaments.

<narr> Narrative:
The user's daughter likes lacrosse and wants to attend some upcoming
lacrosse tournaments.  She wants to see any tweets that relate
to a tournament.  Tweets about a tournament from its participants
including tweets that express anticipation of the tournament or traveling to/from the tournament or tweets that comment on the quality of a
tournament are relevant.
</top>

Parsing this file with a whitelist of ['title', 'desc', 'narr'] would result in an extremely long query that consists of all tags concatenated.

This does not seem desirable to me as I want to be able to discern the narrative and the description from the title. I've started working on a patch to allow for custom fields when parsing trec files, however before investing any more time I just wanted to make sure that this actually is a bug.
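A minimal sketch of the per-field behaviour described above, using only the standard library. The function name `parse_trec_topics`, the returned list of dicts, and the label-stripping rule are assumptions for illustration; this is not PyTerrier's `read_topics` nor the patch itself, and it only handles the `<top>`-style layout shown in the example.

```python
import re

# Opening tags such as <num>, <title>, <desc>, <narr>.
TAG_RE = re.compile(r"<(\w+)>")
# Human-readable labels that follow some tags, e.g. "Number:" or "Description:".
LABEL_RE = re.compile(r"^\s*(Number|Description|Narrative)\s*:\s*", re.IGNORECASE)


def parse_trec_topics(text, fields=("title", "desc", "narr")):
    """Return one dict per <top> block, keeping each whitelisted tag separate."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        # Drop optional closing tags such as </title> so they do not leak into values.
        block = re.sub(r"</\w+>", "", block)
        # Splitting on a capturing group yields [preamble, tag1, body1, tag2, body2, ...].
        parts = TAG_RE.split(block)
        record = {}
        for tag, body in zip(parts[1::2], parts[2::2]):
            tag = tag.lower()
            # Strip the label, then collapse all whitespace runs to single spaces.
            value = " ".join(LABEL_RE.sub("", body).split())
            if tag == "num":
                record["qid"] = value
            elif tag in fields:
                record[tag] = value
        topics.append(record)
    return topics


SAMPLE = """<top>
<num> Number: MB446

<title>
lacrosse tournaments

<desc> Description:
Return announcements of and commentary regarding lacrosse tournaments.

<narr> Narrative:
The user's daughter likes lacrosse and wants to attend some upcoming
lacrosse tournaments.
</top>
"""

topics = parse_trec_topics(SAMPLE)
print(topics[0]["qid"], "|", topics[0]["title"])
```

Each whitelisted tag ends up under its own key instead of being concatenated into one query string, so the title can still be told apart from the description and narrative.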

cmacdonald commented 11 months ago

Got your _read_topics_xml PR - looks good.

The current read_topics_trec code is very old, and is a wrapper around Terrier. Bringing Terrier out of it would be desirable, i.e. untying Terrier from pt.get_dataset().get_topics().

lukaszett commented 11 months ago

You mean as in rewriting this https://github.com/terrier-org/pyterrier/blob/696a827a5323273b7b77f5ab8b652e1b9949d2c7/pyterrier/io.py#L311 in python?

I could do this - Would it be acceptable to have custom fields in the returned dataframe for you?

cmacdonald commented 11 months ago

Yes. Essentially, a long term goal. The underlying java code is https://github.com/terrier-org/terrier-core/blob/5.x/modules/batch-retrieval/src/main/java/org/terrier/applications/batchquerying/TRECQuery.java

It applies https://github.com/terrier-org/terrier-core/blob/5.x/modules/batch-retrieval/src/main/java/org/terrier/indexing/TRECFullTokenizer.java internally IIRC.

Code probably first written around 2003. Happy for you to use a GitHub Gist to plan what it would look like first.

cmacdonald commented 11 months ago

@seanmacavaney may know of a pure Python parser for SGML TREC topics files (or we can refer to ir_datasets)

seanmacavaney commented 11 months ago

This is what ir-datasets uses: https://github.com/allenai/ir_datasets/blob/master/ir_datasets/formats/trec.py#L286-L334

lukaszett commented 11 months ago

Thanks! I'll take a look at it tomorrow. I guess the easiest way would be to just wrap ir-dataset's functions with pt's existing function signature.

lukaszett commented 10 months ago

Sorry, I only just got around to working on this.

I think I need some clarification on how PT should parse topics: ir-datasets is quite strict about the expected file. You have to have a Querytype (just a namedtuple) ready that exactly matches all fields present in the file. Otherwise parsing the file will fail.

This clashes with PT's current implementation, which accepts more formats (basically you just need an ID and a query field) but ignores custom fields. It is trivial to wrap ir-datasets' parser for pyterrier and use the default trec querytype. However, this will probably break backwards compatibility.

My original use case, which led me to discover the problem with PT's parser, was to parse a set of ~25 topic files from the trec website that are not yet in IR-datasets. Sadly, trec topic does not equal trec topic. There are loads of differing variants of tags, fields, spelling, abbreviations, xml structure... (Now I appreciate the ready-to-use datasets provided by ir-datasets ;-) ) Basically none of the topic files I used matched IR-dataset's Trec querytype.

I spent some time writing code to infer a custom querytype from a given topic file. Writing this code got a bit out of hand as I was not expecting there to be that many edge cases in topic files, so it's a bit hacky - rewriting IR dataset's parser would have been simpler. This code is ultra permissive about the file's format. You can basically throw anything that barely looks like a trec (xml) file at it and it will parse it. However, I'm not sure whether or not PT should be so permissive about the topics' formats.

If we decide that PT should accept many formats, the whole process of inferring querytypes can be cut short by simply modifying IR-datasets parsing of topics to make up the QueryType on the fly.
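To illustrate what making up the query type on the fly could look like, here is a rough sketch that builds a namedtuple from whatever tags appear in a file. The helper `infer_query_type`, the type name `InferredQuery`, and the `query_id` field name are hypothetical; this is not ir-datasets' API, and it assumes tag names are valid Python identifiers.

```python
import re
from collections import namedtuple


def infer_query_type(text, id_tag="num", container_tag="top"):
    """Build a namedtuple type whose fields are the opening tags seen in `text`."""
    tags = []
    for tag in re.findall(r"<(\w+)>", text):
        tag = tag.lower()
        # Skip the topic container and the id tag; keep first-seen order.
        if tag not in (container_tag, id_tag) and tag not in tags:
            tags.append(tag)
    # Every inferred field defaults to "" so topics missing a tag still fit the type.
    return namedtuple("InferredQuery", ["query_id", *tags], defaults=[""] * len(tags))


SAMPLE = """<top>
<num> Number: MB446
<title>
lacrosse tournaments
<desc> Description:
Return announcements of and commentary regarding lacrosse tournaments.
<narr> Narrative:
The user's daughter likes lacrosse.
</top>
"""

QueryType = infer_query_type(SAMPLE)
query = QueryType("MB446", title="lacrosse tournaments")
print(QueryType._fields)
```

The trade-off raised in the discussion still applies: a type inferred this way accepts almost any file, so misspelled or unexpected tags become fields silently rather than causing a parse error.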

cmacdonald commented 10 months ago

Thanks for the update @lukaszett. We appreciate your work here.

Strategically, I wonder if autodetect is just being a bit too lenient - if the code is only used through some kind of dataset library (either Pyterrier or IR-datasets) then we can tell it the format (including tags) to expect.

From a UoG perspective, perhaps @seanmacavaney and I need a discussion about the coexistence of PyTerrier datasets vs IR-datasets long term.

I guess from your perspective @lukaszett, you are looking for a parser for TREC topics that aren't included with PyTerrier datasets or IR-datasets? Then autodetect of format might be useful.

cmacdonald commented 10 months ago

PS: tagging @isoboroff in a discussion about the variations of TREC topic formats :-)