terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

fix parsing of trecxml topics #414

Closed lukaszett closed 8 months ago

lukaszett commented 8 months ago

The current implementation of _read_topics_trecxml expects the topic number to be supplied as an attribute of the topic tag. However, this does not seem to be how TREC currently formats topics in XML. See, for example, the health misinfo topics:

<topic>
<number>101</number>
<query>ankle brace achilles tendonitis</query>
<description>
Will wearing an ankle brace help heal achilles tendonitis?
</description>
<narrative>
Achilles tendonitis is a condition where one experiences pain in the Achilles tendon located near the heel. An ankle brace is usually worn around the ankles to protect and limit movement. A very useful document would discuss the effectiveness of using ankle braces to help heal Achilles tendonitis. A useful document would help a user make a decision about the use of ankle braces for treating tendonitis by providing information about recommended treatments for Achilles tendonitis, ankle braces, or both.
</narrative>
<disclaimer>
We do not claim to be providing medical advice, and medical decisions should never be made based on the stance we have chosen. Consult a medical doctor for professional advice.
</disclaimer>
<stance>unhelpful</stance>
<evidence>
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134723/
</evidence>
</topic>

My change should not break backwards compatibility. I also added a test to confirm this (see topic 3 still using an attribute to set the number).
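
To illustrate the idea, here is a minimal sketch of a parser that accepts both styles. It is not the actual PyTerrier code: the function name read_topic_numbers, the <topics> root element, the sample data for topic 3, and the choice to prefer the child element over the attribute are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def read_topic_numbers(xml_text):
    """Return (number, query) pairs from TREC-style XML topics.

    Hypothetical helper for illustration only. It accepts the topic
    number either as a <number> child element or as a number attribute.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for topic in root.iter("topic"):
        # Assumption: prefer the child element, fall back to the attribute.
        number = topic.findtext("number")
        if number is None:
            number = topic.get("number")
        if number is None:
            raise ValueError("topic has neither a number element nor attribute")
        query = (topic.findtext("query") or "").strip()
        rows.append((number.strip(), query))
    return rows


# Sample input mixing both styles; the second topic is made up.
SAMPLE = """<topics>
  <topic>
    <number>101</number>
    <query>ankle brace achilles tendonitis</query>
  </topic>
  <!-- older style: number given as an attribute -->
  <topic number="3">
    <query>hypothetical attribute-style query</query>
  </topic>
</topics>"""

print(read_topic_numbers(SAMPLE))
# [('101', 'ankle brace achilles tendonitis'), ('3', 'hypothetical attribute-style query')]
```

Checking the child element first and falling back to the attribute keeps older topic files readable without a separate code path.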

cmacdonald commented 8 months ago

Nice job. Could you add your name and affiliation to the README.md list please?

> (see topic 3 still using an attribute to set the number).

Could you add an XML comment to the test file that says this? For example, <!-- Use attribute rather than tag --> or something similar.

cmacdonald commented 8 months ago

super, thanks @lukaszett