thunlp / EntityDuetNeuralRanking

Entity-Duet Neural Ranking Model
MIT License
153 stars 20 forks

Next work -> snippet? #5

Closed by pommedeterresautee 6 years ago

pommedeterresautee commented 6 years ago

I have noticed that, like a few other teams, you focus on finding entities and matching entities from the query to entities from the document. It seems to me that no one is working on snippets, either because the datasets everybody works on already provide them or because teams just use document titles.

In my own experiments with your implementations, I have noticed that the way I build the snippet has a large impact on performance. In particular, snippets that are too long or too short carry too much or too little information, and it is quite obvious that the words around the matching ones can provide a lot of signal. Of course, this is pure feature engineering for a supposedly end-to-end learned model (it requires defining where end-to-end starts). In a production environment you mostly have access to the full document, and when you build a snippet you decide in some way how much contextual information to add; in my case this has a lot of impact. Maybe for your next work, if it is still about ranking, you may want to look at this aspect :-)
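To make the idea concrete, here is a minimal sketch of one way to build such a snippet: keep only the document words that fall within a fixed window around each query-term match. It is not the method used in this repository or in my experiments; the function name `build_snippet` and the parameters `window` and `max_tokens` are hypothetical, and whitespace tokenization with exact lowercase matching is a simplifying assumption.

```python
import re


def build_snippet(document, query, window=5, max_tokens=50):
    """Keep document tokens within `window` positions of any query-term match.

    Hypothetical helper for illustration: whitespace tokenization and exact
    lowercase matching are simplifying assumptions.
    """
    tokens = document.split()
    query_terms = {t.lower() for t in re.findall(r"\w+", query)}

    keep = set()
    for i, tok in enumerate(tokens):
        # Strip punctuation so that "fox," still matches the query term "fox".
        norm = re.sub(r"\W+", "", tok.lower())
        if norm in query_terms:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            keep.update(range(start, end))

    if not keep:
        # No query term found: fall back to the head of the document.
        return " ".join(tokens[:max_tokens])

    # Preserve document order and cap the snippet length.
    selected = sorted(keep)[:max_tokens]
    return " ".join(tokens[i] for i in selected)


if __name__ == "__main__":
    doc = "The quick brown fox jumps over the lazy dog near the river bank today"
    print(build_snippet(doc, "fox", window=2))
```

Here `window` controls how much context surrounds each match and `max_tokens` caps the total length, so the trade-off between too long and too short can be tuned like any other hyperparameter.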

EdwardZH commented 6 years ago

Thank you for your comments; you are absolutely right. This work is a preliminary exploration of knowledge in ranking models, and a lot of work remains for the next step. I think ranking is important for many NLP tasks, so this work may encourage researchers to consider more neural IR models for NLP tasks.

pommedeterresautee commented 6 years ago

I fully agree. My point is that in almost all NLP tasks (except classification), long documents (meaning anything from one paragraph to several paragraphs) are not taken into account. It is the same in IR. I am quite certain there is an opportunity to take here: no one seems to be working on it, "attention" (and similar approaches) is known to improve performance, and many end users (outside research, meaning mainly companies and civil society) would be interested in it. Maybe some dataset is missing. Anyway, this was just to share an idea, so I am closing the "issue" :-)