Add classical ranking signal

pommedeterresautee commented 6 years ago

I have noticed that most DL matching papers focus on semantic matching, which makes sense because embeddings bring a new way to find support in snippet words that are not present in the query (but are still similar).

However, for some strange reason, other signals are totally forgotten in papers, even when the benchmarks use SVMRank with classical features as a baseline.

Since 2016 I have seen many recsys papers using these classical signals in neural nets. To do so, they bucket the continuous values to make them categorical, and then assign an embedding to each categorical value. After that, it depends on the model, but it seems that an MLP would be an OK solution (Linear + ReLU + Dropout).
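To make the bucket, embed, then MLP idea concrete, here is a minimal dependency-free sketch. It is not taken from any of the papers mentioned: the boundaries, the layer sizes, and every function name are hypothetical, and the random weights stand in for parameters that would be learned. In a real ranker the embedding would presumably be combined with the matching features before the MLP, which this sketch leaves out.

```python
import bisect
import random

random.seed(0)  # reproducible illustrative weights

# Hypothetical bucket boundaries for one continuous feature (e.g. a click count).
BOUNDARIES = [1, 10, 100, 1000]      # 5 buckets: <=1, <=10, <=100, <=1000, >1000
NUM_BUCKETS = len(BOUNDARIES) + 1
EMB_DIM, HIDDEN_DIM = 4, 8           # arbitrary sizes for the sketch


def rand_matrix(rows, cols):
    """Small random weights standing in for learned parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(NUM_BUCKETS, EMB_DIM)  # one vector per bucket
w_hidden = rand_matrix(HIDDEN_DIM, EMB_DIM)    # first Linear layer
w_out = rand_matrix(1, HIDDEN_DIM)             # second Linear layer -> scalar


def bucketize(value):
    """Map a continuous value to a categorical bucket index."""
    return bisect.bisect_left(BOUNDARIES, value)


def linear(weights, x):
    """Matrix-vector product (bias omitted for brevity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, xi) for xi in x]


def dropout(x, p, training):
    """Inverted dropout: zero units with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x
    return [0.0 if random.random() < p else xi / (1.0 - p) for xi in x]


def score(value, training=False, p=0.1):
    """Bucket -> embedding -> Linear + ReLU + Dropout -> Linear."""
    emb = embedding[bucketize(value)]
    hidden = dropout(relu(linear(w_hidden, emb)), p, training)
    return linear(w_out, hidden)[0]
```

Two values that fall in the same bucket (for example 5 and 7) get the same embedding and therefore the same score at inference time, which is exactly the information loss that bucketing trades for a learnable non-linear response.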

The most important signals in search are the age/date of publication, past popularity (short- and long-term clicks, for instance), and content type. The first two are continuous variables that require bucketing.

I am wondering whether you plan to add these signals in your work?

pommedeterresautee commented 5 years ago

Hi, I have been quite busy lately, but I still had time to run some tests with additional signals and observe the variation in scores.

Basically, when I added age and 2 content-type metadata fields, I gained around 1 absolute point of MAP. The age signal was converted to a categorical variable through bucketing (age of the content when the query was issued: 1 day, 3 days, 1 week, 2 weeks, 1 month, 6 months, 1 year, 2 years, 5 years, older). It's good, but not as strong as just separating title matching from snippet matching. I tried small modifications to the bucketing without any change. I also tried adding a 1D CNN of the query and the snippet to the additional signals (so that content type is compared to the query representation and the snippet representation) without any significant change in score.
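For reference, the age bucketing described above can be sketched as follows. This is an illustration rather than the code used in the experiments: it assumes 1 month = 30 days, 1 year = 365 days, and inclusive upper bounds, and the function and constant names are hypothetical.

```python
import bisect
from datetime import datetime

# Upper bounds in days, following the buckets listed above.
# Assumptions: 1 month = 30 days, 1 year = 365 days, bounds are inclusive.
AGE_BOUNDS_DAYS = [1, 3, 7, 14, 30, 180, 365, 730, 1825]
AGE_LABELS = ["1 day", "3 days", "1 week", "2 weeks", "1 month",
              "6 months", "1 year", "2 years", "5 years", "older"]


def age_bucket(published_at, query_at):
    """Return the bucket index for the content age at query time."""
    age_days = (query_at - published_at).total_seconds() / 86400.0
    age_days = max(age_days, 0.0)  # clamp content dated after the query
    return bisect.bisect_left(AGE_BOUNDS_DAYS, age_days)


# Example: content published two days before the query was issued.
idx = age_bucket(datetime(2020, 1, 1), datetime(2020, 1, 3))
label = AGE_LABELS[idx]
```

The resulting index is what would be fed to an embedding lookup, one vector per bucket.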

Two things still to try:

- adding the first words of a document as another thing to match (with the matrix and so on); the idea is that in many of our contents, the first sentences are an introduction and provide information about what is inside the document
- adding a past popularity signal (clicks) + bucketing

EdwardZH commented 5 years ago

You are absolutely right. We have found that user clicks are a strong signal for ranking, and we will soon publish a paper on query expansion. Recently, we have run lots of experiments on MS-MARCO, and the resources will be updated soon.

pommedeterresautee commented 5 years ago

Out of curiosity, how are you adding the popularity of a document? Through bucketing, or by directly adding a count as a dimension? Or the log count?
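To spell out the three options in the question, here is a small sketch that computes each encoding for one click count. The boundaries and names are hypothetical, and nothing here reflects what the repository authors actually do.

```python
import bisect
import math

# Hypothetical click-count boundaries; real ones would be tuned on data.
CLICK_BOUNDS = [0, 10, 100, 1000, 10000]


def popularity_features(clicks):
    """Return the three candidate encodings for one click count."""
    return {
        "bucket": bisect.bisect_left(CLICK_BOUNDS, clicks),  # categorical index
        "raw": float(clicks),                                # count as a dimension
        "log": math.log1p(clicks),                           # log(1 + count)
    }
```

The bucket index would go through an embedding lookup, while the raw and log variants are fed directly as a single numeric dimension; the log variant compresses the heavy tail that click counts usually have.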

EdwardZH commented 5 years ago

This work does not add popularity information; it adds clicked documents, and it is not my work. The paper is under review, so we can discuss it after it is published.

EdwardZH commented 4 years ago

Hi, for more on neural IR training, data augmentation, and more, please refer to our WWW 2020 paper, "Selective Weak Supervision for Neural Information Retrieval". Thank you for your attention.

thunlp / EntityDuetNeuralRanking

Add classical ranking signal #10