tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

Colab Crash while running Similar Code to Ranking Example #166

Closed nmonette closed 1 year ago

nmonette commented 1 year ago

Hello,

When I run the following code in Colab (essentially copied from the TF-DF ranking example), it kills my kernel. Even when I lower the number of examples from 84k to 16k, it still crashes, which is frustrating.

data = data[["pop"] + [i for i in data.columns if i != "pop"]]

tr = tfdf.keras.pd_dataframe_to_tf_dataset(data.iloc[:16000,:25], label="pop", task=tfdf.keras.Task.RANKING)
te = tfdf.keras.pd_dataframe_to_tf_dataset(f22, label="pop", task=tfdf.keras.Task.RANKING)

model = tfdf.keras.GradientBoostedTreesModel(
    task=tfdf.keras.Task.RANKING,
    ranking_group="position",
    num_trees=50)

model.fit(tr)
pred = model.predict(te)

Can anyone provide any assistance here?

To elaborate a little, this was the error in the runtime log:


WARNING | FATAL 2023-02-26T02:30:35.044723496+00:00 loss_interface.cc:146] The number of items in the group "2010144485757764457" is 2791 and is greater than kMaximumItemsInRankingGroup=2000. This is likely a mistake in the generation of the configuration of the group column.
rstz commented 1 year ago

Hi,

it looks like you have stumbled upon an issue that we've addressed recently; the fix will ship in the next version of TF-DF. It will both avoid crashing Colab and increase the limit to 4096 items per group.

As a quick fix, make sure that each ranking group has at most 2000 items (e.g. remove items with low relevance).
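A minimal sketch of that workaround in pandas (the `cap_ranking_groups` helper and the column names are illustrative, not part of TF-DF): keep only the most relevant rows in each ranking group before building the dataset.

```python
import pandas as pd

MAX_ITEMS_PER_GROUP = 2000  # the limit reported in the error message

def cap_ranking_groups(df, group_col, relevance_col, max_items=MAX_ITEMS_PER_GROUP):
    """Keep only the `max_items` most relevant rows of each ranking group."""
    return (
        df.sort_values(relevance_col, ascending=False)  # highest relevance first
          .groupby(group_col, group_keys=False)
          .head(max_items)                              # first max_items rows per group
    )
```

You would apply this to your DataFrame before calling `tfdf.keras.pd_dataframe_to_tf_dataset`.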

Note that this error often (but not always) indicates a modeling issue: it looks like you have a ranking problem with many examples in the same group. In the typical document / query setting, this means your dataset contains a single query with a large number of documents (and corresponding relevance scores) associated with it. This is unlikely in most contexts, so make sure it is actually what you want. You should also be aware that the running time grows quadratically (AFAIR) with the number of items per group, so you might see performance degradations.

Please let us know if you believe your use case warrants a high number of items per group, since we're discussing increasing this parameter even further.

nmonette commented 1 year ago

I do think my case warrants a high number of items per group - in my particular example, I am trying to rank and recommend soccer players from the FIFA video game based on queries of their positions (i.e. in my case there are 11 positions).

There is another part that you might be able to help with, because I am a little confused about this:

Are there any guides as to making predictions with the ranking model? I am a little confused on how the querying actually works from a prediction perspective. Specifically, if a model is trained on a certain query, how would I be able to apply that model to an entire dataset for each query?

Thanks :)

rstz commented 1 year ago

I'm currently in the process of improving the existing ranking guide :)

At a high level, ranking works as follows. You have a set of queries (e.g. search queries) and a set of documents (e.g. web pages). The queries are called groups by TF-DF. The rows of your dataset should be combinations of queries and documents, and the columns should be features of the query, of the document, or of both (e.g. some similarity score between query and document, number of words in the query, number of words in the document, ...). Each row should also have a relevance, which is an integer between 0 and 5, where 0 means "no similarity" and 5 means "perfect match". TF-DF will (I think) not complain if you use a relevance larger than 5, but there is no benefit in doing so.
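As a concrete sketch of that layout (all column names here are illustrative; TF-DF only cares about whichever columns you designate as the label and as the ranking group):

```python
import pandas as pd

# Each row is one (query, document) pair. "query_id" would serve as the
# ranking group, and "relevance" as the label (an integer in [0, 5]).
rows = pd.DataFrame({
    "query_id":        ["q1", "q1", "q1", "q2", "q2"],
    "query_num_words": [3, 3, 3, 2, 2],            # feature of the query
    "doc_num_words":   [120, 45, 300, 80, 15],     # feature of the document
    "similarity":      [0.9, 0.2, 0.5, 0.7, 0.1],  # feature of the pair
    "relevance":       [5, 0, 2, 4, 0],            # 0 = no match, 5 = perfect
})
```

Such a frame would then be converted with `tfdf.keras.pd_dataframe_to_tf_dataset(rows, label="relevance", task=tfdf.keras.Task.RANKING)`, with `ranking_group="query_id"` passed to the model constructor, as in the code at the top of this issue.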

In order to apply the model, you need the query (that's usually user-supplied), and a set of N "potentially relevant" candidate documents for the query (that's usually supplied by your program; this could be "all documents that share at least one token with the query" or "all documents" if the document space is very small, or ...). For each pair (user query, candidate document), you need to compute the features used when training the model. In a way, this gives you a "serving dataset" of N rows.

You use the model that you trained to generate a score for each of the N candidate documents. Then, you pick the candidate document that returned the highest score for your query.
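A toy sketch of that last step, with placeholder scores standing in for the output of `model.predict` on the serving dataset (the model itself is not shown):

```python
# All candidates belong to the same user query. `scores` stands in for
# model.predict(serving_ds), which returns one score per candidate row.
candidates = ["doc_a", "doc_b", "doc_c"]
scores = [0.13, 0.87, 0.42]  # placeholder per-row model outputs

# Rank the candidates for this query, highest score first,
# then pick the top-scoring document.
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
best_doc, best_score = ranked[0]
```

Note that only the relative order of the scores within one query matters; the absolute values are not calibrated probabilities.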

rstz commented 1 year ago

Closing this with three more comments:

nmonette commented 1 year ago

Hi - I just wanted to follow up on the prediction tutorial, because I think something could use clarification.

When I train the model on multiple queries/groups (as the model requires), how does the model know which query we are trying to rank when we call predict()? I'm just confused because we train the model on multiple queries but do not specify the query when we predict.