seatgeek / api-support

A support channel for the SeatGeek Platform

Event, Performer and Venue Scores #107

Closed ThomasLatham closed 4 years ago

ThomasLatham commented 4 years ago

In a similar vein to this issue (https://github.com/seatgeek/api-support/issues/7), I am an undergraduate student who is using SeatGeek data to perform analyses for my senior thesis, and I am very curious about how the score values are calculated for these categories. I have two related questions with regard to scores:

1.) What factors are considered in calculating score? If that is restricted information, could you please tell me whether the potential model predictors I've outlined below are already used to calculate scores?

2.) For what reasons would score values be null? I'm trying to categorize my null-containing columns in terms of MCAR, MAR and MNAR, and it would be incredibly useful to have insight from the subject-matter experts.

Edit: This rests on a modeling assumption I've made, namely that 0-valued scores are null. In the score distributions across the dataset, the values followed a roughly normal curve between about 0.2 and 0.9, but there was also a large spike at exactly 0. This led me to believe that data points with a score of 0 did not actually have meaningful scores (otherwise, by my reasoning, they would fall within the curve). I could be totally wrong in this assumption, however.
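To make the zero-as-null assumption concrete, here is a minimal sketch in plain Python. It recodes exact zeros as missing and counts the remaining scores per bin so the spike at 0 can be separated from the rest of the distribution. The sample values and the function names `zeros_to_missing` and `histogram` are hypothetical and only for illustration; nothing here reflects how SeatGeek itself treats a score of 0.

```python
from collections import Counter
from typing import Optional


def zeros_to_missing(scores: list[float]) -> list[Optional[float]]:
    """Recode exact-zero scores as missing (None).

    This encodes the assumption made in this thread, not a documented API rule.
    """
    return [None if s == 0 else s for s in scores]


def histogram(scores: list[Optional[float]], bin_width: float = 0.1) -> Counter:
    """Count non-missing scores per bin, keyed by integer bin index.

    Bin index k covers roughly [k * bin_width, (k + 1) * bin_width).
    """
    counts: Counter = Counter()
    for s in scores:
        if s is not None:
            counts[int(s / bin_width)] += 1
    return counts


# Hypothetical sample values, for illustration only.
sample = [0.0, 0.0, 0.0, 0.35, 0.42, 0.48, 0.51, 0.55, 0.63, 0.71]

cleaned = zeros_to_missing(sample)
missing_share = cleaned.count(None) / len(cleaned)
bins = histogram(cleaned)

print(f"share treated as missing: {missing_share:.0%}")
for index in sorted(bins):
    print(f"bin starting at {index * 0.1:.1f}: {bins[index]}")
```

Keeping the recoding in its own function makes it easy to rerun the analysis both ways (zeros as real values versus zeros as missing) and see how much the conclusions depend on the assumption.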

To explain why I'm asking these questions, I want to give a brief explanation of my project so far. My main research question, as of now, is: "Can we predict the popularity of concerts in the US and Canada?" I'm essentially taking event score (for around 20,000 distinct concerts I scraped back in May) as my response variable and seeing which other properties of a concert make for good predictors. Such predictors I'm considering for the model (I'm still in the EDA) include:

I've realized, however, that these very criteria (especially performer and venue score) may have already been factors in the original calculation of concert score. If this is the case, then I feel like the results of my model would be trivial. If this isn't the case -- and I can keep the columns as potential predictors in my model -- then it would be very useful for me to know if certain scores are more likely to be null than others, and why that would be.
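One rough way to check whether certain scores are more likely to be null than others is to compare missing rates per column, and then to see whether missingness in one column depends on missingness in another. The sketch below does this in plain Python. The column names (`event_score`, `performer_score`, `venue_score`) and the sample rows are hypothetical, and a score of 0 is counted as missing under the assumption described above. A large gap between the two conditional rates would suggest the data are not MCAR, though it cannot by itself distinguish MAR from MNAR.

```python
from typing import Optional

Row = dict[str, Optional[float]]

# Hypothetical column names; the scraped dataset's real names may differ.
SCORE_COLUMNS = ("event_score", "performer_score", "venue_score")


def is_missing(value: Optional[float]) -> bool:
    """Missing means None or exactly 0 (the working assumption in this thread)."""
    return value is None or value == 0


def missing_rates(rows: list[Row]) -> dict[str, float]:
    """Share of rows in which each score column is missing."""
    return {
        col: sum(is_missing(r.get(col)) for r in rows) / len(rows)
        for col in SCORE_COLUMNS
    }


def missing_rate_given(
    rows: list[Row], target: str, condition_col: str, condition_missing: bool
) -> Optional[float]:
    """Missing rate of `target` among rows where `condition_col` is
    missing (condition_missing=True) or present (condition_missing=False).

    Returns None when no rows satisfy the condition.
    """
    subset = [
        r for r in rows if is_missing(r.get(condition_col)) == condition_missing
    ]
    if not subset:
        return None
    return sum(is_missing(r.get(target)) for r in subset) / len(subset)


# Hypothetical rows, for illustration only.
rows: list[Row] = [
    {"event_score": 0.5, "performer_score": 0.6, "venue_score": 0.7},
    {"event_score": 0.0, "performer_score": 0.0, "venue_score": 0.4},
    {"event_score": 0.0, "performer_score": None, "venue_score": 0.5},
    {"event_score": 0.3, "performer_score": 0.2, "venue_score": 0.0},
]

rates = missing_rates(rows)
when_performer_missing = missing_rate_given(
    rows, "event_score", "performer_score", True
)
when_performer_present = missing_rate_given(
    rows, "event_score", "performer_score", False
)

print(rates)
print(when_performer_missing, when_performer_present)
```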

Thank you very much for any insight you can offer me.

skritch commented 4 years ago

Hi Thomas -

The "score" field in our API, as mentioned in our api docs, represents:

estimated sales volume on the secondary ticket market

This is very much an estimate, and you are correct to guess that it is derived from a model based on factors like those you've listed, but I can't share the exact features or structure of the model. This applies to the event, venue, and performer score fields, as well as moving_score and popularity (the latter two are undocumented).

Unfortunately, I would recommend not attempting to predict our score field directly, as it is only a rough estimate of secondary-market sales volume, with biases of its own, and is obfuscated in the API response. The purpose of this field is primarily to sort events reasonably within tight geographic and time-window bounds, rather than to be very representative of actual secondary market demand. Therefore I would recommend targeting some other proxy for concert popularity that is more directly reflective of demand.

ThomasLatham commented 4 years ago

Hi Sam -

Thank you very much for your response; what you have told me will certainly influence the direction and results of my project.