theCrag / website

theCrag.com: Add your voice and help guide the development of the world's largest collaborative rock climbing & bouldering platform
https://www.thecrag.com/

Finalise cpr params before release #2622

Closed: scd closed this issue 7 years ago

scd commented 7 years ago

We need to make sure the cpr params are finalised before release. @brendanheywood could you please tick these off if you are happy with them

Once the above is finalised

brendanheywood commented 7 years ago

I'm mostly happy. I still feel there is something not quite right with the black cpr line in the graph, in many cases it seems to float way above where my gut says it should be.

The whole point of this has been to create a performance grade which matches the level which you can consistently red/pink point. So the intention was to tune the grade shift back so that on average you needed to have done around 3 red points at grade X before you would be considered a solid grade X climber and have a performance rating which matches that grade.

So in theory if someone is ticking using all of the various tick types then their black cpr curve should be a fraction lower than their pink point curve (let's assume people are using pink when they should be).

Lee's matches this mostly ok:

image

Vanessa's does not:

image

https://dev.thecrag.com/ascents/aggregate/graph-sport-cpr/with-route-gear-style/sport/by/vwills/?embed=chunk&width=1200&height=800

In her case her cpr has been as high as the upper half of 28, despite having only ever red pointed a single 7b+ (== 26). It's been pushed up higher because she is onsighting almost to the same level as her red pointing. Maybe this means she is in theory capable of red pointing harder, but I think it's more likely we should interpret it to mean she has better onsighting skills than most climbers. We could use this to argue this means the onsight tick shift is too high.

I know you ran all the stats to find the tick shifts and I don't really want to open that again, but I do want to communicate it really clearly. I think the best way to do that is in the CPR article actually showing the distribution curves and then saying "we picked the Xth percentile as the final tick shift" as well as the final exact shift we are using shown on the graph. How hard would it be to just hard code those distribution curve values somewhere and provide it as template data?

Maybe for round 2 it would be nice to create new stats for each account which calculates the average gap between each of the curves for say the last 5 years. So Lee's onsight-redpoint delta would be around 2.5 and Vanessa's around 1.5. It would be interesting to see those numbers directly as well as graphically on a profile.

And then ultimately we could run an aggregate stats process on top of these account stats in order to come up with the distribution across all climbers so the template data is real and not hard coded, and which we could then also use directly to tune the algorithm instead of doing it all outside the system.
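A minimal sketch of the per-account stat described above, assuming each cpr curve is available as a mapping from time slice (a year here) to a numeric grade. The function name, the data layout, and the sample numbers are all hypothetical and not taken from theCrag's code base.

```python
from statistics import mean


def average_curve_gap(upper, lower, last_n_slices=5):
    """Average gap (upper - lower) over the most recent shared time slices.

    `upper` and `lower` map a time slice (e.g. a year) to a grade.
    Only slices present in both curves are compared.
    Returns None when the curves share no slices.
    """
    shared = sorted(set(upper) & set(lower))[-last_n_slices:]
    if not shared:
        return None
    return mean(upper[t] - lower[t] for t in shared)


# Hypothetical curves for one account, in Ewbank-style grade numbers.
redpoint = {2013: 25.0, 2014: 25.5, 2015: 26.0, 2016: 26.0, 2017: 26.5}
onsight = {2013: 23.0, 2014: 23.0, 2015: 23.5, 2016: 23.0, 2017: 24.0}

print(average_curve_gap(redpoint, onsight))  # 2.5
```

Restricting the comparison to slices present in both curves avoids comparing a real value against a missing one, at the cost of ignoring years where only one tick type was logged.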

scd commented 7 years ago

It was very hard to get tick shift stats and I had to make a lot of assumptions. One of the big issues was that at lower grades (eg below 21) people mostly onsighted rather than red pointed. So the shifts were grade dependent. I ended up dimensioning for harder climbers. To be honest as soon as I got something that looked half decent I stopped the analysis so we could move on.

I actually don't think it is really grade dependent, but a desire and accessibility bias at lower grades.

After seeing the world graphs I think this validates that the black line is about right for harder climbers.

I fully expected to re-open this discussion once we had this graph to visualise.

How about we ask Vanessa about her red pointing, she just might not care about it.

I just did not care about red pointing. Out of the hundreds of climbs I did, I was onsighting up to grade 24 but virtually no red points. In my case it would have been great to know that it was time for me to red point a few routes to push my grades, so I would have loved to have seen the floating black line.

What I want to do is an average tick shift for all grades across all accounts based on all the cpr curves. This will eliminate assumptions and I think is far more robust. For every account get the cpr graph, traverse the time slices and average tick shifts at each internal grade level.

brendanheywood commented 7 years ago

After seeing the world graphs I think this validates that the black line is about right for harder climbers.

Yeah that may be true but I really want to avoid it only being useful to harder climbers. That was one of the starting fundamentals.

How about we ask Vanessa about her red pointing, she just might not care about it.

She has ~1000 red and pink point ascents, it's good data, on par with Neil and Lee etc.

I was onsighting up to grade 24 but virtually no red points.

Purely for my curiosity: I don't quite understand what you mean by this. Does this mean that if you fell on a route, then you'd clean it and never attempt it again on purpose? Or does it mean you were never pushing your grade so you didn't fall that much?

What I want to do is an average tick shift for all grades across all accounts based on all the cpr curves. This will eliminate assumptions and I think is far more robust. For every account get the cpr graph, traverse the time slices and average tick shifts at each internal grade level.

This sounds like exactly what I suggested for round 2. Are you suggesting you want to do it now? I think we should launch as-is as soon as possible and clearly state that our algorithms will definitely change and be improved upon over time. I think people will accept this as a given in the same way that everything else from google rankings to tinder matches will get better and improve over time.

scd commented 7 years ago

Yeah that may be true but I really want to avoid it only being useful to harder climbers

But ranking is more important for harder climbers. I actually still think it is useful for lower grade climbers. It would have been super useful for me.

Maybe even at the harder grades we want red point to be the dominant tick type, so having a tick shift one grade lower than the stats suggest is a reasonable bias.

Does this mean that if you fell on a route, then you'd clean it and never attempt it again on purpose? Or does it mean you were never pushing your grade so you didn't fall that much?

Yes I did fall, so it is a bit of an exaggeration. But if I did not get it the same day I would not come back. This means my red points were no harder than my onsights. Working a route just was not important to me. I never attempted anything I did not think I could onsight.

This sounds like exactly what I suggested for round 2. Are you suggesting you want to do it now?

If I have to invest a couple of hours of time then I will do it this way.

A while ago I put some graphs up in your first power ranking google doc. If any of these are ok then we can use them for the article.

If we want to change the tick shifts then we need a basis for the change. All the grunt work for the algorithm is in the cpr graphs, so I think it is just a matter of iterating through all accounts. A couple of hours of dev time and probably more hours of processing time.

I don't think we can answer some of these questions properly without re-doing this work.

Drazhar commented 7 years ago

Hey, this is slightly off topic, but I saw that it's already possible to sort by the performance rating and all time performance rating. I found it a little strange, that a 6c+ onsight gets a higher rating than a 7b pinkpoint in sport climbing. I know that pinkpoint isn't used that much in sport climbing, but with pre clipped express slings it's slightly easier than if I had to clip those in.

For me, a 7b pinkpoint in a sports route is harder than a 7a onsight. And really only slightly easier than a 7b redpoint. This is different at trad routes where you also have to place other gear.

Just something to think about.

rouletout commented 7 years ago

I actually think that Vanessa’s case is perfect and confirms what we expected - if she is only on-sighting she could - according to our tick shifts and most likely in real life - climb 2 to 3 grades harder if projecting. That’s what the CPR graph will tell her - if you care, start projecting to push your limits, if not, continue on-sighting ;-)

scd commented 7 years ago

I agree with Ulf's assessment, however I would think that most people will think that they could not red point 3 grades above their onsighting grade and therefore disagree.

I also agree with Brendan that the floating black line does not look good.

My feeling is that for global ranking acceptance we have to take into account what the elite are doing, but for every day climber acceptance we have to meet their perceptions.

I am going to take another look at the stats. I am pretty sure that the stats will say that the tick shift is grade dependent. I am also going to make an assessment on how hard it would be to implement a variable tick shift based on grade difficulty. Because we store the tick shift on the ascent I think this should be fairly easy.

Regardless I am concerned that my stats analysis has some flawed assumptions for bouldering and pink points. If we don't re-look at the analysis then we have to be VERY CLEAR that the tick shifts are preliminary and designed to get feedback.

brendanheywood commented 7 years ago

if she is only on-sighting she could - according to our tick shifts and most likely in real life - climb 2 to 3 grades harder if projecting.

Yes but this is the exact opposite of what it is actually telling us. Vanessa logs heaps of red points and pink points as well as plenty of dogs and ghosts. She is attempting plenty of hard stuff. Vanessa is the most prolific ticker on the platform, and her stats are sort of an anomaly. I think it's a big mistake to assume her stats are wrong.

brendanheywood commented 7 years ago

Ok well Vanessa's email has helped clear that up for her stats. I'm pretty comfortable with the 3 parameters as is. I still think we should do the extra stats per account as discussed above but I don't think it's a blocking issue for the release and we can defer it. We should still make it clear that anything is open for tuning in future and always will be, and indeed we encourage feedback into it all.

For me, a 7b pinkpoint in a sports route is harder than a 7a onsight. And really only slightly easier than a 7b redpoint. This is different at trad routes where you also have to place other gear.

@Drazhar - as we finish up this work we will be making an article which exposes all the shifts used to the stats. btw here is your sports graph which suggests a gap of ~2 french grades between onsight and red point, and another 1-2 grades on top of that for pink point. This would suggest ~3 grades which is fairly consistent with what we've seen. Does that match your expectations?

image

scd commented 7 years ago

I have done the stats, because hooking into the cpr timeline framework was just too tempting and I really felt like our first version of stats had way too many assumptions.

Charts: tick-shifts-by-grade-boulder, tick-shifts-by-grade-sport, tick-shifts-by-grade-trad

brendanheywood commented 7 years ago

I'm not really sure how to interpret any of those, and I don't know what you've done behind the scenes to produce it so can't comment at all on general validity. How did you come up with all this and is there some raw data we can see to play with? Is there now a set of statistics for cpr curve gaps for every account?

scd commented 7 years ago

Raw data is here: https://docs.google.com/spreadsheets/d/1KMIujqnSWPx3EFtARgVayzgA4KQEU-XWIvFZUuQd5Y0/edit#gid=1031742143

script file is: script/test-tick-shift-using-cpr-timeline

The script runs through every account and gets the data for each of the cpr timeline curves. For each tick type grade the script gets the tick shift then averages across all accounts.

There are no tick shift statistics for a particular account, it is the average of all accounts.
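As a rough illustration of the averaging step described above (not the actual script), the sketch below assumes the per-account tick shifts have already been extracted from the cpr timeline curves as `(tick_type, internal_grade, shift)` observations. The function name, data layout, and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_tick_shifts(accounts):
    """Average tick shift per (tick type, internal grade) over all accounts.

    `accounts` maps an account name to a list of
    (tick_type, internal_grade, shift) observations for that account.
    """
    buckets = defaultdict(list)
    for observations in accounts.values():
        for tick_type, grade, shift in observations:
            buckets[(tick_type, grade)].append(shift)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical observations in internal grades.
accounts = {
    "climber_a": [("onsight", 300, 30), ("onsight", 360, 40)],
    "climber_b": [("onsight", 300, 20), ("flash", 300, 10)],
}

shifts = average_tick_shifts(accounts)
print(shifts[("onsight", 300)])
```

Keeping the grade in the bucket key is what lets any grade dependency show up in the result. Pooling observations like this weights every observation equally, so an account that contributes many observations to one bucket would count for more than an account that contributes one.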

If we do do fixed shifts for this version then clearly they are a bit arbitrary. Here is my new proposed shifts if we are doing fixed.

Trad shifts

Sport shifts

Boulder shifts

scd commented 7 years ago

@brendanheywood are you ok with going with a decay somewhere half way between the two decay ranges we tested (0.5 to 1 ewbanks per year)? We felt that 0.5 was not enough, and feedback from elite climbers was that 1 ewbanks was too much.
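To make the three candidate rates concrete, here is a small sketch that assumes the decay is linear in time. The thread only gives the rates, so the linear form, the function name, and the example grade are assumptions.

```python
def decayed_grade(grade, years_since_ascent, decay_per_year=0.75):
    """Linearly decay an Ewbank grade by `decay_per_year` for each year elapsed."""
    return grade - decay_per_year * years_since_ascent


# A hypothetical grade 26 ascent, two years old, under each candidate rate.
for rate in (0.5, 0.75, 1.0):
    print(rate, decayed_grade(26, 2, rate))
```

Under this assumption the half-way rate of 0.75 would leave that ascent worth 24.5 after two years, against 25.0 and 24.0 for the two tested rates.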

Drazhar commented 7 years ago

At sport I would recommend giving pink point the same amount of points as red point, or only really slightly less, like -1.

And at bouldering I would also make no difference (or only 1 point) between onsight and flash. I know that there is quite a difference if I try a boulder after seeing someone doing it or get to an unknown boulder. But who really differentiates between this?

In my opinion this will only cause people to change their ticking behavior if they know that they will get more points if they tick a slightly different ascent type.

scd commented 7 years ago

@Drazhar do you realize that the above shifts are all in internal grades? One internal grade is equivalent to about one fifteenth of a normal grade.
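A quick conversion sketch using the approximate ratio quoted above. The constant and function names are made up for illustration, and the ratio is only "about" one fifteenth, so treat the results as rough.

```python
INTERNAL_PER_GRADE = 15  # approximate ratio quoted in the comment above


def internal_to_grades(internal_shift):
    """Convert a shift in internal grades to (approximate) normal grades."""
    return internal_shift / INTERNAL_PER_GRADE


# A shift of -1 internal grade is a small fraction of a normal grade,
# while -30 internal grades is roughly two normal grades.
print(round(internal_to_grades(-1), 3))
print(internal_to_grades(-30))
```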

In my opinion this will only cause people to change their ticking behavior if they know that they will get more points if they tick a slightly different ascent type

We want to motivate people to tick the correct grades. I don't think many people will lie, but they still may class a boulder as an onsight if they go straight after their mate. I think that is fine, as the analysis is based on what people are actually ticking. Over time this will work itself out.

We want to base our shifts on statistics rather than how we feel. The issue we need to grapple with is that tick shifts clearly have a grade dependency. If we put in this grade dependency then the difference between onsights and red points almost disappears at lower grades, which is pretty much what you are suggesting. We are in two minds about doing this, and I would much rather do a fixed shift first and change to a variable shift rather than the other way around as this is more likely to give us robust feedback IMO.

Drazhar commented 7 years ago

I'm not completely sure how the internal grades are calculated, but I'm aware that you have to do this because of the variety of different grading systems.

My concerns are not that people will tick the wrong grades to get more points, even if we see this at 8a.nu also, where everybody will say of some routes that they are too easy for the grade, but nobody will downgrade them because of points. My concern is rather that you lose the variety of tick types over time. For example "Action Directe" (9a) in Frankenjura. Everybody who red points it has pre placed gear and most of the guys even have the rope preclipped in the first bolt. It's completely fine that this counts as a red point, but if someone would normally tick it as a pink point, he will get less points, even if he is doing the same thing.

But let's see what happens and how many climbers are really looking at the cpr and comparing them to others.

scd commented 7 years ago

I think these are secondary concerns, for the simple reason that people go to 8a for ranking not theCrag. Our focus is with the personal rating, and the new cpr system is way more powerful for your personal rating if we base our system on real statistics.

Everybody who red points it has pre placed gear and most of the guys even have the rope preclipped in the first bolt. It's completely fine that this counts as a red point, but if someone would normally tick it as a pink point, he will get less points, even if he is doing the same thing

Most people who did this at the lower grade ranges, or on trad routes would say this is a pink point. I know that there is a different community opinion for hard sport routes. This is apparent in the statistics, where pink pointing tick shifts pretty much become red point tick shifts at higher grades.

If you understand the implications of variable tick shifts your example would mean that a red point and pink point of this route would get the same points.

We are not releasing something that is set in stone. It is highly likely that we will make significant changes in the first 12 months.

brendanheywood commented 7 years ago

For each tick type grade the script gets the tick shift then averages across all accounts.

This is the bit that really worries me and where it will be very easy to lie to ourselves. What types of accounts are included? Do we have thresholds of minimum ticks etc like we did originally? Are you only using the final cpr's including decay for right now, or are you calculating based on the raw tick data without any. I think the best way to validate this is by seeing the individual shifts that we are calculating for each user in order to make sure they are correct before we aggregate them in any way.

test-tick-shift-using-cpr-timeline - where exactly is this stored?

@Drazhar - yes we are aware specifically of the issues with red vs pink, there is actually a very large cultural factor here too between different countries. There is no perfect approach, either way we are penalizing someone.

scd commented 7 years ago

This is the bit that really worries me and where it will be very easy to lie to ourselves. What types of accounts are included? Do we have thresholds of minimum ticks etc like we did originally? Are you only using the final cpr's including decay for right now, or are you calculating based on the raw tick data without any.

It is more likely that we are lying to ourselves with the old analysis. The graphs above are broadly in line with the old analysis, in particular the grade dependency was there in the old analysis. The old analysis is not robust and I had to make lots of assumptions to get anything that looked vaguely reasonable. So if this worries you we should definitely not release without resolving.

Yes I do use decayed cpr in the comparison.

Problems with the old analysis

I am very happy to have whatever filter we jointly decide for the new analysis, but I am a little confused that you even agree with the methodology and projecting onto the filters. If this is the case then I strongly recommend that we need to abandon the existing tick shifts and start again with an agreed methodology.

The charts explain some of the dependencies we have noticed in various time lines.

I am going to run the new stats on my system and compare some charts.

This is officially a blocking issue as we don't have agreement.

I think the best way to validate this is by seeing the individual shifts that we are calculating for each user in order to make sure they are correct before we aggregate them in any way.

Individual tick shifts will have the same grade dependency. Look at Lee's account, at lower grades his tick shift is different to higher grades.

test-tick-shift-using-cpr-timeline - where exactly is this stored?

It is stored in CodeBase/Custom/Scripts and is installed in /opt/CIDS/scripts. But alas the old dev system does not install this from repo.

brendanheywood commented 7 years ago

It is more likely that we are lying to ourselves with the old analysis.

I agree the old analysis was very rough, and I also want to reiterate that I was, and am still, happy with it being rough and shipping as is and tuning it progressively later.

I'm not making any judgment yet around the new stats, I'm still just trying to understand what I'm even looking at and exactly how it was derived. Even now reading that script code I'm still finding it pretty hard to visualize this. I'd find this much easier to understand and validate if I could look at ten real charts for 10 accounts and see what the algorithm comes out with as the tick curve gaps for that particular user. I just want lots of simple concrete examples on demand so we can sanity check and spot check as needed. And in particular use spot tests to help decide which accounts we include and exclude from the stats we use to derive the final params we choose.

eg I just want to see real raw data for Lee and Vanessa and everyone else we've already been testing with, preferably in the platform where it is really easy to see.

I am very happy to have whatever filter we jointly decide for the new analysis, but I am a little confused that you even agree with the methodology and projecting onto the filters. If this is the case then I strongly recommend that we need to abandon the existing tick shifts and start again with an agreed methodology.

I'm not following any of that comment, but I think you may be reading too much into my comments above.

So despite still being largely in the dark data wise, I think we should filter down the accounts we use:

Lastly when we are actually visually comparing any two tick types, I think using an average will obscure a lot of the information, especially while we are still sorting out the details of accounts that are included or not. I think it would be better to use a scatter plot with say red point on one axis and onsight on another, and every point represents one account. Maybe even use larger dots to represent accounts with more ticks which are more reliable data. If the data does turn out to be fairly clean and well correlated then we should be able to clearly overlay multiple tick types onto the same plot in different colors with little overlap and that would be gold in the cpr article.

If we are lucky we will end up with something like this showing each tick type as a well correlated band with hopefully little diffusion into the other bands and hopefully a very strong and mostly linear correlation.

image
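A minimal sketch of how the data behind such a scatter plot could be prepared, assuming each account has already been reduced to one red point grade, one onsight grade, and a tick count. The function names, data layout, dot-size scaling, and sample accounts are all hypothetical. The correlation is a plain Pearson coefficient, used here only as a quick check of how linear the band is.

```python
from math import sqrt


def scatter_points(accounts, min_size=10.0, size_per_tick=0.5):
    """One (x, y, size) point per account: red point grade vs onsight grade.

    `accounts` maps a name to (redpoint_grade, onsight_grade, tick_count).
    Dot size grows with tick count so better-sampled accounts stand out.
    """
    return [
        (redpoint, onsight, min_size + size_per_tick * ticks)
        for redpoint, onsight, ticks in accounts.values()
    ]


def pearson(points):
    """Pearson correlation between the x and y coordinates of the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical accounts: (red point grade, onsight grade, number of ticks).
sample_accounts = {
    "a": (20, 18, 100),
    "b": (24, 22, 400),
    "c": (28, 26, 1000),
}
points = scatter_points(sample_accounts)
print(points[0])
print(pearson(points))
```

The resulting `(x, y, size)` tuples can be handed to any plotting library. One scatter series per tick type would give the overlaid bands described above.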

The ideal end game is a new type of aggregate chart on the climbers facet. If we are going down this rabbit hole now then we may as well do it properly. eg

https://www.thecrag.com/climbing/australia/climbers/aggregate/graph-sport-scatter-plot/

This would also make it super easy to look for different patterns in the scatter plot across countries, gender, age etc (hopefully to rule them out as significant factors)

scd commented 7 years ago

I'd find this much easier to understand and validate if I could look at ten real charts for 10 accounts and see what the algorithm comes out with as the tick curve gaps for that particular user

Good idea and will do. Actually it would be interesting to see if tick shift is a characteristic of the person, which I think you are suggesting it might be. If it is a characteristic of the person then I think there is something way more fundamental going on here, because I think the stronger climbers have a bigger tick shift.

I think I want to get to the stage of selecting accounts for stats, but do not need to get to the point of doing the scatter diagram proper. My main aim was to get confidence in the original tick shifts.

brendanheywood commented 7 years ago

Actually it would be interesting to see if tick shift is a characteristic of the person, which I think you are suggesting it might be.

I think there will be some sort of theoretical value we can assign to each climber, maybe we could call it 'tenacity' (maybe bloody mindedness?). And this is how much they are willing and able to work something until they can tick it. After talking to Vanessa I have a suspicion that the tick shift across accounts will be less variable than I originally expected, but instead it's people's 'tenacity' that gets them closer to filling the gap or not. It seems Vanessa and you are in a similar group where you just didn't care about working things, so you'd have a very low tenacity and you'd expect your red point to be only marginally above onsight. JJ jumps to mind as someone on the opposite end of the spectrum, and his curves are 4-5 or even 6 grades apart. I'm surprised I haven't looked at his chart already:

image

https://dev.thecrag.com/ascents/aggregate/graph-sport-cpr/with-route-gear-style/sport/by/jjobrien/?embed=html&width=1000&height=600

BUT the big problem though is that we need attempts data or dog data in order to help infer that, and we can't rely on this at all. It's one of the fundamental axioms of accepting partial data and there isn't much we can do about it.

If it is a characteristic of the person then I think there is something way more fundamental going on here, because I think the stronger climbers have a bigger tick shift.

I think the truth of that will simply be that grade shift will just end up being a function of grade, that makes complete sense to me, and the data should shortly confirm or deny that.

scd commented 7 years ago

@brendanheywood I have been working on the CPR article. The section solidifying a new grade is essentially the finalGradeShift. In the google doc version, we say that it requires 3 ascents to solidify a grade. Mathematically this corresponds to a shift of -9.5 internal grades, so we should use this. Note that the article now deduces the 3 from our power shift configuration.

Are you fine with this?

There may be some downstream issues.
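The thread does not spell out the power shift configuration, so the following is only one model that happens to reproduce the quoted figure. It assumes that ascents are combined by summing a "power" value that doubles every 6 internal grades. Both the summing model and the doubling interval of 6 are assumptions made for this sketch. Under them, N equal ascents at grade X sum to the power of a single ascent at X + 6·log2(N), so the shift that brings three ascents back to X is -6·log2(3).

```python
from math import log2

# Assumed for illustration only: internal grades per doubling of "power".
DOUBLING_INTERVAL = 6


def final_grade_shift(ascents_to_solidify, doubling_interval=DOUBLING_INTERVAL):
    """Shift (in internal grades) that cancels the bonus from N equal ascents."""
    return -doubling_interval * log2(ascents_to_solidify)


print(round(final_grade_shift(3), 1))  # -9.5
```

With a single ascent the shift is zero, and requiring more ascents to solidify a grade makes the shift more negative.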

brendanheywood commented 7 years ago

Yes this is cool with me

scd commented 7 years ago

@brendanheywood , I did some further tick shift analysis work. As you suggested I have got some example individual accounts to show what the tick shift by internal grade algorithm comes up with. Hopefully this will give us both more confidence of the analysis.

I have put my first attempt at this in the google doc here:

https://docs.google.com/spreadsheets/d/1XUE4BzHTwG32koXNpv6H5zTxGNNAM4xrml4WCcK3eRY/edit#gid=441772954

Note that I have not spent much time prettifying the graphs so hopefully they are clear enough.

It revealed a grade dependency in most accounts, but as you commented we could be lying to ourselves by not correctly filtering.

So I re-did this with more filtering (the criteria is listed in the google doc) here:

https://docs.google.com/spreadsheets/d/1hzfDxDQ-EwIT2LSnXODFBmFoA8cYlOgPWgjDD5nZBpE/edit#gid=764114784

Please have a look at the criteria, to make sure you think we have done enough filtering to give us confidence in the results. I think this shows a clear grade dependency overall.

So I have redone this for all accounts here:

https://docs.google.com/spreadsheets/d/1uOnI-uOgguIEKtzAPDlpFZJYfh5ZSmLFVLKOMK6s4GM/edit#gid=1259981018

Without looking at the exact figures the conclusions are similar to before. In particular onsight tick shift is grade dependent. If this is true then it is flawed to assign a single tick shift for a particular user for a scatter plot diagram.

Do we need more filter criteria? It is listed in the spreadsheet.

Is there another reason why my methodology may be flawed? Note that the basic premise of the methodology is that we want to minimise the 'float' area in the graph for the average climber.

Is there a way we can modify Brendan's scatter plot idea taking into account grade dependency?

If @brendanheywood is ok with the filter criteria and the methodology then I am happy to draw conclusions on this analysis for the release. I will do a spin off article with some graphs with a big notice that this is preliminary analysis. The tick shifts will be fixed for this release. Later we can make them grade dependent and update the article with Brendan's scatter plot idea.

brendanheywood commented 7 years ago

This all seems reasonably valid to me. The only filter off the top of my head which could improve things is making sure there are fairly regular ticks which bump up the cpr curves so we are comparing two real cpr's rather than 1 cpr vs a decayed cpr.

I think for the release though this doesn't really change anything other than putting more importance on being clear that the algorithms will be refined on an ongoing basis?
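One way the "fairly regular ticks" filter could be expressed, as a sketch only. It assumes tick dates are given as fractional years and treats a curve as regular when no gap inside the comparison window exceeds a threshold. The function name and the one-year default threshold are hypothetical.

```python
def has_regular_ticks(tick_years, start, end, max_gap_years=1.0):
    """True when no gap between ticks (or the window edges) exceeds max_gap_years.

    `tick_years` holds tick dates as fractional years, e.g. 2016.5.
    Ticks outside the [start, end] window are ignored.
    """
    inside = sorted(t for t in tick_years if start <= t <= end)
    marks = [start] + inside + [end]
    return all(b - a <= max_gap_years for a, b in zip(marks, marks[1:]))


# Regularly refreshed curve vs one that is mostly decay after a single tick.
print(has_regular_ticks([2015.2, 2015.9, 2016.5], 2015.0, 2017.0))  # True
print(has_regular_ticks([2015.2], 2015.0, 2017.0))                  # False
```

Applying this to both curves of a tick-type pair before computing their gap would keep only the slices where each cpr is backed by recent ticks.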

scd commented 7 years ago

I am closing as stats are now being rebuilt on prod.

Drazhar commented 4 years ago

Hey, this topic is long closed, but I feel my question fits here the best and opening a new one would probably be overkill.

Are you still tweaking the parameters for the CPR? I have the feeling that the decay should be higher than right now. I'm not sure about the elite climbers, but for everybody I know the CPR should decay faster to represent the current level.

Also is it written down anywhere how the CPR is calculated exactly? A friend of mine (Ulfi you should know Martin) still favors 8a.nu because its "CPR" is a lot simpler. It only counts the 10 hardest ascents in the past year. In contrast TheCrag only does some magic in the background.

Best regards, Philip

rouletout commented 4 years ago

@Drazhar thanks for your comment. CPR was presented at the IRCRA conference in 2018 and published. We also state in these publications that the topic of decay is probably the one with most room for improvement. We talked to several trainers and experts in the field and decided to leave it as is as there is no fact based evidence to change it for now.

As for Martin, please tell him that because something is simpler it is not necessarily better or right ;-). He is an engineer and he should know.