Closed Dazzalytics closed 2 years ago
There are a few things to consider here:
distinct() which is removing rows with the same match_id, innings, and ball info. This seems to occur when there are errors in the data. The effect of distinct() is to take the first such row, but that is not necessarily the correct row. I would be reluctant to modify the downloaded data in this way unless we are sure that the deleted rows are actually incorrect. At the very least, a warning should be generated when rows are deleted, with some information about why.
If we can sort out these issues, I'd be happy to add this as an optional cleaning step in the fetch_cricsheet() function, perhaps with clean=FALSE by default so existing code doesn't break.
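To illustrate the warning suggested above, here is a minimal base-R sketch. The helper name drop_duplicate_balls() is hypothetical and not part of the package, and the column names match_id, innings, and ball are taken from the description above. It keeps the first row for each key, as distinct() is described as doing, but reports what it dropped.

```r
# Hypothetical helper (not part of the package): drop rows that repeat the
# same match_id / innings / ball key, and warn about what was removed.
drop_duplicate_balls <- function(df, keys = c("match_id", "innings", "ball")) {
  # TRUE for every row whose key has already appeared earlier in df
  dup <- duplicated(df[, keys, drop = FALSE])
  if (any(dup)) {
    affected <- unique(df$match_id[dup])
    warning(
      sum(dup), " row(s) removed as duplicates of ",
      paste(keys, collapse = ", "),
      "; affected match_id: ", paste(affected, collapse = ", "),
      call. = FALSE
    )
  }
  # Keep the first occurrence of each key
  df[!dup, , drop = FALSE]
}

# Made-up example data: the second row repeats the key of the first
balls <- data.frame(
  match_id = c(1, 1, 2),
  innings  = c(1, 1, 1),
  ball     = c("0.1", "0.1", "0.1"),
  runs     = c(4, 0, 1)
)
cleaned <- suppressWarnings(drop_duplicate_balls(balls))
```

The warning only says which matches were affected. Deciding which of the duplicated rows is actually correct would still need a manual check against the source data.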
Thank you for the feedback.
Since the attached code was for T20 matches only, I went with the known breakdown of overs 1-6 (powerplay), 7-15 (middle), and 16-20 (death). I sort of like that we can squeeze in the "super over" under the phase as well. However, we can drop the phase variable for now.
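For reference, the over breakdown described above could be sketched roughly as follows. The function name t20_phase() is hypothetical, the over number is assumed to be 1-based (1 to 20), and treating any innings beyond the second as a super over is an assumption for illustration, not something confirmed by the attached code.

```r
# Hypothetical helper: label each delivery with its T20 phase.
# Assumes `over` runs from 1 to 20 and that innings 3 and above are super overs.
t20_phase <- function(over, innings = 1L) {
  phase <- as.character(cut(
    over,
    breaks = c(0, 6, 15, 20),   # (0,6], (6,15], (15,20]
    labels = c("powerplay", "middle", "death")
  ))
  phase[innings > 2] <- "super over"
  phase
}

phases <- t20_phase(c(1, 6, 7, 15, 16, 20))
```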
The ball variable has numeric values in the format x.y, where x is the over and y is the raw ball number within the over (including extra balls). Upon further investigation, x.1 and x.10 are both being read as the number x.1, so the tenth raw ball of an over cannot be told apart from the first. For example, the IPL data has 18 instances of an over with 10 raw balls, and the PSL data has 5. I am going to try to fix this on my end and notify Cricsheet about the issue.
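One possible way around this is to keep the ball column as text and split it on the dot, so that "0.10" stays distinct from "0.1". The sketch below is only an illustration; split_ball() is a hypothetical name, and it assumes the column can be read as character in the first place (for instance via the colClasses argument of read.csv()).

```r
# The problem: as numbers, the first and tenth ball of an over are identical.
same_as_numeric <- as.numeric("0.10") == as.numeric("0.1")

# Hypothetical helper: split a character ball label "x.y" into integer parts.
split_ball <- function(ball) {
  parts <- strsplit(as.character(ball), ".", fixed = TRUE)
  data.frame(
    over         = as.integer(vapply(parts, `[`, character(1), 1)),
    ball_in_over = as.integer(vapply(parts, `[`, character(1), 2))
  )
}

parsed <- split_ball(c("0.1", "0.10", "19.6"))
```

With the two parts stored as integers, the tenth raw ball of an over sorts after the ninth and no longer collides with the first.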
I will update the variable names to snake_case.
Please find the updated code file attached, addressing the issues discussed above. Please note that it only cleans T20 data from Cricsheet (all T20 competitions).
Thanks. I've made a few changes and added it. I decided not to add the clean argument as this clearly fixes some errors with the data. Let me know if there are any problems with the updated function.
Also, please add yourself as a contributor to the package. I didn't know what name to include.
Hello. The T20 ball-by-ball data from Cricsheet is a little raw. Right now, the ball-by-ball data doesn't contain, among other things,
I have attached a file containing a (cleaning) function that addresses the above-mentioned points, and a bit more. Note that this is only for T20 data. However, it can easily be updated for ODIs and Tests.
I am not sure whether to add this cleaning function to the fetch_cricsheet function or to keep it separate. What do you folks think?
Cleaning Cricsheet T20 Data.zip