macss-modeling / General-Questions

A repo to post questions about code, data, etc.

Not able to obtain median of 39.5 in Python #1

Closed k-partha closed 3 years ago

k-partha commented 3 years ago

I am unable to obtain the median value listed, probably because of differences in how R and Python treat NA values. After recoding '__NA__' entries as NA, Python's dropna() drops the entire table. If '__NA__' is not treated as missing, 1147 observations remain. Setting a threshold of 590 non-missing values in dropna() leaves 82 observations, but their median is not 39.5, and moving the threshold one step up or down yields the wrong number of observations. On closer inspection, also treating '998' as an NA value yields 88 observations, but even this subset does not have a median of 39.5. (Note that treating 998 as an NA value is necessary because it appears in ftobama.) The discrepancy seems to come down to small differences in how Python and R handle NA values. How should I proceed from here?
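For what it's worth, the recoding-then-dropping step described above can be sketched in pandas roughly as follows. This is a minimal sketch on toy data; the column names and values are hypothetical stand-ins for the survey file, and the sentinel codes ('__NA__', 998) are the ones discussed in this thread.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; column names and values are hypothetical.
df = pd.DataFrame({
    "ftobama": [50, 998, 72, 39.5],
    "ftpolice": ["__NA__", 60, 88, 41],
})

# Treat both the string sentinel '__NA__' and the numeric code 998 as missing.
df = df.replace({"__NA__": np.nan, 998: np.nan})

# Drop every row that still contains a missing value.
clean = df.dropna()
```

The key point is that the sentinel values must be converted to NaN before dropna() is called; otherwise pandas sees '__NA__' and 998 as ordinary data and keeps those rows.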

pdwaggoner commented 3 years ago

Check your email; this has been addressed. Regardless of differences between R and Python, omitting every row with a missing value should leave roughly 100 observations. So if the canned functions aren't handling NAs well, I'd recommend a different approach to make sure you clean the entire data set (prior to creating a narrower set containing only the relevant features). This matters because it simulates what working with limited, poor-quality data is like in the real world.

FranciscoRMendes commented 3 years ago

I have done the exact same process outlined above in R and I get the exact same results. Running the code chunk below:

dat = dat[rowSums(is.na(dat)) == 0, ]
min(rowSums(dat == '__NA__'))

gives an output of 3, meaning each row has at least 3 '__NA__' values.
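The same diagnostic can be run in pandas to see whether every row carries at least one sentinel. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame (hypothetical); '__NA__' marks a skipped item, as in the thread.
df = pd.DataFrame({
    "a": ["__NA__", 1, "__NA__"],
    "b": ["__NA__", "__NA__", 2],
    "c": ["__NA__", "__NA__", "__NA__"],
})

# Python analogue of min(rowSums(dat == '__NA__')): the smallest per-row
# count of '__NA__' markers. A value above zero means every row has one,
# so dropping all rows with any '__NA__' would empty the data set.
min_markers = int((df == "__NA__").sum(axis=1).min())
```

If this minimum is positive, row-wise dropping alone can never work and the all-sentinel columns need to be handled first.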

pdwaggoner commented 3 years ago

Give this a try

library(tidyverse)
library(here)

anes <- read_csv(here("data", "anes_pilot_2016.csv")) %>% 
  drop_na()

FranciscoRMendes commented 3 years ago

Dear Professor

It still drops the entire data set; I think each row has at least 3 '__NA__' values:

min(rowSums(anes == '__NA__'))

gives a value of 3.

Not sure what is going on here. I am using base R commands.

pdwaggoner commented 3 years ago

Short of answering the question for you, and beyond the direction and code I've shared, my best advice is to proceed with the question based on what you have, whether that's a smaller or larger data set than what I have described here and in the problem set, and then justify your selections and process. It's better to have something than nothing. So keep working with it, try some different options (e.g., hunt for different indicators of what may reflect a missing case, such as 998, '__NA__', NA, '.', and so on), and do your best.
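Hunting for candidate missing-value indicators can be done mechanically before deciding what to recode. A minimal pandas sketch, on hypothetical data, using the sentinel list suggested above:

```python
import pandas as pd

# Hypothetical data; the sentinel list mirrors the indicators suggested above.
df = pd.DataFrame({
    "age": [34, 998, 51],
    "party": ["D", ".", "__NA__"],
})

sentinels = [998, "__NA__", "NA", "."]

# Count how often each candidate missing-value code appears anywhere
# in the frame before deciding which ones to recode as NaN.
counts = {s: int(df.isin([s]).sum().sum()) for s in sentinels}
```

Codes with nonzero counts are the ones worth inspecting and, where appropriate, converting to NaN before dropping rows.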

FranciscoRMendes commented 3 years ago

Dear Professor

I have found the issue: read_csv from the tidyverse makes some arbitrary assumptions when parsing and drops a few columns entirely, resulting in the <100 observations obtained. I do not think this is right, as the raw data is quite different, but yes, I will proceed in this direction.

Thanks Franco

bowen-w-zheng commented 3 years ago

Hi Professor, I have the same problem as Francisco. One issue is that there are two columns that consist entirely of '__NA__'. If we convert those to NA and then drop rows containing NA, we drop all cases. The columns can be found with:

data %>% select_if(function(col) sum(col == "__NA__") == nrow(data)) %>% names()
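The fix this implies, dropping the all-sentinel columns before dropping rows, can be sketched in pandas as follows. The frame and the column name 'skipped_item' are hypothetical, standing in for the two columns described above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'skipped_item' is '__NA__' for every respondent,
# so dropping rows with any NA would discard the whole data set.
df = pd.DataFrame({
    "ftobama": [40.0, 39.0, np.nan],
    "skipped_item": ["__NA__", "__NA__", "__NA__"],
})

df = df.replace("__NA__", np.nan)

# Drop columns that are entirely missing first, then rows with any NA left.
clean = df.dropna(axis=1, how="all").dropna()
```

Ordering matters here: dropping rows first would remove every case, while dropping the empty columns first leaves the usable observations intact.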

pdwaggoner commented 3 years ago

Hi all - see my response to the bigger issue at play here, over in #2.