Closed k-partha closed 3 years ago
Check your email; this has been addressed. Regardless of differences between R and Python, omitting every row with a missing value should leave ~100 observations. So if the canned functions aren't handling NAs correctly, I'd recommend a different approach that ensures you clean the entire data set of missing values (prior to creating a narrower set containing only the relevant features). This matters because the exercise simulates what working with limited, poor-quality data is like in the real world.
I have done the exact same process outlined above in R, and I get the exact same results.
Running the code chunk below

dat = dat[rowSums(is.na(dat)) == 0, ]
min(rowSums(dat == '__NA__'))

gives an output of 3, meaning each remaining row still has at least 3 '__NA__' values.
Give this a try
anes <- read_csv(here("data", "anes_pilot_2016.csv")) %>%
drop_na()
Dear Professor
It still drops the entire data set; I think each row has at least 3 "__NA__" values.
min(rowSums(anes=='__NA__'))
gives a value of 3.
Not sure what is going on here; I am using base R commands.
Short of answering the questions for you, and beyond the direction and code I've already shared, my best advice is to progress with the question based on what you have, whether that's a smaller or larger data set than what I have described here and in the problem set, and then justify your selections and process. It's better to have something than nothing. So keep working with it and trying some different options (e.g., hunt for different indicators of what may reflect a missing case, like 998, __NA__, NA, ".", and so on), and do your best.
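The hunt for multiple missing-case indicators might look like the following sketch in Python/pandas. The sentinel list and the toy column names here are assumptions drawn from the suggestions above, not the codebook's definitive list:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (columns are illustrative).
df = pd.DataFrame({
    "ftobama": ["40", "998", "55", "."],      # 998 is a "don't know"-style code
    "party":   ["__NA__", "Dem", "Rep", "Dem"],
})

# Assumed sentinel values; adjust to whatever the codebook actually uses.
sentinels = ["998", "__NA__", "NA", "."]

# Normalize every sentinel to a true NaN, then drop rows with any NaN.
clean = df.replace(sentinels, np.nan).dropna()
print(len(clean))  # prints 1: only the row with no sentinel anywhere survives
```

The key point is that dropna() only sees genuine NaN values, so every string-coded indicator has to be normalized first.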
Dear Professor
I have found the issue: read_csv from the tidyverse makes some arbitrary assumptions and drops a few columns entirely, resulting in the <100 observations obtained. I do not think this is right, as the raw data are quite different, but yes, I will proceed in this direction.
Thanks Franco
Hi Professor,
I have the same problem as Francisco. One issue is that there are two columns that contain only __NA__. If we just convert those to NA and drop rows containing NA, we will drop all cases.
data %>% select_if(function(col) sum(col == "__NA__") == nrow(data)) %>% names()
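One hedged way around the all-__NA__ columns is to drop fully missing columns before dropping rows, so those two columns don't wipe out every case. A minimal pandas sketch (column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one column is entirely the "__NA__" sentinel, mirroring the
# situation described above.
df = pd.DataFrame({
    "all_missing": ["__NA__"] * 3,
    "age":         ["21", "34", "__NA__"],
})

df = df.replace("__NA__", np.nan)

# Drop columns that are entirely NA first, then drop rows with any NA left.
clean = df.dropna(axis=1, how="all").dropna()
print(list(clean.columns), len(clean))  # prints ['age'] 2
```

The equivalent idea in R would be to select away the columns flagged by the select_if() call above before running drop_na().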
Hi all - see my response to the bigger issue at play here, over in #2.
I am unable to obtain the median value listed, probably because of differences in how R and Python treat NA values. After coding '__NA__' as NA, Python drops the entire table. If it is excluded, 1147 observations remain. After setting a threshold of 590 for the dropna() function, 82 observations remain, but their median is not 39.5. Setting the threshold either one below or one above yields an incorrect number of observations. On closer inspection, excluding '998' from the set of NA values yields 88 observations, but even this data set does not yield a median of 39.5. (Note that treating 998 as an NA value is necessary, as it appears in ftobama.) It appears this discrepancy is probably due to minute differences in how Python and R treat NA values. How should I proceed from here?
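For anyone puzzling over the thresh behavior described above: in pandas, dropna(thresh=k) keeps a row only if it has at least k NON-missing values, so the right threshold depends on the total column count, which will differ if read_csv silently dropped columns. A small illustration (toy data, not ANES):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, np.nan, 4],
    "c": [5, 6, np.nan],
})

# Keep rows with at least 2 non-missing values.
kept = df.dropna(thresh=2)
print(len(kept))  # prints 2: the all-but-one-missing middle row is dropped
```

This is why nudging the threshold by one flips whole groups of rows in or out, and why the R and Python pipelines disagree unless both see the same columns and the same set of NA sentinels.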