nan in datamatrix

matarhaller commented 8 years ago

the datamatrix has nans in it, which breaks PCA. I'm not completely sure why they are there, but do you think it's reasonable to just replace nans with 0?

juanshishido commented 8 years ago

I think I know what might be going on. Concatenating NaNs with non-NaNs results in NaNs (somewhat related (at about 3:40)). I'm working on a fix now.
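For illustration, a minimal sketch of that behavior, assuming the essays are string columns in a pandas DataFrame. The column names and sample text are made up for the example and are not taken from the project:

```python
import numpy as np
import pandas as pd

# Hypothetical essay columns; names and values are illustrative only.
df = pd.DataFrame({
    "essay0": ["I like hiking.", np.nan],
    "essay1": ["Data nerd.", "Coffee first."],
})

# Element-wise string concatenation: a NaN on either side yields NaN,
# so the second person loses the essay they did write.
combined = df["essay0"] + " " + df["essay1"]

# Filling missing essays with empty strings first keeps the partial text.
combined_filled = df["essay0"].fillna("") + " " + df["essay1"].fillna("")

print(combined.isna().tolist())
print(combined_filled.tolist())
```

Whether to fill or to drop those rows is a separate decision; this only shows where the NaNs can come from.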

juanshishido commented 8 years ago

Maybe.

juanshishido commented 8 years ago

@matarhaller What was the code you had for checking NaNs? I have the data_matrix object and want to check it.

juanshishido commented 8 years ago

Got it :sweat_smile:

>>> np.isnan(data_matrix.todense()).sum()
0

:+1:
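Since data_matrix has a .todense() method, it is presumably a SciPy sparse matrix (an assumption; the thread does not say). If so, a possible alternative is to inspect only the stored values, which avoids building the full dense array. The small matrix below is a stand-in, not the real data:

```python
import numpy as np
from scipy import sparse

# Stand-in for data_matrix; the real one is far larger.
dense = np.array([[0.0, 1.5],
                  [np.nan, 0.0]])
data_matrix = sparse.csr_matrix(dense)

# Implicit zeros cannot be NaN, so checking the stored entries is enough.
nan_count_sparse = int(np.isnan(data_matrix.data).sum())

# Same answer as the dense check used above, at a higher memory cost.
nan_count_dense = int(np.isnan(data_matrix.todense()).sum())

print(nan_count_sparse, nan_count_dense)
```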

matarhaller commented 8 years ago

just to check if anything is nan: np.isnan(datamatrix).any()

or you can do np.where(np.isnan(datamatrix)) to figure out exactly where the nans are
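A small self-contained sketch of both checks on a toy array (the values are invented for the example):

```python
import numpy as np

# Toy stand-in for datamatrix with two NaNs.
datamatrix = np.array([[1.0, np.nan, 3.0],
                       [4.0, 5.0, np.nan]])

# True if at least one entry is NaN.
has_nan = bool(np.isnan(datamatrix).any())

# np.where returns one index array per axis; zip them into (row, col) pairs.
rows, cols = np.where(np.isnan(datamatrix))
nan_positions = list(zip(rows.tolist(), cols.tolist()))

print(has_nan)
print(nan_positions)
```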

matarhaller commented 8 years ago

@juanshishido You're too speedy!

juanshishido commented 8 years ago

Thanks!

The shape of the matrix is now: (57822, 3429). I don't remember the original dimensions, but it's good now.

I created a new notebook for this in a new branch. I think it might be better to just modify the original. What do you all think?

juanshishido commented 8 years ago

What was happening was that some people did not fill out any essays, so their TotalEssays values were blank. I am returning this instead: return df[df.TotalEssays.str.len() > 0]. I also found that some of those "empty" TotalEssays had a length greater than 0 because they contained only whitespace. So I also added this: .apply(lambda x: re.sub(r'\s+', ' ', x).strip()).
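Put together, the two steps might look roughly like this. It is a sketch on made-up rows, not the project's actual function; only the TotalEssays column name comes from the comment above, and the regex is written as a raw string so the backslash reaches the regex engine unchanged:

```python
import re
import pandas as pd

# Made-up rows: one normal essay, one whitespace-only, one empty, one normal.
df = pd.DataFrame({"TotalEssays": ["I love  hiking\n", "   \n\t", "", "Coffee."]})

# Collapse runs of whitespace to a single space and trim the ends,
# so whitespace-only entries become empty strings.
cleaned = df.TotalEssays.apply(lambda x: re.sub(r"\s+", " ", x).strip())
df = df.assign(TotalEssays=cleaned)

# Keep only people who wrote something.
nonempty = df[df.TotalEssays.str.len() > 0]

print(nonempty.TotalEssays.tolist())
```

The cleaning has to happen before the length filter; otherwise the whitespace-only rows pass the len() > 0 test.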

juanshishido commented 8 years ago

A question that stems (NLP joke) from this is, do we want to only use individuals who filled something out for all essays or are partial responses okay (of course, no responses aren't useful)?

matarhaller commented 8 years ago

Good point. Since we have so much data, I'm okay with dropping people that didn't answer all the essays.
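One way this could be done, assuming each essay lives in its own column. The column names essay0 and essay1 and the rows are hypothetical placeholders:

```python
import numpy as np
import pandas as pd

# Hypothetical per-essay columns; names and data are illustrative only.
essay_cols = ["essay0", "essay1"]
df = pd.DataFrame({
    "essay0": ["Hi", np.nan, "  "],
    "essay1": ["Yo", "Hey", "Sup"],
})

# Trim whitespace so that whitespace-only answers count as unanswered.
stripped = df[essay_cols].apply(lambda col: col.str.strip())

# A person is kept only if every essay is present and non-empty.
answered_all = stripped.notna().all(axis=1) & (stripped != "").all(axis=1)
complete = df[answered_all]

print(complete.index.tolist())
```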

jnaras commented 8 years ago

Oh, okay! Sounds good. Happy to drop people who didn't answer and happy to convert to .py files.

juanshishido commented 8 years ago

Great! We'll have to make sure to add that in.

juanshishido commented 8 years ago

4b9355f36e07ca1d49d9f74d156d351ea402f109 fixes this.

juanshishido commented 8 years ago

Decided to move the conversation of NaNs we were having in #7 here.

@jnaras Everything ran, and I confirmed that np.isnan(data_matrix.todense()).sum() == 0.

With 5fd38b069f9bf5756b6af12a57069f8d61044cf8, I rearranged the imports slightly (and removed the ones we were not using), removed the print statements in filter_vocab and create_data_matrix, added whitespace to the list comprehensions in generate_freqdists and filter_vocab, and changed the formatting for the "Calculating PMI Features" cell.

Thank you!

juanshishido commented 8 years ago

Also, the pickled data is good :+1:

matarhaller commented 8 years ago

So is master fully updated?

juanshishido commented 8 years ago

@matarhaller Yeah. It says jaya is 3 commits ahead of master, but that's because of how I updated master: I fetched the jaya branch to get Calculate PMI features.ipynb, updated it, and pushed to master.

juanshishido / okcupid

nan in datamatrix #4