Infer gender/age of positive tweets per month.

aronwc commented 9 years ago

http://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/
https://github.com/samsieber/us-actuarial

Using first name of the "real_name" field, infer age and gender distributions for each month, restricting to positive tweets.

aronwc commented 9 years ago

Age brackets:

- under 18
- 18-24
- 25-34
- 35-44
- 45+

aronwc commented 9 years ago

Look here: https://github.com/ramnathv/agebyname_py/blob/master/index.ipynb

aronwc commented 9 years ago

For gender:

- Take the top 233 male names and the top 525 female names from the census (these numbers are chosen because they cover 75% of the population).
- Remove names that appear on both lists.
- Use these lists to classify each user in our ecig data by gender.
  - Also split by users who express positive/negative sentiment.
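
A minimal sketch of the steps above, not the project's actual code. The tiny name lists are made up and stand in for the top 233 male / top 525 female census names; `build_gender_lookup` and `classify_gender` are hypothetical helper names, and the first whitespace-separated token of `real_name` is assumed to be the first name.

```python
def build_gender_lookup(male_names, female_names):
    """Map each unambiguous first name to 'male' or 'female'."""
    male = {n.lower() for n in male_names}
    female = {n.lower() for n in female_names}
    ambiguous = male & female  # names that appear on both lists are dropped
    lookup = {n: "male" for n in male - ambiguous}
    lookup.update({n: "female" for n in female - ambiguous})
    return lookup


def classify_gender(real_name, lookup):
    """Classify a profile by the first token of its real_name field."""
    tokens = real_name.strip().split()
    if not tokens:
        return "unknown"
    return lookup.get(tokens[0].lower(), "unknown")


# Made-up lists; "Jordan" is on both and is therefore removed.
lookup = build_gender_lookup(["James", "John", "Jordan"], ["Mary", "Linda", "Jordan"])
labels = [classify_gender(n, lookup) for n in ["Mary Smith", "john doe", "Jordan Lee", ""]]
```

Anything that is not on exactly one list falls through to "unknown", so the same function can be applied separately to the positive and negative user groups.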

aronwc commented 9 years ago

In cell [10] of the referenced age notebook above, the n_alive column tells us, e.g., how many people named Violet born in 1909 are still alive (8.6). If we divide that by the sum of the n_alive column, this will tell us the fraction of people named Violet who are 106 years old (2015-1909).

We need to compute this fraction for each name, for the age brackets defined above.

Once we do this for each user in a group (e.g., all users who tweeted positively about ecigs), we can take the average to get the age distribution of these users.
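
The computation described above could be sketched as follows. This is an illustration under assumptions: the `n_alive` numbers are invented, `bracket_fractions` and `group_age_distribution` are hypothetical names, and the reference year is taken to be 2015 as in the Violet example.

```python
# (label, min age, max age); the upper bound of the last bracket is arbitrary.
BRACKETS = [
    ("under 18", 0, 17),
    ("18-24", 18, 24),
    ("25-34", 25, 34),
    ("35-44", 35, 44),
    ("45+", 45, 200),
]


def bracket_fractions(n_alive_by_year, current_year=2015):
    """Fraction of living people with one name that falls in each age bracket."""
    total = sum(n_alive_by_year.values())
    fractions = {label: 0.0 for label, _, _ in BRACKETS}
    if total == 0:
        return fractions
    for birth_year, n_alive in n_alive_by_year.items():
        age = current_year - birth_year
        for label, lo, hi in BRACKETS:
            if lo <= age <= hi:
                fractions[label] += n_alive / total
                break
    return fractions


def group_age_distribution(first_names, n_alive_by_name, current_year=2015):
    """Average the per-name bracket fractions over all users in a group."""
    per_user = [
        bracket_fractions(n_alive_by_name[name], current_year)
        for name in first_names
        if name in n_alive_by_name  # users whose name is not in the data are skipped
    ]
    if not per_user:
        return {}
    return {
        label: sum(f[label] for f in per_user) / len(per_user)
        for label, _, _ in BRACKETS
    }


# Invented n_alive values keyed by birth year.
n_alive_by_name = {
    "violet": {1909: 10.0, 1990: 30.0, 2005: 60.0},
    "aiden": {2000: 50.0, 2010: 50.0},
}
distribution = group_age_distribution(["violet", "aiden", "zzz"], n_alive_by_name)
```

Because every user contributes a distribution that sums to 1, the group average also sums to 1 across brackets.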

ElaineResende commented 9 years ago

Plot below shows the distribution of genders for positive and negative tweets.

[image: genders by sentiment]

ElaineResende commented 9 years ago

We have 744 different first names, and for each name we have a .pkl file with a pandas table like the figure below.

[image: image] https://cloud.githubusercontent.com/assets/8547396/8815018/28b72ee8-2fd9-11e5-8904-4a430aec7c89.png

Each row represents an age bracket, and each value is the sum over the birth years that fall in that bracket:

- B1: under 18
- B2: 18-24
- B3: 25-34
- B4: 35-44
- B5: 45+

The fraction is defined as (d['n_alive']/d['n_alive'].sum())*100, i.e. it is expressed as a percentage. E.g., this tells us the fraction of people named Violet who are 106 years old (2015-1909).

I computed the fraction and averaged it by month, considering only positive tweets and unique users (744). The plot is below. Is that what you explained on Monday?

[image: image] https://cloud.githubusercontent.com/assets/8547396/8815193/65f2ff6a-2fdb-11e5-9fe4-9dc50cdd7bd8.png

aronwc commented 9 years ago

Yes, this is what I had in mind. A couple things that surprise me:

Why are there so few unique names? (744 out of hundreds of thousands of users?)
Why is the 45+ crowd so large? Perhaps because this bracket is the biggest? We should think about how to make the bracket sizes more comparable.

ElaineResende commented 9 years ago

744 is the set of names which are not classified as "unknown gender".

Considering unique names (just first name):

- Tweets without an identified gender: 178096 unique names
- Tweets with an identified gender: 744 unique names

Overall tweets we have:

259016 tweets gender != unknown
733617 tweets had the gender identified

aronwc commented 9 years ago

I see. So you're using the same set of names we used to infer gender to infer age?

ElaineResende commented 9 years ago

No, it is different. At first I thought it was related to the set of names from the gender inference (because we have exactly 744 names), but it is not related.

To infer gender, the data is from these links:
males_url = 'http://www2.census.gov/topics/genealogy/1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/1990surnames/dist.female.first'

To infer age, the data is from: http://www.ssa.gov/oact/babynames/names.zip

aronwc commented 9 years ago

Hmm...this seems odd. I count around 20k unique names in http://www.ssa.gov/oact/babynames/names.zip. It is surprising to me that only 744 match user profiles in our data. Can you please double check?

ElaineResende commented 9 years ago

Yes, it seems odd to me too. I just restarted my laptop and am going to double check that.

ElaineResende commented 9 years ago

The plot below now uses the correct number of names. As you can see, it didn't change much. I also used slightly different brackets, but it is basically the same as the previous one. Do you want to use other brackets or to eliminate one?

[image: download] https://cloud.githubusercontent.com/assets/8547396/8879952/b06e7544-31fa-11e5-8d3c-17e59c41c582.png

aronwc commented 9 years ago

Was one of the brackets 45-54? (ignoring > 54)

ElaineResende commented 9 years ago

The last bracket is 45+. I will change the brackets and run again without 54+.

ElaineResende commented 9 years ago

I have added a new bracket (B6) for 54+ and at the time of plotting I just plotted B1 to B5.

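Positive tweets: [image: name1] https://cloud.githubusercontent.com/assets/8547396/8886110/50006d3a-322b-11e5-81a4-87827d0c3f1f.png

Negative tweets: [image: names_neg] https://cloud.githubusercontent.com/assets/8547396/8889881/fe03ea8e-32b1-11e5-98e6-eedeb6f318ea.png

Positive + negative tweets: [image: names_pos_neg] https://cloud.githubusercontent.com/assets/8547396/8889882/070b88b2-32b2-11e5-9d9b-5b601c726ef0.png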

aronwc commented 9 years ago

Great. It looks like the absolute age estimates are not terribly accurate. We should instead focus on the differences in the estimates, which should be more reliable.

We can do this by plotting the trend of differences in age estimates. E.g., if March has 30% <18 and April has 40% <18, then the plotted value for April would be 10 percentage points (40-30). Thus, each value represents the absolute change in that age bracket over the past month. (This also means the first value will not be plotted, since there is no prior month to compare to.)
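
A small sketch of that differencing step, using the March/April numbers from the example plus an invented May value. `monthly_differences` is a hypothetical helper; with a pandas Series the same result should be available through `Series.diff()`.

```python
def monthly_differences(values):
    """Absolute change from each month to the next; the first month has no value."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


months = ["Mar", "Apr", "May"]
under_18 = [30.0, 40.0, 35.0]  # percent of users in the <18 bracket (May is made up)

diffs = monthly_differences(under_18)
# Pair each difference with the later month, so March is not plotted.
labelled = list(zip(months[1:], diffs))
```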

ElaineResende commented 9 years ago

Something like that?

[image: age_diff]

aronwc commented 9 years ago

Something doesn't look quite right. E.g., in your prior graph, 18-24 decreases each of the first 4 months, which suggests that its value should be negative in the first 3 months of the new plot.

ElaineResende commented 9 years ago

I am sorry, this graph was done with negative tweets. Now, I am running for positive ones.

aronwc commented 9 years ago

My prior statement seems to also hold for negative tweets...

ElaineResende commented 9 years ago

Yes, sorry, it was my mistake in the code.

Now, these are for positive tweets: [image: age_diff]

aronwc commented 9 years ago

Looks better! My remaining concern: it looks like in Jun 2013 all age bracket percentages increased, which shouldn't really happen, since some bracket must go down for the others to go up. I assume this is because we've eliminated the >54 bracket. Can you please add it back in?
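
To illustrate the point with invented numbers (the bracket labels 45-54 and >54 are assumed from the earlier comments): when all brackets are present and each month sums to 100%, the month-over-month differences must sum to zero, and dropping one bracket breaks that.

```python
# Made-up percentages; each month sums to 100 across all six brackets.
may = {"under 18": 10.0, "18-24": 15.0, "25-34": 20.0, "35-44": 15.0, "45-54": 15.0, ">54": 25.0}
jun = {"under 18": 11.0, "18-24": 16.0, "25-34": 21.0, "35-44": 16.0, "45-54": 16.0, ">54": 20.0}

diff_all = {b: jun[b] - may[b] for b in may}
diff_without_oldest = {b: d for b, d in diff_all.items() if b != ">54"}

sum_all = sum(diff_all.values())                 # zero: gains are offset by the >54 drop
sum_without = sum(diff_without_oldest.values())  # positive: every plotted bracket went up
all_increase = all(d > 0 for d in diff_without_oldest.values())
```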

ElaineResende commented 9 years ago

Considering > 54 for positive:

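[image: pos_45] https://cloud.githubusercontent.com/assets/8547396/8912835/e6512dca-345a-11e5-81bb-cf024946ca9b.png

Difference: [image: pos_45 diff] https://cloud.githubusercontent.com/assets/8547396/8912834/e64b6a7a-345a-11e5-9f52-490c88abf5ed.png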

aronwc commented 9 years ago

Looks great. Last thing (I hope!): The variation of 45+ looks large because its absolute values are large (e.g., ~50% versus ~10% for 18-24). Instead, let's plot the % change. E.g.

- Sep: 30%, Oct: 40%: %change = (.40-.30) / .30 = +.33
- Sep: 40%, Oct: 30%: %change = (.30-.40) / .40 = -.25

Note that the denominator is the value from the preceding month.
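
A minimal sketch of the relative change, reproducing the two examples above. `percent_changes` is a hypothetical helper; for a pandas Series, `Series.pct_change()` should give the same quantity.

```python
def percent_changes(values):
    """Relative change from each month to the next, dividing by the preceding month."""
    changes = []
    for prev, curr in zip(values, values[1:]):
        # A zero in the preceding month makes the ratio undefined.
        changes.append((curr - prev) / prev if prev != 0 else float("nan"))
    return changes


up = percent_changes([0.30, 0.40])    # Sep 30% -> Oct 40%, about +0.33
down = percent_changes([0.40, 0.30])  # Sep 40% -> Oct 30%, about -0.25
```

Dividing by the preceding month puts a large bracket such as 45+ and a small one such as 18-24 on the same scale.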

ElaineResende commented 9 years ago

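[image: pos_45 diffnew] https://cloud.githubusercontent.com/assets/8547396/8913365/49a0dcf6-345e-11e5-966d-15ec82c16043.png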

aronwc commented 9 years ago

Good!

ElaineResende commented 9 years ago

Negative tweets, considering > 54: [image: neg_45 diffnew]
