Closed aronwc closed 9 years ago
Age brackets:
under 18, 18-24, 25-34, 35-44, 45+
For gender:
In cell [10] of the referenced age notebook above, the n_alive column tells us, e.g., how many people named Violet born in 1909 are still alive (8.6). If we divide that by the sum of the n_alive column, this will tell us the fraction of people named Violet who are 106 years old (2015-1909).
We need to compute this fraction for each name, for the age brackets defined above.
Once we do this for each user in a group (e.g., all users who tweeted positively about ecigs), we can take the average to get the age distribution of these users.
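A minimal sketch of the steps above, not the notebook's actual code. It assumes the n_alive column for one name is available as a plain mapping from birth year to estimated number alive. The function names, the toy counts, and the second name are all hypothetical; 2015 is the reference year used in the Violet example.

```python
# Age brackets from this issue, as (label, min_age, max_age) with inclusive bounds.
BRACKETS = [
    ("under 18", 0, 17),
    ("18-24", 18, 24),
    ("25-34", 25, 34),
    ("35-44", 35, 44),
    ("45+", 45, None),
]


def bracket_for(age):
    """Return the label of the bracket that contains `age`."""
    for label, lo, hi in BRACKETS:
        if age >= lo and (hi is None or age <= hi):
            return label
    raise ValueError(f"no bracket for age {age}")


def name_age_distribution(n_alive_by_year, current_year=2015):
    """Fraction of living people with one name that falls in each bracket.

    `n_alive_by_year` maps birth year -> estimated number still alive
    (assumed layout of the n_alive column for a single name).
    """
    total = sum(n_alive_by_year.values())
    if total <= 0:
        raise ValueError("n_alive must sum to a positive number")
    dist = {label: 0.0 for label, _, _ in BRACKETS}
    for year, n_alive in n_alive_by_year.items():
        dist[bracket_for(current_year - year)] += n_alive / total
    return dist


def group_age_distribution(first_names, n_alive_by_name, current_year=2015):
    """Average the per-name distributions over the users in a group.

    Users whose first name is not in `n_alive_by_name` are skipped.
    """
    dists = [name_age_distribution(n_alive_by_name[name], current_year)
             for name in first_names if name in n_alive_by_name]
    if not dists:
        raise ValueError("no user in the group has a known first name")
    return {label: sum(d[label] for d in dists) / len(dists)
            for label, _, _ in BRACKETS}


# Made-up toy counts, only to show the shape of the computation.
toy = {
    "Violet": {1909: 10.0, 1990: 30.0},
    "Jayden": {2005: 75.0, 1985: 25.0},
}
print(group_age_distribution(["Violet", "Jayden", "Jayden"], toy))
```

Each user contributes one distribution, so a name shared by many users in the group counts once per user, which is what averaging over users implies.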
Plot below shows the distribution of genders for positive and negative tweets.
We have 744 different first names, and for each name we have a .pkl file with a pandas table like the figure below.
[image: image] https://cloud.githubusercontent.com/assets/8547396/8815018/28b72ee8-2fd9-11e5-8904-4a430aec7c89.png
Each row represents a bracket, and each column holds the sum over the years in that bracket:
B1: under 18
B2: 18-24
B3: 25-34
B4: 35-44
B5: 45+
Fraction is defined as: (d['n_alive']/d['n_alive'].sum())*100, i.e. a percentage. E.g., this tells us the percentage of people named Violet who are 106 years old (2015-1909).
I computed the fraction and took the average by month, considering just positive tweets and unique users (744). The plot is below. Is that what you explained on Monday?
[image: image] https://cloud.githubusercontent.com/assets/8547396/8815193/65f2ff6a-2fdb-11e5-9fe4-9dc50cdd7bd8.png
Yes, this is what I had in mind. A couple things that surprise me:
— Reply to this email directly or view it on GitHub https://github.com/tapilab/ecig-classify/issues/10#issuecomment-123516740 .
744 is the set of names which are not classified as "unknown gender".
Considering unique names (just first name):
Overall tweets we have:
I see. So you're using the same set of names we used to infer gender to infer age?
No, it is different. At first I thought it was related to the set of names from the gender inference (because we have exactly 744 names), but it is not related.
To infer gender, the data is from these links:
males_url = 'http://www2.census.gov/topics/genealogy/1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/1990surnames/dist.female.first'
To infer age, the data is from: http://www.ssa.gov/oact/babynames/names.zip
Hmm...this seems odd. I count around 20k unique names in http://www.ssa.gov/oact/babynames/names.zip. It is surprising to me that only 744 match user profiles in our data. Can you please double check?
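One way to run this double check, as a rough sketch only. It assumes the archive holds one `yobYYYY.txt` file per birth year with `name,sex,count` rows; the function names and the toy archive built in memory are hypothetical. Lowercasing and trimming both sides rules out case or whitespace differences as a reason for missed matches.

```python
import io
import zipfile


def unique_names(zip_source):
    """Collect distinct first names (lowercased) from every yob*.txt member.

    `zip_source` is a path or a binary file-like object. The assumed row
    layout is name,sex,count; only the first field is used.
    """
    names = set()
    with zipfile.ZipFile(zip_source) as zf:
        for member in zf.namelist():
            if not (member.startswith("yob") and member.endswith(".txt")):
                continue  # skip anything that is not a per-year data file
            with zf.open(member) as fh:
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    name = line.split(",", 1)[0].strip()
                    if name:
                        names.add(name.lower())
    return names


def matched_names(user_first_names, known_names):
    """Return the user first names (normalized) that appear in `known_names`."""
    return {name.strip().lower() for name in user_first_names} & known_names


# Toy archive with invented counts, standing in for names.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("yob1909.txt", "Violet,F,100\nJohn,M,200\n")
    zf.writestr("yob1910.txt", "Violet,F,90\nMary,F,300\n")
buf.seek(0)

known = unique_names(buf)
print(len(known), sorted(matched_names(["VIOLET", "mary ", "Zzyzx"], known)))
```

Against the real data, `unique_names("names.zip")` would give the total to compare with the roughly 20k figure, and `len(matched_names(...))` the number of user names that match.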
Yes, for me it seems odd too. I just restarted my laptop, I am going to double check that.
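The plot below uses the correct number of names now. As you can see, it didn't change much. I also used slightly different brackets, but it is basically the same as this one. Do you want to use other brackets or eliminate one?
[image: download] https://cloud.githubusercontent.com/assets/8547396/8879952/b06e7544-31fa-11e5-8d3c-17e59c41c582.png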
Was one of the brackets 45-54? (ignoring > 54)
The last bracket is 45+. I will change the brackets and run again without 54+.
I have added a new bracket (B6) for 54+ and at the time of plotting I just plotted B1 to B5.
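Positive tweets: [image: name1] https://cloud.githubusercontent.com/assets/8547396/8886110/50006d3a-322b-11e5-81a4-87827d0c3f1f.png
Negative tweets: [image: names_neg] https://cloud.githubusercontent.com/assets/8547396/8889881/fe03ea8e-32b1-11e5-98e6-eedeb6f318ea.png
Positive + negative tweets: [image: names_pos_neg] https://cloud.githubusercontent.com/assets/8547396/8889882/070b88b2-32b2-11e5-9d9b-5b601c726ef0.png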
Great. It looks like the absolute age estimates are not terribly accurate. We should instead focus on the differences in the estimates, which should be more reliable.
We can do this by plotting the trend of differences in age estimates. E.g., if March has 30% <18 and April has 40% < 18, then the plotted value for April would be 10% (40-30). Thus, each value represents the absolute change in that age bracket in the past month. (This also means the first value will not be plotted, since there is no prior month to compare to).
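A small sketch of that difference computation, assuming the monthly bracket percentages are already available as an ordered list. The function name is hypothetical, and the "18+" bracket is invented only so the toy shares sum to 100%.

```python
def monthly_differences(shares_by_month):
    """Percentage-point change per bracket relative to the preceding month.

    `shares_by_month` is a list of (month_label, {bracket: percent}) pairs in
    chronological order. The first month is dropped because it has no prior
    month to compare to.
    """
    out = []
    for (_, prev), (month, cur) in zip(shares_by_month, shares_by_month[1:]):
        out.append((month, {bracket: cur[bracket] - prev[bracket] for bracket in cur}))
    return out


shares = [
    ("March", {"under 18": 30.0, "18+": 70.0}),
    ("April", {"under 18": 40.0, "18+": 60.0}),
]
print(monthly_differences(shares))
# [('April', {'under 18': 10.0, '18+': -10.0})]
```

If the monthly shares live in a pandas DataFrame indexed by month, `DataFrame.diff()` computes the same thing and leaves the first row as NaN.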
-Aron
Something like that?
Something doesn't look quite right. E.g., in your prior graph, 18-24 decreases each of the first 4 months, which suggests that its value should be negative in the first 3 months of the new plot.
I am sorry, this graph was done with negative tweets. Now, I am running for positive ones.
My prior statement seems to also hold for negative tweets...
Yes, sorry, it was my mistake in the code.
Now, these are for positive
Looks better! My remaining concern: it looks like in Jun 2013 all age bracket percentages increased, which shouldn't really happen, since some bracket must go down for the others to go up. I assume this is because we've eliminated the >54 bracket. Can you please add it back in?
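This kind of inconsistency can also be caught mechanically. If the brackets cover every user, the shares sum to 100% each month, so the month-over-month changes must cancel out. A hedged sketch, with a hypothetical function name and invented numbers:

```python
def unbalanced_months(diffs_by_month, tol=1e-6):
    """Return months whose bracket differences do not sum to (about) zero.

    `diffs_by_month` is a list of (month_label, {bracket: percentage_point_change}).
    """
    return [month for month, diffs in diffs_by_month
            if abs(sum(diffs.values())) > tol]


# Invented numbers: the second month omits one bracket, so its changes do not cancel.
diffs = [
    ("month A", {"under 18": 2.0, "18-44": 1.0, "45+": -3.0}),
    ("month B", {"under 18": 2.0, "18-44": 1.0}),
]
print(unbalanced_months(diffs))  # ['month B']
```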
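Considering > 54 for positive: [image: pos_45] https://cloud.githubusercontent.com/assets/8547396/8912835/e6512dca-345a-11e5-81bb-cf024946ca9b.png
Difference: [image: pos_45 diff] https://cloud.githubusercontent.com/assets/8547396/8912834/e64b6a7a-345a-11e5-9f52-490c88abf5ed.png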
Looks great. Last thing (I hope!): The variation of 45+ looks large because its absolute values are large (e.g., ~50% versus ~10% for 18-24). Instead, let's plot the % change. E.g.
Sep: 30%, Oct: 40%, so % change = (.40 - .30) / .30 = +.33
Sep: 40%, Oct: 30%, so % change = (.30 - .40) / .40 = -.25
Note that the denominator is the value from the preceding month.
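The two examples above, as a short sketch. The function names are hypothetical; the denominator is the preceding month's value, as noted.

```python
def percent_change(previous, current):
    """Relative change from `previous` to `current`, divided by `previous`."""
    if previous == 0:
        raise ValueError("percent change is undefined when the preceding value is 0")
    return (current - previous) / previous


def percent_changes(values):
    """Percent change for each value after the first, in chronological order."""
    return [percent_change(prev, cur) for prev, cur in zip(values, values[1:])]


print(round(percent_change(0.30, 0.40), 2))  # 0.33
print(round(percent_change(0.40, 0.30), 2))  # -0.25
```

With a pandas DataFrame of monthly shares, `DataFrame.pct_change()` gives the same ratio per column.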
Good!
[image: pos_45 diffnew] https://cloud.githubusercontent.com/assets/8547396/8913365/49a0dcf6-345e-11e5-966d-15ec82c16043.png
Negative considering > 54
http://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/
https://github.com/samsieber/us-actuarial
Using first name of the "real_name" field, infer age and gender distributions for each month, restricting to positive tweets.