jsoma / data-studio-projects


[Project] What Patents Reveal about Facebook, Inc. #203

Open christina10211 opened 6 years ago

christina10211 commented 6 years ago

Pitch

Patents are the new gold for the tech industry, and thus an invisible battlefield among tech giants such as Amazon, Facebook, and Google. However, patents are usually neglected by the public because few people have the interest or time to go through patent files full of jargon and sci-fi-sounding vocabulary. I believe that patents can say a lot about a company, especially its shifting business priorities over time, and even about trends across the whole tech industry.

Summary

For my project, I will collect information on all of Facebook's granted patents, including patent title, number, abstract, filing date, grant date, assignee, and inventors, and then run a text analysis on the titles and abstracts to find how the keywords in the patents shift over time. I will also chart the number of patents Facebook is granted each year (not including pending patents). I might also look into Amazon's patent data and compare the two tech giants.
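A minimal sketch of the per-year count described above, assuming each scraped patent ends up as a dictionary with an ISO-formatted grant date. The field names (`number`, `title`, `granted`) and the sample values are invented for illustration and are not taken from the actual dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical records: field names and values are invented for illustration.
patents = [
    {"number": "0000001", "title": "Example title A", "granted": "2012-03-06"},
    {"number": "0000002", "title": "Example title B", "granted": "2012-11-20"},
    {"number": "0000003", "title": "Example title C", "granted": "2013-01-15"},
]

def grants_per_year(records, date_field="granted"):
    """Count records by the year of an ISO-formatted (YYYY-MM-DD) date field."""
    years = Counter(date.fromisoformat(r[date_field]).year for r in records)
    return dict(sorted(years.items()))

counts = grants_per_year(patents)
for year, n in counts.items():
    print(year, n)
```

Passing `date_field="filed"` (another assumed field name) would give the same count by filing year, which may tell a different story than grant year because of the delay between filing and grant.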

Some ideas on exploring my dataset/bringing in new datasets:

- Comparison between Facebook's patents and Amazon's patents.

Details

Possible headline(s): What Patents Reveal about Facebook, Inc.

Data set(s): https://patents.justia.com/company/facebook?list=patents

Code repository:

Possible problems/fears/questions:

My main challenge will be how to properly apply the text analysis we just learned in algorithms class and how to categorize the patents by analyzing their abstracts and titles. I went over about 10 patents, and they are full of technical jargon and seem to cover quite different fields. For now, the only thing that comes to mind is counting word frequency for each year, but I am worried about how to identify a pattern or trend from the word counts. (I haven't successfully scraped the page yet, so I can't do any further analysis; we will see in a few days how the text analysis goes...)
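One possible way to turn yearly word counts into a trend is to compare each word's share of all words between two years, rather than the raw counts. The sketch below assumes the scraped data is available as `(year, title)` pairs; the sample titles, the tiny stopword list, and the function names are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

# A deliberately tiny stopword list, just for this example.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def word_counts_by_year(rows):
    """rows: iterable of (year, title) pairs -> {year: Counter of words}."""
    by_year = defaultdict(Counter)
    for year, title in rows:
        by_year[year].update(tokenize(title))
    return by_year

def rising_words(by_year, earlier, later, top_n=3):
    """Words whose share of all words grew the most between two years."""
    def shares(counter):
        total = sum(counter.values()) or 1
        return {w: c / total for w, c in counter.items()}
    old, new = shares(by_year[earlier]), shares(by_year[later])
    gains = {w: new[w] - old.get(w, 0.0) for w in new}
    # Sort by largest gain first, then alphabetically to break ties.
    return sorted(gains, key=lambda w: (-gains[w], w))[:top_n]

# Made-up titles, not real patents.
rows = [
    (2010, "Sharing photos in a social network"),
    (2010, "Ranking search results in a social network"),
    (2016, "Rendering video for a virtual reality headset"),
    (2016, "Tracking a virtual reality controller"),
]
by_year = word_counts_by_year(rows)
print(rising_words(by_year, 2010, 2016, top_n=2))
```

Using shares instead of raw counts keeps a year with many more patents from dominating the comparison just because it has more words overall.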

Work so far

For now, I am still working on the scraping part. The page is a little messy, so it took me a while to figure out its structure. Ideally, my keyword analysis will look like this:

inspiration_keywords

Checklist

This checklist must be completed before you submit your draft.

angelareplica commented 6 years ago

Excellent topic! Focusing on Facebook's patents will be really interesting, especially since the company's stock value has recently plummeted. This seems like a challenging project, and I'm looking forward to seeing the analyses and visualizations you come up with.

benbitoun commented 6 years ago

Respect for wanting to do this and for trying to apply what we've just learned (and what I still don't really understand) to a project. Concerning the topic, you said it yourself: "However, patents are usually neglected by the public because few people have the interest and time."

Yup, that's right. And I'm no better. I'd better not tell you our click numbers when we write about tech stuff. People are just not interested in the stuff behind it. So I think one of the many challenges you face is figuring out how you wanna tell this story so that people understand why what you found is important and why they SHOULD care about it and read it. I'm curious about the next steps.

castorsia commented 6 years ago

Amazing choice for sleuthing, and I'm avidly awaiting your findings! If you need a bit of context, you can look up Pando.com, a small, gutsy outlet that covers Silicon Valley.

christina10211 commented 6 years ago

Update

Your project content: images/words/etc

num_patents

I analyzed the patent titles using two methods, word frequency and tf-idf:

screen shot 2018-08-01 at 14 49 04 screen shot 2018-08-01 at 14 49 20

I also changed the tokenizer from words to phrases and ran a tf-idf analysis:

screen shot 2018-08-01 at 14 49 55
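For readers following along, here is a hand-rolled sketch of tf-idf with a swappable tokenizer, so the same scoring can run on single words or on two-word phrases. It is not the code used for the charts above: the formula (term count divided by document length, times ln(N / df)) is one common unsmoothed variant, library implementations often smooth the idf term so their numbers will differ, and the titles are invented.

```python
import math
import re
from collections import Counter

def words(text):
    """Word tokenizer: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(text):
    """Phrase tokenizer: every pair of adjacent words."""
    w = words(text)
    return [" ".join(pair) for pair in zip(w, w[1:])]

def tfidf(docs, tokenizer):
    """Return one {term: score} dict per document.

    tf = term count / document length, idf = ln(N / df).
    A term that appears in every document scores 0.
    """
    tokenized = [tokenizer(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(
            {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return scores

# Made-up titles, not real patents.
titles = [
    "Social network photo tagging",
    "Social network news feed ranking",
    "Social network virtual reality headset",
]
word_scores = tfidf(titles, words)
phrase_scores = tfidf(titles, bigrams)
print(max(phrase_scores[2], key=phrase_scores[2].get))
```

With the phrase tokenizer, "virtual reality" is scored as one unit instead of two unrelated words, which tends to make the top terms easier to read.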

Next step:

  1. I added up the tf-idf vectors for each year's patents. One big limitation of doing this is that I got "social" as the highest-scoring word for almost every year: even if its score in each document is low, it appears in nearly every document, so the scores add up to a high total. Although it makes sense that Facebook, as a social network company, has "social" in most of its documents, I was wondering whether another approach would give better results. I decided to glue all the documents from the same year together and run the analysis on those instead. Let's see how the results change.

  2. I also scraped the abstract of each patent, and I will run the text analysis on these too, using both the word and phrase tokenizers.
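The difference between the two approaches in step 1 can be sketched on synthetic data. This assumes an unsmoothed idf of ln(N / df); the titles are generated placeholders built so that "social" appears in half of each year's titles, and every function name is hypothetical. With a smoothed idf, a term shared by every year does not drop to zero, so the outcome may differ.

```python
import math
from collections import Counter, defaultdict

def tfidf(docs):
    """Unsmoothed tf-idf: tf = count / doc length, idf = ln(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    return [
        {t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def top_term_summed(rows):
    """Approach 1: score each title separately, then add the scores up per year."""
    scores = tfidf([title for _, title in rows])
    totals = defaultdict(lambda: defaultdict(float))
    for (year, _), doc_scores in zip(rows, scores):
        for term, value in doc_scores.items():
            totals[year][term] += value
    return {y: max(sorted(t), key=t.get) for y, t in totals.items()}

def top_term_glued(rows):
    """Approach 2: glue each year's titles into one document, then score."""
    glued = defaultdict(list)
    for year, title in rows:
        glued[year].append(title)
    years = sorted(glued)
    scores = tfidf([" ".join(glued[y]) for y in years])
    return {y: max(sorted(s), key=s.get) for y, s in zip(years, scores)}

def synthetic_rows():
    """Placeholder titles: 20 per year, 10 of them containing 'social'."""
    rows = []
    for year, theme in [(2010, "photo"), (2016, "video")]:
        rows += [(year, f"social s{year}x{i}") for i in range(10)]
        rows += [(year, f"{theme} t{year}x{i}") for i in range(2)]
        rows += [(year, f"u{year}x{i} v{year}x{i}") for i in range(8)]
    return rows

rows = synthetic_rows()
print(top_term_summed(rows))  # "social" wins every year by sheer repetition
print(top_term_glued(rows))   # the year-specific theme word wins instead
```

In the glued version each year is a single document, so a word that shows up in every year gets an idf of zero and only the words that distinguish one year from another remain.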

Problems/Questions

There is one big limitation in my dataset, and it matters for my story: the 3000+ patents in my dataset do not include patents that Facebook acquired from other companies. After reading some news stories, I found out that Facebook actually bought many patents from companies like IBM, AOL, and AT&T, and I feel that excluding the patents they bought may affect my angle on this story. I will try to find a more detailed and complete portfolio of Facebook's patents, and I may redo my work later.


christina10211 commented 6 years ago

Update

Your project content: images/words/etc

main_class_main_class

electricity_physics_electricity

electricity_physics_physics

Any changes in direction or topic?

I've spent way too much time scraping and deciphering jargon in the patent world. After finishing the keyword analysis, I found that each patent has its own very detailed classifications, and I think the classifications might provide a more systematic angle on Facebook's business strategies, so I spent some time scraping Google Patents (I still haven't figured out the USPTO database, so I went with Google Patents instead).

Problems/Questions

I am struggling with the databases a lot. USPTO seems like the most reliable website for this information, but I can't figure out how to retrieve complete data from it. Google Patents is good, but sometimes one thing is patented twice in different years (I still haven't figured out why...), so there is some double counting of patents in my database. www.justia.com is a neat website and I scraped my first set of data from it, but the downside is that it doesn't list the detailed classifications of the patents, so I have to use the patent numbers I scraped from justia.com to get the classifications from Google Patents. I was just wondering if there is a better way to deal with patent databases... I can't guarantee that my dataset is complete and accurate right now, since I got my data from different websites and there isn't a number published online for exactly how many patents Facebook owns (there are many patent acquisitions from other companies, and the records at USPTO are just so messy).
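One way to make the two-source join less fragile is to reduce every patent number to bare digits before matching, and to list same-title records for manual review instead of dropping them automatically. The sketch below assumes numbers appear in forms like `9,000,001` on one site and `US9000001B2` on the other; the records, numbers, and classification codes are placeholders, and the function names are hypothetical. It does not explain why a title shows up twice, it only surfaces the cases.

```python
import re
from collections import defaultdict

def normalize_number(raw):
    """Reduce e.g. 'US 9,000,001 B2' or '9,000,001' to the digits '9000001'.

    Returns None for anything that is not [US]digits[kind code],
    such as design patents with a leading letter.
    """
    s = re.sub(r"[\s,]", "", raw.upper())
    m = re.fullmatch(r"(?:US)?(\d+)(?:[A-Z]\d?)?", s)
    return m.group(1) if m else None

def merge_records(base_rows, classification_by_number):
    """Attach a classification to each base record, one record per number."""
    merged = {}
    for row in base_rows:
        key = normalize_number(row["number"])
        if key is None or key in merged:
            continue  # skip unparseable numbers and repeats of the same number
        merged[key] = {
            **row,
            "number": key,
            "classification": classification_by_number.get(key),
        }
    return list(merged.values())

def same_title_groups(records):
    """Group numbers that share a title, for manual review of double counting."""
    groups = defaultdict(list)
    for r in records:
        groups[r["title"].strip().lower()].append(r["number"])
    return {t: nums for t, nums in groups.items() if len(nums) > 1}

# Placeholder data: numbers, titles, and codes are invented for illustration.
base_rows = [
    {"number": "9,000,001", "title": "Example feed ranking"},
    {"number": "US 9,000,001 B2", "title": "Example feed ranking"},
    {"number": "9,000,002", "title": "Example feed ranking"},
    {"number": "9,000,003", "title": "Example headset display"},
]
scraped_classes = {"US9000001B2": "G06F", "US9000003B1": "H04L"}
by_number = {normalize_number(k): v for k, v in scraped_classes.items()}

merged = merge_records(base_rows, by_number)
print(same_title_groups(merged))
```

Records whose classification comes back as `None` are the ones missing from the second source, which gives a rough measure of how incomplete the combined dataset is.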

Another big issue I am having right now is how to visualize the keywords I got from the tf-idf analysis. I have a list of 20 keywords from Facebook's patents for the past 14 years, and I am thinking about trying a histogram of words and their frequencies?
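Before committing to a chart design, a quick text-only bar chart can show whether a ranked list of keywords reads well as bars at all. This is a rough sketch with invented scores; `text_bars` is a hypothetical helper, not part of any library.

```python
def text_bars(scores, width=30):
    """Render {term: score} as horizontal bars, highest score first.

    Bars are scaled so the top score fills `width` characters.
    """
    if not scores:
        return []
    top = max(scores.values())
    label_w = max(len(term) for term in scores)
    lines = []
    for term, value in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        bar = "#" * max(1, round(width * value / top)) if top > 0 else ""
        lines.append(f"{term:<{label_w}} {bar} {value:.2f}")
    return lines

# Invented scores for illustration only.
scores = {"news feed": 0.5, "virtual reality": 1.0, "ad targeting": 0.25}
lines = text_bars(scores, width=8)
print("\n".join(lines))
```

Printing one such block per year, stacked in order, is a cheap way to check whether the keywords change enough from year to year to carry a story before building the real graphic.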


christina10211 commented 6 years ago

Update

Your project content: images/words/etc

Facebook Patent Keyphrases

keywords

Any changes in direction or topic?

No... Still in the process of figuring out my angle on this story...


christina10211 commented 6 years ago

Update

Your project content: images/words/etc

I made an annotation in the first chart: num_patents

I just realized that I used beige as the background color for only one graph, which looks inconsistent with the color scheme of the rest, so I changed it to a white background too. main_class2_main_class

This is a tricky one. I've spent some time designing and redesigning the keyphrase chart, and what I came up with is to add a timeline as a reference for readers. I also reorganized the order of the keyphrases so that people can hopefully read it vertically, following the dotted line and color blocks (but I'm not sure if it works...) keywords_colored_combo

Any changes in direction or topic?

Nope.
