Closed lpuettmann closed 3 years ago
I analyzed the "tokens" (basically word stems) in the title, abstract and patent body of our manually coded sample. Here's how that looks:
Nothing spectacular here. It all looks reasonable.
Here's also an update to the plan of attack I drew before:
Again, nothing spectacular. We're making progress. :+1:
OK, I made some nice progress: I've extracted the incidence matrix :heavy_check_mark:, compiled a table with absolute frequencies of tokens in both sets of documents :heavy_check_mark: and now implemented the mutual information statistic for every token :heavy_check_mark:. That criterion is probably the best way of identifying features (≈variables). And these features we will use to differentiate all other patents. (That could still happen through a heuristic keyword algorithm, or through Naive Bayes, or a fancy machine learning algorithm).
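As a minimal sketch of how the mutual information statistic can be computed for one token (following the counts notation of chapter 13 of Manning et al.; the function name and argument names here are my own, not from our code):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a token's presence and the
    automation / non-automation class.
    n11: automation docs containing the token,
    n10: non-automation docs containing the token,
    n01: automation docs without the token,
    n00: non-automation docs without the token."""
    n = n11 + n10 + n01 + n00
    n1_ = n11 + n10  # docs containing the token
    n0_ = n01 + n00  # docs without the token
    n_1 = n11 + n01  # automation docs
    n_0 = n10 + n00  # non-automation docs
    mi = 0.0
    # Sum over the four presence/class cells, skipping empty cells
    # (their contribution tends to zero).
    for n_tc, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                           (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi
```

A perfectly informative token (appearing in all automation patents and no others) gets 1 bit; a token spread evenly over both classes gets 0.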
Here comes the magic: The 30 best words that most strongly signal that a patent is an automation patent are:
The `[body]` tag means that the token showed up in the body of the patent. Only the token "system" makes the cut in both the title and the abstract. But don't worry about that: we can also count all tokens of a patent together, or handle it some other way.
So we see that we were right in concentrating on the phrase "automat". (Or we just classified those that contain that phrase ...) But the other tokens are really interesting: output, signal, input, transmit, retriev, and so on wouldn't have been immediately obvious, but seem to work well.
But that list still contained all those patents we ruled out, like pharmaceutical patents, and I haven't yet made use of our "uncertain" classification. So we can still get more precise than that.
Going forward we could, for example, focus on the 200 most highly rated tokens, and use fancy algorithms on only those.
We can also now ask this system other questions: which words signal "uncertain" classifications? Which words signal manual vs. cognitive automation? Which words signal chemical or pharmaceutical patents?
I think we just made good progress :tada:
I constructed a merged dictionary of the highest-ranked tokens (according to the mutual information criterion): 50 title tokens, 200 abstract tokens and 500 patent body tokens.
This yields 606 unique tokens. I suggest we call these our features and base our classification of the rest of the patents on this list of tokens.
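The merge step above can be sketched like this (a toy Python version; the data structures and function name are my own, just to make the selection rule concrete):

```python
def merge_top_tokens(mi_by_part, k_by_part):
    """Union of the top-k tokens (by mutual information) from each patent
    part. mi_by_part maps part -> {token: MI score}; k_by_part maps
    part -> how many top tokens to keep (e.g. title: 50, abstract: 200,
    body: 500)."""
    features = set()
    for part, k in k_by_part.items():
        scores = mi_by_part[part]
        # Rank the part's tokens by MI, descending, and keep the top k.
        ranked = sorted(scores, key=scores.get, reverse=True)
        features.update(ranked[:k])
    return sorted(features)
```

Because a token can rank highly in several parts, the union (606 here) is smaller than the sum of the per-part list lengths (750).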
I tokenized our original self-selected search dictionary (you know, `automat`, `robot`, `circuit`, and so on) and compared that to the new list of features. Terms that show up in only one of the two lists (set difference) are here and tokens that are in both lists (intersection) are here. Most notably, `robot` is not important and neither are `arm`, `autonom`, `engine` and some others. Important tokens that we also looked for are `automat`, `comput`, `detect` and some others. But note that the list of tokens that the information criterion deems important comes from our 483 manually coded patents (chemical and pharmaceutical not included). So it's actually possible that `robot` would be a good term to look for, but it just doesn't show up in any of the patents that we happened to code manually.
I think we should also look for those few extra keywords in our patents. We then end up with 623 tokens. We can always ignore them later, and if they don't show up in our manual sample, they have no predictive value. But maybe they'll show up later.
Anyway, we should be set to search for this much larger dictionary of words on all the U.S. patents. I'll program this up and set it up on Bayer's external computer. Hopefully it won't take too long.
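The per-patent search itself boils down to a binary keyword match. A toy Python sketch (the function name, the identity stemmer default and the example strings are mine, for illustration only; in practice the same stemmer that produced the dictionary would be passed in):

```python
def keyword_matches(patent_text, dictionary, stem=lambda w: w):
    """For one patent, return a 0/1 indicator vector over the dictionary
    stems: 1 if the stem appears anywhere in the (stemmed) patent text.
    `stem` should be the same stemmer used to build the dictionary."""
    stems = {stem(word) for word in patent_text.lower().split()}
    return [1 if d in stems else 0 for d in dictionary]
```

Stacking these vectors over all patents gives the incidence matrix that the Naive Bayes step consumes.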
It took about a week of computing power on my computer (4 cores), your computer (4 cores) and Bayer's computer (6 cores). Only one or two years are still missing. Then we will have all 623 words for all patents and can make a classification using the Naive Bayes algorithm. I made a number of checks and the keyword matches look plausible so far. In particular, the matches for the subset of tokens that we searched for before are the same as now, so that's good.
Ok, so the share of patents containing "automat" anywhere in the patent text still looks plausible. :+1:
(Here also as pdf.)
Awesome
For the Naive Bayes classifier, we need the conditional probability of some token (say "automat") being in either class (automation patent or non-automation patent). (There's also some smoothing going on, see chapter 13.) Here they are (Update: peaks labelled):
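The smoothing mentioned here is add-one (Laplace) smoothing for the Bernoulli model, as in chapter 13 of Manning et al. A one-line Python sketch (my own function name and argument names):

```python
def cond_prob(docs_in_class_with_token, docs_in_class):
    """Smoothed P(token present | class) for the Bernoulli model:
    add-one (Laplace) smoothing over the two outcomes present/absent,
    so the estimate never hits exactly 0 or 1."""
    return (docs_in_class_with_token + 1) / (docs_in_class + 2)
```

So a token that never appears in a class of, say, 8 documents still gets probability 1/10 rather than 0, which keeps the log-probabilities finite.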
Update 01.03.2016: Make this figure for the three parts of the patents.
I've applied the Bernoulli Naive Bayes (see explanation below) on all our patents with the new 623 words.
(Important: The pharmaceutical and chemical patents are still in there.)
We then get the following shares of automation patents per year: (Here is the pdf).
I'm very happy that we still see a similar kind of trend and that the shares look plausible. The upward trend is much more extreme, though, with more than 60 percent automation patents at the end of the sample. Here are some values:
Year | # autom. patents | # patents | Share autom. patents (%) |
---|---|---|---|
1976 | 16531 | 70194 | 23.6 (minimum value) |
1977 | 15685 | 65215 | 24.0 |
.. | .. | .. | .. |
1999 | 65244 | 153591 | 42.5 |
.. | .. | .. | .. |
2014 | 192976 | 301643 | 64.0 (maximum value) |
2015 | 37565 | 59202 | 63.5 |
Absolute number of automation patents per week: (Here is the pdf).
Share of patents classified as automation patents: (Here is the pdf).
Bernoulli Naive Bayes means that we classify based on whether a token appears at all and put no information into a token appearing several times. This is known to work better for shorter documents and functions a bit differently from the multinomial Naive Bayes, in which multiple occurrences matter. I have, for now, picked this version of the algorithm because in our experience this approach worked well: "automat" appearing once or 10 times usually didn't make much difference.
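For concreteness, a small Python sketch of the Bernoulli scoring rule (the key point being that absent vocabulary tokens also contribute, via 1 − P; names and the toy probabilities below are mine):

```python
import math

def bernoulli_nb_log_score(doc_tokens, vocab, cond_probs, log_prior):
    """Log-score of one class under Bernoulli Naive Bayes. Every
    vocabulary token contributes: log P(t|c) if present in the document,
    log (1 - P(t|c)) if absent. cond_probs maps token -> smoothed
    P(token | class)."""
    score = log_prior
    present = set(doc_tokens)
    for t in vocab:
        p = cond_probs[t]
        score += math.log(p if t in present else 1.0 - p)
    return score
```

The document is then assigned to whichever class (automation or non-automation) has the higher log-score.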
I remade the figure from above a bit prettier and with scatterplots:
I'm a bit worried why the two conditional probabilities are so highly correlated:
Part | Correlation |
---|---|
Title | 0.83 |
Abstract | 0.77 |
Body | 0.89 |
It seems intuitively wrong to me, as the tokens should contain information about which of the two categories a patent is in. I need to read up a bit on that. I was actually quite careful in picking these tokens, as I used an information criterion which explicitly penalizes tokens that appear equally often in both classes.
Ok, I think I know what's going on. We still have a lot of tokens in there that appear in many of the patents, such as `draw`, `embodi`, `fig`, `gener`, `background`, `detail` and `oper`. They all have conditional probabilities above 0.78 of appearing in either class. That's mainly because they appear so often: each of them shows up in more than 400 of our 483 patents (after deleting pharma+chemical).
So apparently the information criterion was still high enough for them that they made it into the top 600 tokens.
Here is a matrix of scatterplots of stats on these tokens against each other (all for tokens in the patent body):
(Here's the pdf)
On the top left (1,1) we have the scatterplot from above that irritated me with the high correlation of the conditional probabilities of 0.88. We can see that the conditional probabilities are particularly high for tokens that simply appear more often (see the right column, subplots (1,3) and (2,3)).
But it's interesting to see that these are not necessarily the tokens with the highest mutual information criterion values (subplots (1,2), (2,2) and (3,3)).
I think that's a good sign and we can just continue on with our classifications.
Here I am assembling my thoughts and documenting my steps on classifying our 5 million patents into automation and non-automation patents using our manually labelled sample.
We can use our sample to:
In what follows, I basically follow the description in chapter 13 of Manning et al. (2008).
Right now, our sample basically looks like this (omitting some information like the subclassifications `cognitive` and `manual`, the `comments`, the `coderID` and the `codingDate`):

(Where does the 3T come from? Every patent has a title, abstract and body, so we have an entry for every word in each of the three parts.)

[to be continued ...]