lpuettmann / patent-automat

Search through corpus of patent texts to measure automation :computer: over time and its effect.
MIT License

Apply machine learning to classify patents #16

Closed lpuettmann closed 3 years ago

lpuettmann commented 8 years ago

Here I am assembling my thoughts and documenting my steps on classifying our 5 million patents into automation and non-automation patents, using our manually labelled sample.

We can use our sample to:

  1. extract the words that correlate significantly with being in either of these groups.
  2. "train" (estimate) a classifier that we can then apply to all of our patents. The first approach will be the Naive Bayes classifier, which is very simple. In a second step, we might try fancier things like Support Vector Machines or Random Forests.

In what follows, I basically follow the description in chapter 13 of Manning et al. (2008).

Right now, our sample basically looks like this (omitting some information like the subclassifications cognitive and manual, the comments, the coderID and the codingDate):

| Index | Year | PatNr | indicAutomat | highlyUncertain |
| --- | --- | --- | --- | --- |
| 1 | 1976 | 3924419 | 0 | 0 |
| 2 | 1976 | 3835599 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 560 | 2015 | 8728717 | 1 | 0 |
  1. The first difficulty is to extract the complete, clean strings of the patents. The nice thing is that we've already saved all the files and assembled a patent index, which contains the information on which patents to find in which files and on what lines of the weekly files. We've also already written code that extracts the title, abstract and body of the patents.
  2. But previously we got away with leaving some of the mark-up garbage in there. This will cause problems now. There should be close to no mark-up in the title and abstract strings, but a lot in the body. Hopefully we'll get around that by assembling a dictionary of known mark-up like "PATN", "<DOCTYPE" and so on.
  3. Once we have the pure strings (separately for the three parts of the patent), we convert all strings to lower-case.
  4. Then we cut the strings into separate words, using the space character as separator.
  5. Then we delete all known stop words (minor words that we know don't carry much content), such as "and", "it", "a", "the". We can use a standard dictionary for that.
  6. Then we use the Porter stemmer, a standard algorithm that converts words to their stems. So we basically convert "automatically" and "automation" to "automat".
  7. Collect all unique words that are used in any of the 560 patents in a dictionary. Call the length of this dictionary T.
  8. Make a gigantic matrix of 1s and 0s for every document. 1: term appears at least once. 0: term does not appear. [Check this: we might want the frequency of occurrences as well] (There's a code sketch of steps 3 to 9 after this list.)
|  | word1 | word2 | ... | word3T |
| --- | --- | --- | --- | --- |
| doc1 | 1 | 0 | ... | 0 |
| doc2 | 1 | 1 | ... | 0 |
| ... | ... | ... | ... | ... |
| doc560 | 0 | 0 | ... | 1 |

(Where does the 3T come from? Every patent has a title, abstract and body, so we keep a separate entry for each of the three parts for every word.)

  9. This matrix might cause memory problems, so we convert it to an inverted index, which stores for each term the documents it appears in:

| Term | Documents |
| --- | --- |
| word1 | doc1, doc2 |
| word2 | doc2 |
| ... | ... |
| word3T | doc560 |
  10. Here we will already be able to see some summary statistics. We will find out what the most frequent terms are in either group.
  11. Then we choose a number of these terms as "features" (variables) for our further analysis. We will calculate the mutual information and the χ2-statistic for all words.
  12. Select the ones with the highest values. There's a long discussion on how to do this well, so a bit of reading is needed here.
  13. "Learn" (estimate) the Naive Bayes classifier on the selected features.
  14. Classify all our 5 million patents with our "trained" classifier.
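
As referenced above, here's a minimal sketch of steps 3 to 9 in Python, assuming NLTK's stop word list and Porter stemmer are available; the function names and the `docs` input are mine for illustration, not existing repo code:

```python
# Sketch of steps 3-9: lower-case, tokenize on spaces, drop stop
# words, stem, then build the incidence and inverted-index structures.
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def tokenize(text):
    """Steps 3-6: lower-case, split on spaces, remove stop words, stem."""
    return [stemmer.stem(w) for w in text.lower().split()
            if w not in stop_words]

def build_index(docs):
    """Steps 7-9: collect the dictionary of unique tokens, the 0/1
    incidence information (stored sparsely as one set per document)
    and the inverted index mapping each token to its documents."""
    incidence = []                # tokens per document (the 1-entries)
    inverted = defaultdict(set)   # token -> ids of documents containing it
    for doc_id, text in enumerate(docs):
        tokens = set(tokenize(text))
        incidence.append(tokens)
        for t in tokens:
            inverted[t].add(doc_id)
    dictionary = sorted(inverted)  # T = len(dictionary)
    return dictionary, incidence, inverted
```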

[to be continued ...]

lpuettmann commented 8 years ago

I analyzed the "tokens" (basically word stems) in the title, abstract and patent body of our manually coded sample. Here's how that looks:

[Figure: hist_nr_tok]

Nothing spectacular here. It all looks reasonable.

Here's also an update to the plan of attack I drew before: [Figure: whiteboard]

Again, nothing spectacular. We're making progress. :+1:

lpuettmann commented 8 years ago

OK, I made some nice progress: I've extracted the incidence matrix :heavy_check_mark:, compiled a table with absolute frequencies of tokens in both sets of documents :heavy_check_mark: and now implemented the mutual information statistic for every token :heavy_check_mark:. That criterion is probably the best way of identifying features (≈ variables). These features we will then use to classify all other patents. (That could still happen through a heuristic keyword algorithm, through Naive Bayes, or through a fancier machine learning algorithm.)
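
For reference, here's a minimal sketch of the mutual information computation for one token and one class, following the document-count formula in chapter 13 of Manning et al. (2008); the count variable names are mine:

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U;C) between a token and a class, as in
    Manning et al. (2008), ch. 13. n11: documents in the class that
    contain the token; n10: documents outside the class containing
    the token; n01: documents in the class without the token;
    n00: all remaining documents."""
    n = n11 + n10 + n01 + n00
    n1_ = n11 + n10   # documents containing the token
    n0_ = n01 + n00   # documents without the token
    n_1 = n11 + n01   # documents in the class
    n_0 = n10 + n00   # documents outside the class
    mi = 0.0
    for n_tc, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                           (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_tc > 0:  # by convention, 0 * log(0) = 0
            mi += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return mi
```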

Here comes the magic: The 30 best words that most strongly signal that a patent is an automation patent are:

[Figure: mutual_information_tokens]

The [body] tag means that the token showed up in the body of the patent. Only the token "system" makes the cut in both the title and the abstract. But don't worry about that: we can also count all tokens for the patents together, or handle that some other way.

So we see that we were right in concentrating on the phrase "automat". (Or we just classified those that contain that phrase ...) But the other tokens are really interesting: output, signal, input, transmit, retriev, and so on wouldn't have been immediately obvious, but seem to work well.

But that list still contains all those patents we ruled out, like pharmaceutical patents, and I haven't yet made use of our "uncertain" classification. So we will get more precise than that.

Going forward we could, for example, focus on the 200 most highly rated tokens, and use fancy algorithms on only those.

We can also now ask this system other questions: which words signal "uncertain" classifications? Which words signal manual vs. cognitive automation? Which words signal chemical or pharmaceutical patents?

I think we just made good progress :tada:

lpuettmann commented 8 years ago

I constructed a merged dictionary from the highest-ranked tokens (according to the mutual information criterion): the top 50 title tokens, 200 abstract tokens and 500 patent body tokens.

This yields a list of 606 unique tokens. I suggest we call these our features and base our classification of the rest of the patents on this list of tokens.

I tokenized our original self-selected search dictionary (you know: automat, robot, circuit, and so on) and compared that to the new list of features. Tokens that show up in only one of the lists (set difference) are here and tokens that are in both lists (intersection) are here. Most notably, robot is not important, and neither are arm, autonom, engine and some others. Important tokens that we also looked for are automat, comput, detect and some others. But note that the list of tokens that the information criterion deems important comes from our 483 manually coded patents (chemical and pharmaceutical not included). So it's actually possible that robot would be a good term to look for, but just doesn't show up in any of the patents that we happened to code manually.

I think we should also look for those few extra keywords in our patents. We then end up with 623 tokens. We can always just ignore them later, and if they don't show up in our manual sample, then they have no predictive value. But maybe they'll show up later.
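
The bookkeeping here is just set arithmetic; a toy illustration in Python (the token lists are made up, not our actual lists):

```python
# Made-up token lists for illustration only.
old_search_terms = {"automat", "robot", "circuit", "arm", "autonom"}
new_features = {"automat", "comput", "detect", "signal", "output"}

in_one_list_only = old_search_terms ^ new_features  # symmetric difference
in_both_lists = old_search_terms & new_features     # intersection
final_dictionary = old_search_terms | new_features  # merged search dictionary
```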

Anyway, we should be set to search for this much larger dictionary of words on all the U.S. patents. I'll program this up and set it up on Bayer's external computer. Hopefully it won't take too long.

lpuettmann commented 8 years ago

It took about a week of computing power on my computer (4 cores) and yours (4 cores), plus Bayer's computer (6 cores). Only one or two years are still missing. Then we will have all 623 words for all patents and can make a classification using the Naive Bayes algorithm. I made a number of checks and the keyword matches look plausible so far. In particular, the matches for the subset of tokens that we searched for before are the same as now, so that's good.

lpuettmann commented 8 years ago

Ok, so the share of patents containing "automat" anywhere in the patent text still looks plausible. :+1:

[Figure: automat_newmatches]

(Here also as pdf.)

KatjaMa commented 8 years ago

Awesome


lpuettmann commented 8 years ago

For the Naive Bayes classifier, we need the conditional probability of some token (say "automat") appearing in either class (automation patent or non-automation patent). [There's also some smoothing going on, see chapter 13.] Here they are (update: peaks labelled): [Figure: condprob_bernoullinb]

Update 01.03.2016: Make this figure for the three parts of the patents.
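
For reference, the add-one (Laplace) smoothed estimate from chapter 13 is straightforward; a sketch with hypothetical counts:

```python
def cond_prob(n_ct, n_c):
    """Smoothed P(token | class) for Bernoulli Naive Bayes, as in
    Manning et al. (2008), ch. 13: n_ct documents of the class
    contain the token, out of n_c documents in the class."""
    return (n_ct + 1) / (n_c + 2)

# Hypothetical example: a token appearing in 300 of 420 automation
# patents gets cond_prob(300, 420) = 301 / 422, roughly 0.71.
```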

lpuettmann commented 8 years ago

I've applied the Bernoulli Naive Bayes (see explanation below) on all our patents with the new 623 words.

(Important: The pharmaceutical and chemical patents are still in there.)

Yearly results

We then get the following shares of automation patents per year: [Figure: nb_autompats_1976-2015] (Here is the pdf.)

I'm very happy that we still see a similar kind of trend and that the shares look plausible. The upward trend is much more extreme, though, with more than 60 percent automation patents at the end of the sample. Here are some values:

| Year | # autom. patents | # patents | Share autom. patents (%) |
| --- | --- | --- | --- |
| 1976 | 16531 | 70194 | 23.6 (minimum value) |
| 1977 | 15685 | 65215 | 24.0 |
| .. | .. | .. | .. |
| 1999 | 65244 | 153591 | 42.5 |
| .. | .. | .. | .. |
| 2014 | 192976 | 301643 | 64.0 (maximum value) |
| 2015 | 37565 | 59202 | 63.5 |

Weekly results

Absolute number of automation patents per week: [Figure: nb_autompats_weekly_1976-2015] (Here is the pdf.)

Share of patents classified as automation patents: [Figure: nb_autompats_share_weekly_1976-2015] (Here is the pdf.)

Explanation: Bernoulli Naive Bayes

Bernoulli Naive Bayes means that we classify based on whether a token appears at all and do not use the information that some token appears several times. This is known to work better for shorter documents, and it functions a bit differently than the multinomial Naive Bayes, in which multiple occurrences matter. I have, for now, picked this version of the algorithm because in our experience this approach worked well: "automat" appearing once or 10 times usually didn't make much of a difference.
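
A minimal sketch of the resulting decision rule, assuming the prior and the smoothed conditional probabilities come from the training step; note that in the Bernoulli model the tokens that are *absent* from a document also enter the score:

```python
from math import log

def bernoulli_nb_score(doc_tokens, vocab, prior, cond_prob):
    """Log-score of one class for a document. doc_tokens: set of
    tokens in the document; vocab: the feature tokens (our 623);
    prior: P(class); cond_prob: dict token -> P(token | class).
    Absent tokens contribute log(1 - P(token | class)), which is
    what distinguishes the Bernoulli from the multinomial model."""
    score = log(prior)
    for t in vocab:
        if t in doc_tokens:
            score += log(cond_prob[t])
        else:
            score += log(1.0 - cond_prob[t])
    return score

# Classify as an automation patent if that class scores higher:
# is_automation = (bernoulli_nb_score(tokens, vocab, p_auto, cp_auto)
#                  > bernoulli_nb_score(tokens, vocab, p_other, cp_other))
```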

lpuettmann commented 8 years ago

I remade the figure from above, a bit prettier and with scatterplots: [Figure: cond_prob_tokens_class]

I'm a bit worried about why the two conditional probabilities are so highly correlated:

| Part | Correlation |
| --- | --- |
| Title | 0.83 |
| Abstract | 0.77 |
| Body | 0.89 |

It seems intuitively wrong to me, as the tokens should contain information about whether the patent is in one of the two categories. I need to read up a bit on that. I was actually quite careful in picking these tokens, as I used an information criterion which explicitly penalizes tokens that appear equally often in both classes.

lpuettmann commented 8 years ago

Ok, I think I know what's going on. We still have a lot of tokens in there that appear in many of the patents, such as draw, embodi, fig, gener, background, detail, oper. They all have conditional probabilities of more than 0.78 of appearing in either class. That's mainly because they appear so often: all of them show up in more than 400 out of our 483 patents (after deleting pharma + chemical).

So apparently the information criterion was still high for them, which is why we included them in the top 600 tokens.
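
A made-up example of how that happens (the counts are illustrative, chosen only to be consistent with our 483-patent sample):

```python
# Illustrative counts: a token like "fig" appearing in 240 of 250
# automation patents and in 200 of 233 non-automation patents.
n_auto, n_other = 250, 233            # 483 coded patents in total
t_auto, t_other = 240, 200            # 440 patents contain the token

p_token_auto = (t_auto + 1) / (n_auto + 2)      # ~0.96
p_token_other = (t_other + 1) / (n_other + 2)   # ~0.86

# Both conditional probabilities end up high, so the token hardly
# separates the classes, even though the slight asymmetry can still
# give it a respectable mutual information value.
```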

Here is a matrix of scatterplots of stats on these tokens against each other (all for tokens in the patent body):

[Figure: plotmatrix_tstats] (Here's the pdf.)

On the top left (1,1) we have the scatterplot from above that irritated me, with the high correlation of the conditional probabilities of 0.88. We can see that the conditional probabilities are particularly high for tokens that simply appear more often (see the right column, i.e. subplots (1,3) and (2,3)).

But it's interesting to see that these are not necessarily the tokens with the high mutual information criterion values (subplots (1,2), (2,2) and (3,3)).

I think that's a good sign and we can just continue on with our classifications.