kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Patent-related document categories are ambiguously worded #602

Open kylehigham opened 4 years ago

kylehigham commented 4 years ago

Currently, patent-related documents can be assigned one of the following categories: Publication, Application, Reissued, and Provisional. I think this is an odd selection of terminology, particularly for US documents. There are a couple of reasons for this:

I suggest a change in the way patent documents are categorised so each category only contains the same kind of document in a way that is useful for researchers. I am happy to discuss the different categories that would be useful for different use cases, but for a start, I think the following categories would be useful: 'Patent', 'Application', 'Published Pre-Grant Document', 'Provisional Application', and 'Other'. 'Patent' can be split into utility, design, reissue, plant rights, or even utility models if need be (this last document type could be its own category as well). I think these categories reflect the universal patent prosecution processes across jurisdictions - provisional application establishes priority date (where available), a formal application is then filed, and patents are granted at the other end of the process. Throughout this process, auxiliary documents are generated, including published applications and search reports, which all fit into the Published Pre-Grant Document category.
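As a rough illustration of the proposal above, the categories could be modelled as one top-level type plus optional sub-types that only apply to granted patents. This is only a sketch: `DocumentCategory`, `PatentSubtype` and `CategorisedDocument` are hypothetical names and are not part of Grobid.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical top-level categories, taken from the proposal above. */
enum DocumentCategory {
    PATENT,
    APPLICATION,
    PUBLISHED_PRE_GRANT_DOCUMENT,
    PROVISIONAL_APPLICATION,
    OTHER
}

/** Hypothetical sub-types; in this sketch they are only allowed under PATENT. */
enum PatentSubtype { UTILITY, DESIGN, REISSUE, PLANT, UTILITY_MODEL }

/** Illustrative holder: exactly one category, plus zero or more patent sub-types. */
final class CategorisedDocument {
    final DocumentCategory category;
    final Set<PatentSubtype> subtypes;

    CategorisedDocument(DocumentCategory category, Set<PatentSubtype> subtypes) {
        // Assumption of this sketch: sub-types are meaningless outside PATENT.
        if (category != DocumentCategory.PATENT && !subtypes.isEmpty()) {
            throw new IllegalArgumentException("sub-types only apply to PATENT");
        }
        this.category = category;
        // EnumSet.copyOf rejects an empty plain collection, so handle that case explicitly.
        this.subtypes = subtypes.isEmpty()
                ? EnumSet.noneOf(PatentSubtype.class)
                : EnumSet.copyOf(subtypes);
    }
}
```

With this shape, a reissued design patent would be `new CategorisedDocument(DocumentCategory.PATENT, EnumSet.of(PatentSubtype.DESIGN, PatentSubtype.REISSUE))`, while a published application carries no sub-type at all.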

kermitt2 commented 4 years ago

Hello @kylehigham and thank you very much for raising these questions on the patent bibliographical information model used in Grobid.

OK, first, I am a bit rusty with the patent stuff, but when I developed the current approach in Grobid, I was still in shape :). The current representation in Grobid is inspired by DocDB, the master patent bibliographic database of the EPO, which covers 90 different patent authorities (it is used for the EPO's various patent information products, which are the most widely used in the patent world afaik). So the initial idea was to have something covering all patent systems, not just the US one, for instance.

The main issue, I think, is treating the "Publication, Application, Reissued, and Provisional" information as distinct categories, with a citation receiving exactly one of them in a mutually exclusive manner. In the internal GROBID format, these pieces of information are flags that are applied as a set of features to a citation. So:

The flags are currently as follows:

    private Boolean application = false;
    private Boolean provisional = false;
    private Boolean reissued = false;
    private Boolean plant = false;
    private Boolean design = false;
    private Boolean utility = false;

If I remember correctly, in the US there are 3 types of patent: utility, design and plant. A "utility patent" is what the rest of the world generally just calls a patent. However, other countries can have more types of "patents". The utility flag here in Grobid corresponds to "utility models", which are distinct both from a patent (a "utility patent" in the US) and from a "design patent" (or "design model" in the EU). It's confusing of course, but I suppose that is because the patent systems in different countries/regions use the same term for different legal things, so whatever term we use will always be ambiguous/confusing from the point of view of one system. What's important is to choose a model that can cover all the product variants of all the patent authorities, and that fits something like DocDB, which is designed to cover the whole patent authority mess.

Given these explanations, does the Grobid format appear less confusing?

I clearly see the problem you raise when I output a unique category in the XML results: I pick a single type among "application", "provisional", "reissued", "plant", "design" and "publication", because the XML/TEI expects a single type. I think the main problem is having to select a unique category, and we should also keep the "flag" system in the output XML.
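To make the difference concrete, here is a minimal sketch of how a set of non-exclusive flags gets collapsed into one type, next to an alternative that keeps every flag. It is not the actual Grobid code: the class name `CitationFlagsSketch`, both method names, and above all the precedence order in `singleType()` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; mirrors the six flags quoted above. */
final class CitationFlagsSketch {
    boolean application;
    boolean provisional;
    boolean reissued;
    boolean plant;
    boolean design;
    boolean utility;

    /**
     * Collapses the flags into one label, as a single-valued type attribute requires.
     * The precedence order here is an assumption, not Grobid's documented behaviour.
     */
    String singleType() {
        if (provisional) return "provisional";
        if (reissued) return "reissued";
        if (plant) return "plant";
        if (design) return "design";
        if (application) return "application";
        return "publication"; // fallback when no other label applies
    }

    /** Lossless alternative: report every flag that is set. */
    List<String> allFlags() {
        List<String> flags = new ArrayList<>();
        if (application) flags.add("application");
        if (provisional) flags.add("provisional");
        if (reissued) flags.add("reissued");
        if (plant) flags.add("plant");
        if (design) flags.add("design");
        if (utility) flags.add("utility");
        return flags;
    }
}
```

For a provisional application, both `application` and `provisional` are set: `singleType()` has to drop one of them, while `allFlags()` keeps both, which is the information a consumer would need to build their own categories.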

The categories you propose may not be mutually exclusive either, I think, and they overlap: for instance, 'Published Pre-Grant Document' would correspond to an A publication, so to an Application. Similarly, a reissued patent can be a utility patent, design patent or plant patent in the US system, which is reissued after a defect as a "normal" utility/design/plant patent variant (I think a "reissued" patent can at the same time be a provisional application, but we are maybe too deep in the particularities of the USPTO system).

What I have not considered is the possibility of citing a search report itself (a WOSA typically), because I have never seen this in a patent publication. Normally it is a communication in the patent prosecution that cites this kind of by-product, and that is outside the scope of Grobid, which covers the processing of patent document texts only, not citations found in other by-product documents of the patent systems. However, I may be wrong on this.

kylehigham commented 4 years ago

Hi @kermitt2,

Thanks for your very detailed answer, and for all your hard work on grobid! I agree that it is challenging to come up with a system that can easily classify documents granted by any patent system.

I will try to clarify my thinking on this. I think of patent-related identifiers as either a tracker code or a published document code. The former is for internal use by patent offices and applicants and is dynamic. An application number is precisely this - it refers to a set of technical information filed on a particular date, but the state of the application (in terms of its technical content) changes through the patent prosecution process. A published document code, on the other hand, refers to a specific, static document that is available for public viewing. These include patents and pre-grant publications.

However, I suspect most users don't want to lump patents and pre-grant publications into the same category, as they are very different kinds of documents with very different legal meanings. It is for this reason that I suggest a separate category for pre-grant publication numbers: they are fundamentally different from both patent numbers and application numbers. I believe this distinction is also relatively universal. It has the additional benefit of allowing a simple 'patent' category for granted patents, which is also universal (and within which one may construct sub-classifications such as design patents).
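The distinction I have in mind can be written down as a small sketch. `IdentifierKind` and its methods are hypothetical names used only to illustrate the idea; nothing here exists in grobid.

```java
/** Hypothetical classification of patent-related identifiers, as described above. */
enum IdentifierKind {
    // Tracker code: refers to a filing whose technical content changes during prosecution.
    APPLICATION_NUMBER(false),
    // Published document codes: each refers to one specific, static, public document.
    PRE_GRANT_PUBLICATION_NUMBER(true),
    PATENT_NUMBER(true);

    private final boolean publishedDocumentCode;

    IdentifierKind(boolean publishedDocumentCode) {
        this.publishedDocumentCode = publishedDocumentCode;
    }

    /** True when the identifier points at a static, publicly viewable document. */
    boolean isPublishedDocumentCode() {
        return publishedDocumentCode;
    }

    /** True when the identifier tracks a filing whose content is dynamic. */
    boolean isTrackerCode() {
        return !publishedDocumentCode;
    }
}
```

Pre-grant publications and patents share the "published document" property here, but they remain separate values, so users can still tell them apart.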

Of course, we are happy to homebrew categories for our own use. Still, I thought it would be useful to point out this potentially confusing aspect of patent-related document classification (particularly for those using grobid on US patents).