Patent-related document categories are ambiguously worded

kylehigham commented 4 years ago

Currently, patent-related documents can be assigned one of the following categories: Publication, Application, Reissued, and Provisional. I think this is an odd selection of terminology, particularly for US documents. There are a couple of reasons for this:

'Publication' and 'Application' are ambiguous terms. Patents themselves are publications, but so are published patent applications and international search reports such as those generated through the PCT process. Further, a published patent application has a different document ID than the underlying application. This is an important distinction because, in the US, applications filed before the year 2000 were not published but can still be referenced. So, either we call published patent applications 'publications', in which case they are not 'applications', or we call them 'applications', in which case we may as well rename 'publications' to 'patents' and avoid confusion. Additionally, PCT documents are certainly publications but never patents, and it doesn't make sense to have these two kinds of document in the same category.
Reissued patents are an important category of patents, but they are already indicated by the 'RE' at the start of the patent number. It is odd to include a category for reissued patents and not a category for design patents. Usually, patent offices consider both of these types of documents 'patents' which can be distinguished from utility patents by the letters at the start of the identifier. In short, if grobid is extracting these letters (which are part of the document ID), then there is no need for a separate category of documents. However, if this category is kept in place, it should be accompanied by a category for design patents, which are much more numerous, as well as a category.

I suggest changing the way patent documents are categorised so that each category contains only one kind of document, in a way that is useful for researchers. I am happy to discuss the different categories that would be useful for different use cases, but for a start, I think the following categories would be useful: 'Patent', 'Application', 'Published Pre-Grant Document', 'Provisional Application', and 'Other'. 'Patent' can be split into utility, design, reissue, plant rights, or even utility models if need be (this last document type could be its own category as well). I think these categories reflect the patent prosecution process that is common across jurisdictions: a provisional application establishes the priority date (where available), a formal application is then filed, and patents are granted at the other end of the process. Throughout this process, auxiliary documents are generated, including published applications and search reports, which all fit into the Published Pre-Grant Document category.
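
To make the proposal concrete, here is a minimal Java sketch of the five suggested categories, with the optional split of 'Patent' into sub-types. Every name in it (`DocumentCategory`, `PatentSubtype`, `ClassifiedDocument`) is a hypothetical illustration and not part of GROBID; the rule that only a patent may carry a sub-type is one possible reading of the proposal, not something the thread settles.

```java
import java.util.Optional;

// Hypothetical names throughout: none of these types exist in GROBID.
enum DocumentCategory {
    PATENT,
    APPLICATION,
    PUBLISHED_PRE_GRANT_DOCUMENT,
    PROVISIONAL_APPLICATION,
    OTHER
}

// Sub-types suggested for the 'Patent' category only.
enum PatentSubtype { UTILITY, DESIGN, REISSUE, PLANT, UTILITY_MODEL }

final class ClassifiedDocument {
    final String identifier;
    final DocumentCategory category;
    final Optional<PatentSubtype> subtype;

    ClassifiedDocument(String identifier, DocumentCategory category, PatentSubtype subtype) {
        // Assumption: a sub-type is only meaningful for granted patents.
        if (subtype != null && category != DocumentCategory.PATENT) {
            throw new IllegalArgumentException("only PATENT may carry a subtype");
        }
        this.identifier = identifier;
        this.category = category;
        this.subtype = Optional.ofNullable(subtype);
    }
}
```

With this shape, a reissued patent would be `new ClassifiedDocument(id, DocumentCategory.PATENT, PatentSubtype.REISSUE)`, while a published application would use `PUBLISHED_PRE_GRANT_DOCUMENT` with a `null` sub-type.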

kermitt2 commented 4 years ago

Hello @kylehigham and thank you very much for raising these questions on the patent bibliographical information model used in Grobid.

OK, first, I am a bit rusty with the patent stuff, but when I developed the current approach in Grobid, I was still in shape :). The current representation in Grobid is inspired by DocDB, the patent bibliographical master database of the EPO, which covers 90 different patent authorities (it is used for their different patent information products, which are the most widely used in the patent world afaik). So the initial idea was to have something covering all patent systems, not just the US one, for instance.

The main issue, I think, is treating the "Publication, Application, Reissued, and Provisional" information as distinct categories, with a citation receiving exactly one of them in a mutually exclusive manner. In the internal GROBID format, these pieces of information are flags that are applied as a set of features to a citation. So:

"publication" indicates whether the cited patent has been published or not, without presuming anything about the status or the kind of this patent. Most of the time, it is a published patent document that is cited, but sometimes a non-published patent application is cited (so, to be clear, we cite a "patent" before the end of the 18-month period after the earliest priority date, possibly withdrawn during this period).
"application" means that what is cited is a patent at its application stage (published or not). This flag will be true for PCT-stage patent applications and for the application publications (the A publications), and false for every publication on the patent after it has been granted (e.g. B publications), whatever the kind of patent cited.
"reissued" and "provisional" further characterize the cited patent, but do overlap with the application information.
There is a "design" flag currently, but it is not visible in the results because design patents are not really supported by the current Grobid: there is no training data to recognize this "aspect" (it is not frequent to see a design patent cited in a regular patent). But in the current approach, "design patent" is simply an additional flag further characterizing a patent citation.
The letters that are part of the patent citation are translated into flags to characterize the citation; this is the current approach to abstract a bit from any particular patent "numbering" system (as there are actually many). These letters are then part of the normalized patent identifier created by Grobid, in line with the DocDB/EpoDoc format or a "WYSIWYG" non-normalized format, because they have to be part of the identifier by definition.

The flags are currently as follows:

    private Boolean application = false;
    private Boolean provisional = false;
    private Boolean reissued = false;
    private Boolean plant = false;
    private Boolean design = false;
    private Boolean utility = false;
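
To illustrate how letters in a citation could be turned into such flags, here is a rough sketch. The class `CitationFlags` and the helper `fromUsNumber` are hypothetical and do not reflect GROBID's actual code. It assumes a US-style number without the country code, the usual US prefixes (`RE` for reissue, `PP` for plant, `D` for design), and the simplification from the discussion above that "A" kind codes mark application-stage publications.

```java
// Sketch only: hypothetical holder for the six flags listed above.
final class CitationFlags {
    boolean application;
    boolean provisional;
    boolean reissued;
    boolean plant;
    boolean design;
    boolean utility; // utility *model*, not a US "utility patent"

    // Hypothetical helper: derive flags from the letters of a US number and its kind code.
    static CitationFlags fromUsNumber(String number, String kindCode) {
        CitationFlags flags = new CitationFlags();
        String n = number.trim().toUpperCase(java.util.Locale.ROOT);
        if (n.startsWith("RE")) {
            flags.reissued = true;
        } else if (n.startsWith("PP")) {
            flags.plant = true;
        } else if (n.startsWith("D")) {
            flags.design = true;
        }
        // Simplifying assumption: kind codes starting with "A" are application publications.
        flags.application = kindCode != null
                && kindCode.trim().toUpperCase(java.util.Locale.ROOT).startsWith("A");
        return flags;
    }
}
```

Because the flags are independent booleans, nothing prevents several from being true at once (for example `application` together with `provisional`), which is the point of treating them as features rather than exclusive categories.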

If I remember correctly, in the US there are 3 types of patent: utility, design and plant. A "utility patent" is what is generally just called a patent in the rest of the world. However, there can be more types of "patents" in other countries. The utility flag here in Grobid corresponds to "utility models", which are distinct from a patent (a "utility patent" in the US) and from a "design patent" (or "design model" in the EU). It's confusing of course, but I suppose it's because the patent systems in different countries/regions use the same term for different legal things, so whatever term we use will always be ambiguous/confusing from the point of view of one system. What's important is to choose a model that can cover all the types of product variants for all the patent authorities, and that can fit something like DocDB, which is designed to cover all the patent authority mess.

Given these explanations, does the Grobid format appear less confusing?

I see clearly the problem you raise: when I output a unique category in the XML results, I pick a type among "application", "provisional", "reissued", "plant", "design" and "publication", because the XML/TEI is expecting a single type. I think the main problem is having to select a unique category, and we should also keep the "flag" system in the XML output.
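
As a rough sketch of that tension, the following shows one way a set of flags could be collapsed into a single type while still keeping the full list of active flags available for output. The class `TypeSelection`, both method names, and above all the priority order are assumptions made for illustration; the thread does not say how GROBID actually chooses among the flags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not GROBID's actual selection logic.
final class TypeSelection {

    // Collapse the flags into one type. The priority order here is an assumption.
    static String singleType(boolean application, boolean provisional, boolean reissued,
                             boolean plant, boolean design) {
        if (provisional) return "provisional";
        if (reissued) return "reissued";
        if (plant) return "plant";
        if (design) return "design";
        if (application) return "application";
        return "publication"; // fallback when no other flag is set
    }

    // Keep every active flag, so no information is lost by the single-type choice.
    static List<String> allFlags(boolean application, boolean provisional, boolean reissued,
                                 boolean plant, boolean design) {
        List<String> flags = new ArrayList<>();
        if (application) flags.add("application");
        if (provisional) flags.add("provisional");
        if (reissued) flags.add("reissued");
        if (plant) flags.add("plant");
        if (design) flags.add("design");
        return flags;
    }
}
```

For a provisional application, `singleType` reports only "provisional", whereas `allFlags` retains both "application" and "provisional". That lost information is what keeping the flags in the output would preserve.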

I think the categories you propose are also not mutually exclusive and do overlap: for instance, 'Published Pre-Grant Document' would correspond to an A publication, so to an Application. Similarly, a reissued patent can be a utility patent, design patent or plant patent in the US system, reissued after a defect as a "normal" utility/design/plant patent variant (I think a "reissued" patent can be at the same time a provisional application - but we are maybe too deep in the particularities of the USPTO system).

What I have not considered is the possibility of citing a search report itself (a WOSA typically), because I have never seen this in a patent publication. It's normally a communication in the patent prosecution that cites this kind of by-product, and that is outside the scope of Grobid, which covers the processing of patent document texts only, not citations found in other by-product documents of the patent systems. However, I may be wrong on this.

kylehigham commented 4 years ago

Hi @kermitt2,

Thanks for your very detailed answer, and for all your hard work on grobid! I agree that it is challenging to come up with a system that can easily classify documents granted by any patent system.

I will try to clarify my thinking on this. I think of patent-related identifiers as either a tracker code or a published document code. The former is for internal use by patent offices and applicants and is dynamic. An application number is precisely this - it refers to a set of technical information filed on a particular date, but the state of the application (in terms of its technical content) changes through the patent prosecution process. A published document code, on the other hand, refers to a specific, static document that is available for public viewing. These include patents and pre-grant publications.

However, I suspect most users don't want to lump patents and pre-grant publications into the same category, as they are very different kinds of documents with very different legal meanings. It is for this reason that I suggest a separate category for pre-grant publication numbers: they are fundamentally different from both patent numbers and application numbers. I believe this distinction is also relatively universal. It has the additional benefit of allowing a simple 'patent' category for granted patents, which is also universal (and within which one may construct sub-classifications such as design patents).

Of course, we are happy to homebrew categories for our own use. Still, I thought it would be useful to point out this potentially confusing aspect of patent-related document classification (particularly for those using grobid on US patents).

kermitt2 / grobid

Patent-related document categories are ambiguously worded #602