openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Expose patent family ID on the API #359

Closed madgpap closed 8 years ago

madgpap commented 8 years ago

The patent family ID (scco:SCCO_000008) is an integer which is included in the surechembl_1.3_patent.ttl.gz file:

<http://rdf.ebi.ac.uk/resource/surechembl/patent/US-20140024742-A1> a scco:SCCO_000002 ;
        rdfs:label "US-20140024742-A1" ;
        scco:SCCO_000008 "44022805"^^xsd:int ;
        scco:SCCO_000007 "2014-01-23"^^xsd:date .
<http://rdf.ebi.ac.uk/resource/surechembl/patent/US-20140024751-A1> a scco:SCCO_000002 ;
        rdfs:label "US-20140024751-A1" ;
        scco:SCCO_000008 "49947076"^^xsd:int ;
        scco:SCCO_000007 "2014-01-23"^^xsd:date .

The ID gives important information about the patent; however, it is not currently exposed in the API (/patent call).

AlasdairGray commented 8 years ago

Is there a use case here for a lens? Get all patents that are in the same family group vs getting only the specified patent.

stain commented 8 years ago

I would have expected a URI for a patent family if this is shared across patents, e.g. to look up patents in a family as @AlasdairGray mentions.

e.g.

http://rdf.ebi.ac.uk/resource/surechembl/family/44022805 

Might there be additional information about the family, like a name?

stain commented 8 years ago

For the given patent family 44022805 I find

http://rdf.ebi.ac.uk/resource/surechembl/patent/EP-2694606-A1 "EP-2694606-A1"
http://rdf.ebi.ac.uk/resource/surechembl/patent/US-20140024742-A1 "US-20140024742-A1"
http://rdf.ebi.ac.uk/resource/surechembl/patent/WO-2012138348-A1 "WO-2012138348-A1"

Does this mean that these patents are somewhat similar (e.g. skos:closeMatch) or prov:alternateOf each other?

stain commented 8 years ago

I had a go expanding the patent call so it includes the family siblings using prov:alternateOf:

<http://rdf.ebi.ac.uk/resource/surechembl/patent/US-20140024742-A1> void:inDataset <http://www.ebi.ac.uk/surechembl> ;
  dct:title "COATING COMPOSITION, AND A PROCESS FOR PRODUCING THE SAME" ;
  prov:alternateOf <http://rdf.ebi.ac.uk/resource/surechembl/patent/EP-2694606-A1> ,
    <http://rdf.ebi.ac.uk/resource/surechembl/patent/WO-2012138348-A1> .

It's done as an OPTIONAL - so for instance for http://rdf.ebi.ac.uk/resource/surechembl/patent/EP-1339685-A2 there are no siblings listed.
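
To illustrate the OPTIONAL behaviour outside SPARQL, here is a minimal Python sketch. The `FAMILY_OF` mapping and the `siblings` helper are hypothetical names, not part of the API. The family ID for the first three patents comes from this thread; the ID shown for EP-1339685-A2 is a placeholder, since its real value is not given here.

```python
# Family ID 44022805 for the first three patents is taken from this thread.
# The ID for EP-1339685-A2 is a placeholder; its real value is not given here.
FAMILY_OF = {
    "US-20140024742-A1": 44022805,
    "EP-2694606-A1": 44022805,
    "WO-2012138348-A1": 44022805,
    "EP-1339685-A2": 99999999,  # placeholder value
}


def siblings(patent, family_of):
    """Return the other patents sharing this patent's family ID, sorted.

    An empty list plays the role of an unmatched OPTIONAL: the patent is
    still returned by the caller, just without any alternateOf links.
    """
    family = family_of[patent]
    return sorted(p for p, f in family_of.items() if f == family and p != patent)


print(siblings("US-20140024742-A1", FAMILY_OF))
print(siblings("EP-1339685-A2", FAMILY_OF))
```

The second call returns an empty list, which corresponds to a patent result with no prov:alternateOf statements.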

stain commented 8 years ago

I would advise against exposing directly

    scco:SCCO_000008 "50188807"^^xsd:int

and encourage SureChEMBL to change this, which is obviously a SQL foreign key. Perhaps they could produce a separate RDF file, surechembl_1.3_patent_family.ttl.gz, with the different patents organized as being skos:inScheme <http://rdf.ebi.ac.uk/resource/surechembl/patentoffice/US> etc. (so you know which authority they are under), and then have skos:closeMatch mappings from patents in different families, equivalent to my prov:alternateOf statement above.

Alternatively, you need to create a more abstract patent identifier that they are all prov:specializationOf. This avoids duplicating the sibling relationship on both sides, but also allows you to provide some information about the family (label?) and perhaps its provenance (how are patents agreed to be in the same family?)
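
As a rough sketch of that second option, the Python below renders one prov:specializationOf statement per family member as Turtle. The `family_triples` helper is hypothetical, and the family URI pattern is only the one floated earlier in this thread, not a namespace SureChEMBL is known to publish.

```python
PATENT_BASE = "http://rdf.ebi.ac.uk/resource/surechembl/patent/"
# Hypothetical URI pattern suggested earlier in this thread; an assumption,
# not a published namespace.
FAMILY_BASE = "http://rdf.ebi.ac.uk/resource/surechembl/family/"


def family_triples(family_id, patents):
    """Render each patent as a prov:specializationOf of one family resource."""
    family_uri = f"<{FAMILY_BASE}{family_id}>"
    lines = ["@prefix prov: <http://www.w3.org/ns/prov#> .", ""]
    for patent in sorted(patents):
        # One statement per member; no sibling-to-sibling links are needed.
        lines.append(f"<{PATENT_BASE}{patent}> prov:specializationOf {family_uri} .")
    return "\n".join(lines)


print(family_triples(
    44022805,
    ["US-20140024742-A1", "EP-2694606-A1", "WO-2012138348-A1"],
))
```

With this shape, a family of n patents needs n statements rather than the n(n-1) directed links a fully duplicated sibling relationship would require, and the family resource gives a place to attach a label or provenance.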

stain commented 8 years ago

Although I added it to the API, I'll leave this issue open to also update the swagger output docs and https://dev.openphacts.org/docs/develop

madgpap commented 8 years ago

@stain Thanks. The Family ID is not a database foreign key. It is an otherwise arbitrary number that helps group members of the same family. It is provided to us via IFI Claims (who, in turn, take it directly from the EPO). More about patent families here: http://www.epo.org/searching-for-patents/helpful-resources/first-time-here/patent-families/about.html and http://www.epo.org/searching-for-patents/helpful-resources/first-time-here/patent-families/definitions.html

Members of the same patent family describe the same invention and usually have identical titles and very similar text.

stain commented 8 years ago

So it's an IFI or EPO identifier, really?

It says there that the family definitions differ in the kind of relationship they require between the patents. I think something like prov:alternateOf or skos:related is as good as we can get at saying what that relationship is, then.

I like your suggestion about a batch call to group a list of patents into families.

madgpap commented 8 years ago

We use the "simple" family definition. Quoting the EPO website:

If all the priorities of two documents are the same, they are referred to as "equivalents". This definition is currently used in Espacenet for listing the documents under "also published as" on the bibliographic data view.

madgpap commented 8 years ago

This is implemented with the alternateOf in the patent call. One caveat: the dummy family ID value is -1, so we need to ignore it and not group by it when building the alternateOf links; otherwise every patent carrying the dummy value would be listed as a sibling of all the others, which would be a major bug. Closing!
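
The -1 caveat can be sketched in Python as follows. This is an illustration only, not the API's actual query: `group_by_family` and `DUMMY_FAMILY_ID` are hypothetical names, and the two `XX-...` patents are made-up entries standing in for records that carry the dummy value.

```python
from collections import defaultdict

DUMMY_FAMILY_ID = -1  # dummy value mentioned above: no real family assigned


def group_by_family(family_of):
    """Group patents by family ID, skipping the dummy ID entirely."""
    families = defaultdict(list)
    for patent, family in family_of.items():
        if family == DUMMY_FAMILY_ID:
            continue  # never treat dummy-ID patents as siblings of each other
        families[family].append(patent)
    return {family: sorted(members) for family, members in families.items()}


family_of = {
    "US-20140024742-A1": 44022805,  # from this thread
    "EP-2694606-A1": 44022805,      # from this thread
    "WO-2012138348-A1": 44022805,   # from this thread
    "XX-0000001-A1": -1,            # made-up patent with the dummy ID
    "XX-0000002-A1": -1,            # made-up patent with the dummy ID
}

print(group_by_family(family_of))
```

Without the `continue`, the two made-up patents would end up in one group keyed by -1 and would be reported as alternates of each other even though they are unrelated.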