prior-art-archive / priorartarchive.org

Prior Art Archive Site
https://priorartarchive.org
GNU General Public License v2.0

Implement CPC pipeline #10

Open · slifty opened this issue 5 years ago

slifty commented 5 years ago

This issue came out of a conversation related to issue #13.

We don't have CPC code data in the dev database for v2. The process is as follows:

  1. Google accesses a sitemap on a regular basis and processes the site.
  2. Google publishes a JSON file containing CPC codes related to each page.

To enable this pipeline we need to set up the URL route that Google pings every night to ask for a sitemap. The URL at which Google expects the sitemap to be published is predetermined: /cpc/sitemap. A rough sketch of such a route is below.
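
For concreteness, a minimal sketch of what that route could look like, assuming an Express app (buildSitemap is a placeholder here, not an existing helper in the repo):

// Sketch only: serve the sitemap Google polls nightly at the
// predetermined path. buildSitemap stands in for whatever ends up
// generating one entry per document.
const express = require('express');
const app = express();

async function buildSitemap() {
  // Placeholder: the real version would pull rows from the documents
  // table and format one sitemap entry per line.
  return ['https://priorartarchive.org/doc/example'];
}

app.get('/cpc/sitemap', async (req, res) => {
  const lines = await buildSitemap();
  res.type('text/plain').send(lines.join('\n'));
});

app.listen(process.env.PORT || 3000);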

Some relevant files for this issue:

  1. Here's an example of a sitemap Google would parse.
  2. The code that generates a sitemap doesn't exist in this version yet.
  3. The output from Google, which will always live at the same URL and needs to be ingested on a regular basis.

The JSON includes our primary keys, so mapping Google's output to our database should be straightforward.

Some additional requirements:

  1. We should re-publish the sitemap on an hourly cadence.
  2. We should ingest the JSON on a daily cadence (a rough sketch of the ingestion step follows this list).
  3. CPC codes should always be updated to reflect the latest JSON.
  4. There is no need to store historic values for CPC codes.
  5. CPC codes should be truncated to the first four characters to reduce false specificity.
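
A rough sketch of what that daily ingestion could look like, assuming a Sequelize-style Document model and guessing at the shape of Google's JSON (an array of { fileId, cpcCodes } objects; the real field names may differ):

// Sketch only: fetch Google's CPC JSON from its fixed URL, truncate each
// code to four characters, and overwrite whatever is stored (no history).
// The Document model, models path, and JSON field names are assumptions.
const fetch = require('node-fetch');
const { Document } = require('./models'); // hypothetical models path

const CPC_JSON_URL = process.env.CPC_JSON_URL; // the fixed URL Google publishes to

async function ingestCpcCodes() {
  const response = await fetch(CPC_JSON_URL);
  const entries = await response.json();

  for (const entry of entries) {
    const truncated = entry.cpcCodes.map((code) => code.slice(0, 4));
    await Document.update(
      { cpcCodes: truncated },
      { where: { id: entry.fileId } },
    );
  }
}
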
slifty commented 5 years ago

@metasj @isTravis do we have any word back about registering a developer URL for the integration with the Google black box for this part of the process?

metasj commented 5 years ago

You mean a second URL for a dev sitemap? Is there a way to avoid most of that parsing being duplicated work? I'll ask Ian W, our contact @ G!Patents.

slifty commented 5 years ago

@metasj yes -- it is my understanding that right now there is a v1 production sitemap that gets ingested, and an endpoint that spits out the CPC JSON for that production sitemap; we need a dev equivalent so that we can test the pipeline on v2.

metasj commented 5 years ago

Name the dev map and update this issue ~ Ian: "We could aim our staging pipeline at the dev sitemap and output to a separate file."

What naming scheme do you want? dev-sitemap.txt and dev-cpc-results.txt?

slifty commented 5 years ago

I'm implementing the sitemap generation code as a worker that can be triggered through a package.json script. I believe we are using Heroku, which means we'll want to set up the Heroku Scheduler to call it on a regular (daily?) basis. Roughly what I have in mind is sketched below.
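
A sketch of the wiring, purely for illustration (the file path, module path, and script name are made up and may change once the PR lands):

// Sketch only: workers/generateSitemap.js as the worker entry point.
// A package.json script along the lines of
//   "generate-sitemap": "node workers/generateSitemap.js"
// would let the Heroku Scheduler run `npm run generate-sitemap` daily.
const { buildSitemap } = require('./sitemap'); // hypothetical module

buildSitemap()
  .then(() => process.exit(0))
  .catch((err) => {
    console.error(err);
    process.exit(1);
  });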

metasj commented 5 years ago

Ok, just let me know when you have a dev sitemap and they'll set it up. They will query whatever we make once every day or two (regardless of whether we update it).

slifty commented 5 years ago

@joeltg is dev / prod hosted on Heroku? If so, it might be good to get me access. The latest PR just created a new environment variable that we'll want to set up, and it would also be good to set up the sitemap generation scheduler.

Once that is in place @metasj I think we'll be in OK shape to pass the dev URL to google.

joeltg commented 5 years ago

I think so - @isTravis knows and has access

isTravis commented 5 years ago

👍 Added you, @slifty, using your gmail address to both Heroku apps.

metasj commented 5 years ago

@slifty ping me w/ the dev URL when ready.

slifty commented 5 years ago

Right now the sitemap is being generated (see #8) with the following fields:

{
  url: document.fileUrl,
  fileId: document.id,
  companyName: '', // not currently populated
  companyId: document.organizationId,
  title: document.title,
  dateUploaded: '', // not currently populated
  datePublished: document.publicationDate,
  sourcePath: '', // not currently populated
}

Notice that some of these fields are not actually being populated.

Two questions for the Cisco team:

  • Are all of the fields in the sitemap actually used by Google? Specifically, how important are companyName, dateUploaded, and sourcePath?

  • Is this sitemap being built against a documented standard somewhere, or did we invent it? This would be VERY good to formally document somewhere (e.g. on the wiki) and link to that documentation.

slifty commented 5 years ago

@metasj the dev URL is http://s3.amazonaws.com/prior-art-archive-sftp/_priorArtArchive/sitemap_dev.txt

I'm not sure if it is the correct format yet, but let's pass that along anyway so we aren't blocked on the integration.

shikham01 commented 5 years ago

  • Are all of the fields in the sitemap actually used by Google? Specifically, how important are companyName, dateUploaded, and sourcePath?

These fields are important for Elasticsearch indexing; all of them are mandatory in order for the document to index successfully. We have an index per company name, and therefore companyName is required.

  • Is this sitemap being built against a documented standard somewhere, or did we invent it? This would be VERY good to formally document somewhere (e.g. on the wiki) and link to that documentation.

We can definitely document them for future reference.

Also, regarding the CPC issue: I am getting a response from the backend:

curl -X POST \
  https://docker-usptofe.herokuapp.com/rest/esresults \
  -H 'Content-Type: application/json' \
  -d '{"searchQuery":"H04L29/06.cpc.","searchOperator":"AND","fetchHits":10,"fetchOffset":0,"sortBy":"date","filters":[]}'

It returned 10 results.