Open slifty opened 5 years ago
@metasj @isTravis do we have any word back about registering a developer URL for the integration with the google black box for this part of the process?
You mean a second URL for a dev sitemap? Is there a way to avoid most of that parsing being duplicated work? I'll ask Ian W, our contact @ G!Patents.
@metasj yes -- it is my understanding that right now there is a v1 production sitemap that gets ingested and an endpoint that spits out the CPC json for that production sitemap; we need one so that we can test that pipeline on v2
Name the dev map and update this issue ~ Ian: "We could aim our staging pipeline at the dev sitemap and output to a separate file."
What name scheme do you want? dev-sitemap.txt and dev-cpc-results.txt ?
I'm implementing the sitemap generation code as a worker that can be triggered through a package.json script. I believe we are using Heroku, which means we'll want to set up the Heroku Scheduler to call it on a regular (daily?) basis.
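A rough sketch of what such a worker could look like. The file name, the `generate-sitemap` script name, the `fetchEntries` and `upload` helpers, and the one-JSON-object-per-line file layout are all assumptions for illustration, not the project's actual code or a confirmed format.

```javascript
// Hypothetical worker file: workers/generateSitemap.js
// Assumed package.json entry (script name is made up):
//   "scripts": { "generate-sitemap": "node workers/generateSitemap.js" }

// Assumption: the sitemap is a text file with one JSON object per line.
function renderSitemap(entries) {
  return entries.map((entry) => JSON.stringify(entry)).join("\n") + "\n";
}

// `fetchEntries` and `upload` are hypothetical stand-ins for the database
// query and the storage upload; they are injected so the worker logic can
// be exercised without real services.
async function runWorker({ fetchEntries, upload }) {
  const entries = await fetchEntries();
  const body = renderSitemap(entries);
  await upload(body);
  return entries.length; // number of documents written to the sitemap
}
```

With something like this in place, the scheduler job would only need to run the package.json script (for example `npm run generate-sitemap`) on whatever interval we choose.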
Ok, just let me know when you have a dev sitemap and they'll set it up. They will query whatever we make once every day or two (regardless of whether we update it)
@joeltg is dev / prod hosted on heroku? If so it might be good to get me access. The latest PR just created a new environment variable that we'll want to set up, and it would also be good to set up the sitemap generation scheduler.
Once that is in place, @metasj, I think we'll be in OK shape to pass the dev URL to Google.
I think so - @isTravis knows and has access
👍 Added you, @slifty, using your gmail address to both Heroku apps.
@slifty ping me w/ the dev URL when ready.
Right now the site map is being generated (see #8) with the following fields:
{
url: document.fileUrl,
fileId: document.id,
companyName: '',
companyId: document.organizationId,
title: document.title,
dateUploaded: '',
datePublished: document.publicationDate,
sourcePath: '',
}
Notice that some of these fields are not actually being populated.
Two questions for the Cisco team:
Are all of the fields in the site map actually used by Google? Specifically, how important are companyName, dateUploaded, and sourcePath?
Is this sitemap being built to a documented standard somewhere, or did we invent it? This would be VERY good to formally document somewhere (e.g. on the wiki) and link to that documentation.
@metasj the dev URL is http://s3.amazonaws.com/prior-art-archive-sftp/_priorArtArchive/sitemap_dev.txt
I'm not sure if it is in the correct format yet, but let's pass that along anyway so we aren't blocked on the integration.
Two questions for the Cisco team:
- Are all of the fields in the site map actually used by Google? Specifically, how important are companyName, dateUploaded, and sourcePath?
These fields are important for Elastic indexing purposes. All of these fields are mandatory in order for the document to index successfully. We have one index per company name, and therefore companyName is required.
- Is this sitemap being built to a documented standard somewhere, or did we invent it? This would be VERY good to formally document somewhere (e.g. on the wiki) and link to that documentation.
We can definitely document them for future reference.
Also, regarding the CPC issue: I am getting a response from the backend.
curl -X POST \
  https://docker-usptofe.herokuapp.com/rest/esresults \
  -H 'Content-Type: application/json' \
  -d '{"searchQuery":"H04L29/06.cpc.","searchOperator":"AND","fetchHits":10,"fetchOffset":0,"sortBy":"date","filters":[]}'
It returned 10 results.
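Since every sitemap field is reported as mandatory for indexing, a pre-flight check on the generator side might catch entries that would fail. This is only a sketch: the field list is copied from the sitemap object earlier in this thread, while `missingFields` and `partitionEntries` are hypothetical helper names, and treating empty strings as "missing" is an assumption about what the indexer rejects.

```javascript
// Field names taken from the sitemap entry shown earlier in this thread.
const REQUIRED_FIELDS = [
  "url",
  "fileId",
  "companyName",
  "companyId",
  "title",
  "dateUploaded",
  "datePublished",
  "sourcePath",
];

// Returns the required fields that are absent or blank on an entry.
// Assumption: an empty string counts as missing for indexing purposes.
function missingFields(entry) {
  return REQUIRED_FIELDS.filter((field) => {
    const value = entry[field];
    return value === undefined || value === null || value === "";
  });
}

// Splits entries into those that look indexable and those that do not,
// recording which fields each rejected entry lacks.
function partitionEntries(entries) {
  const indexable = [];
  const rejected = [];
  for (const entry of entries) {
    const missing = missingFields(entry);
    if (missing.length === 0) {
      indexable.push(entry);
    } else {
      rejected.push({ fileId: entry.fileId, missing });
    }
  }
  return { indexable, rejected };
}
```

Run against entries as they are generated today, this would flag companyName, dateUploaded, and sourcePath on every document, which matches the gap noted above.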
This issue came out of conversation related to issue #13.
We don't have CPC code data in the dev database for v2. The process is as follows:
To enable this pipeline we need to set up the URL route that Google pings every night to ask for a sitemap. The URL where Google expects the sitemap to be published is predetermined:
/cpc/sitemap
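A minimal sketch of serving that route with a plain Node request handler. It assumes Google fetches the sitemap with a GET request and that the sitemap is plain text; `loadSitemap` is a hypothetical function standing in for wherever the generated file actually lives.

```javascript
// Builds a request handler compatible with Node's built-in http module.
// `loadSitemap` is a hypothetical async function returning the sitemap text.
function createSitemapHandler(loadSitemap) {
  return async function handle(req, res) {
    // Assumption: the nightly ping is a GET to exactly this path.
    if (req.method === "GET" && req.url === "/cpc/sitemap") {
      const body = await loadSitemap();
      res.writeHead(200, { "Content-Type": "text/plain; charset=utf-8" });
      res.end(body);
      return;
    }
    res.writeHead(404, { "Content-Type": "text/plain; charset=utf-8" });
    res.end("Not found\n");
  };
}

// Possible wiring (not started here):
//   const http = require("http");
//   http.createServer(createSitemapHandler(loadSitemap))
//     .listen(process.env.PORT || 3000);
```

If the app already uses a web framework, the same check would simply become a route definition there.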
Some relevant files for this issue:
The json includes our primary keys, so mapping Google's output to our database should be straightforward.
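As an illustration of that mapping, here is a sketch that joins CPC results back to documents by primary key. The result shape (`fileId` plus a `cpcCodes` array) is an assumption, since the actual JSON format Google returns is not documented in this issue; `indexCpcByFileId` and `attachCpcCodes` are hypothetical names.

```javascript
// Assumed (hypothetical) shape of the CPC output:
//   [{ fileId: 12, cpcCodes: ["H04L29/06"] }, ...]
// where fileId is the same primary key we emit in the sitemap.
function indexCpcByFileId(results) {
  const byFileId = new Map();
  for (const result of results) {
    byFileId.set(result.fileId, result.cpcCodes);
  }
  return byFileId;
}

// Returns copies of the documents with their CPC codes attached.
// Documents with no matching result get an empty list.
function attachCpcCodes(documents, results) {
  const byFileId = indexCpcByFileId(results);
  return documents.map((doc) => ({
    ...doc,
    cpcCodes: byFileId.get(doc.id) || [],
  }));
}
```

In practice the last step would be a database update keyed on the document id rather than an in-memory merge.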
Some additional requirements