Generate sitemaps - Githubissues

slifty commented 5 years ago

This PR generates sitemaps, which is the first part of the CPC process (issue #10).

Changes

Some key changes from the way this works in v1:

Instead of chaining promises, this uses async functions, which makes for more readable code.
Some of the fields (sourcePath and dateUploaded) don't seem to be represented any longer in the model.
I forgot to re-implement company title lookup (whoops); I'm doing that now, but please review anyway.
AWS locations are now driven by config files.
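
To illustrate the first change above, here is a minimal sketch of the same logic written both ways. The names fetchDocuments, generateSitemapWithPromises, and generateSitemapAsync are hypothetical stand-ins for illustration, not functions from this PR, and the document shape is assumed.

```javascript
// Hypothetical stand-in for a database query; not this repo's model code.
function fetchDocuments() {
  return Promise.resolve([{ id: 'a1' }, { id: 'b2' }]);
}

// v1 style: a promise chain.
function generateSitemapWithPromises() {
  return fetchDocuments().then((docs) => docs.map((doc) => doc.id).join('\n'));
}

// New style: the same logic as an async function.
async function generateSitemapAsync() {
  const docs = await fetchDocuments();
  return docs.map((doc) => doc.id).join('\n');
}
```

Both functions return a promise, so callers can switch from one to the other without changes.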

New Config

Two new lines need to be added to your config file:

process.env.SITEMAP_DESTINATION_S3_BUCKET = ''
process.env.SITEMAP_DESTINATION_PATH = ''

The first is the S3 bucket (e.g. prior-art-archive-sftp). The second is the path within the S3 bucket (e.g. _priorArtArchive/sitemap_dev.txt).
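
As a rough sketch of how these two settings might be consumed, the helper below reads them and fails early if either is missing. The function name getSitemapDestination and the returned { Bucket, Key } shape are assumptions for illustration; the PR's actual upload code may look different.

```javascript
// Reads the two new settings and returns the upload target.
// getSitemapDestination and the { Bucket, Key } shape are hypothetical.
function getSitemapDestination(env = process.env) {
  const bucket = env.SITEMAP_DESTINATION_S3_BUCKET;
  const path = env.SITEMAP_DESTINATION_PATH;
  if (!bucket || !path) {
    // An empty string (the default in the config lines above) counts as unset.
    throw new Error(
      'SITEMAP_DESTINATION_S3_BUCKET and SITEMAP_DESTINATION_PATH must be set'
    );
  }
  return { Bucket: bucket, Key: path };
}
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process.env.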

This PR also sneaks in an editor config file, which should be its own PR but don't tell anybody and they might not notice.

slifty commented 5 years ago

Some questions for @isTravis :

Are all of the fields in the site map actually used?
Is this sitemap being built in terms of a documented standard somewhere? Or did we invent it? This would be VERY good to formally document somewhere (e.g. on the wiki) and link to that documentation.

isTravis commented 5 years ago

Good questions. This would be a really useful thing to have the Cisco folks formalize. The sitemap (as best I can remember) mirrors the structure they had agreed to send to Google back before MIT was involved. I vaguely remember needing to map one of the column names we use in our database to a key that the Google side was expecting.

I hate to pass the answer here, but I think a call with Shikha Mahajan is needed to get you a concrete answer. @metasj we have a call with her coming up, right? (Even better, it'd be great if she were involved with this repo).

To clarify - there is a loop here I never had full insight into. I think it was something like:

Sitemap generated.
Google creates CPC codes and returns a file that uses a specific key from the sitemap to associate with the generated CPC codes.
Elasticsearch parses that file and adds the CPC codes to the index.
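
The association step in that loop could be sketched roughly as follows. Everything here is an assumption for illustration: the field names id, key, and cpcCodes, and the function attachCpcCodes, are hypothetical, since the actual file format is not documented in this thread.

```javascript
// Joins returned CPC rows back onto sitemap entries by a shared key.
// Field names (id, key, cpcCodes) are hypothetical, not the real format.
function attachCpcCodes(sitemapEntries, cpcRows) {
  const codesByKey = new Map(cpcRows.map((row) => [row.key, row.cpcCodes]));
  return sitemapEntries.map((entry) => ({
    ...entry,
    // Entries with no returned row get an empty list.
    cpcCodes: codesByKey.get(entry.id) || [],
  }));
}
```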

Some of the fields in the sitemap were used (I believe) to dedupe and prevent unnecessary Elasticsearch index updates. If we are using hashes or UUIDs, I'm not sure we need all the fields in the sitemap, but I'm guessing here.

prior-art-archive / priorartarchive.org

Generate sitemaps #8
