prior-art-archive / priorartarchive.org

Prior Art Archive Site
https://priorartarchive.org
GNU General Public License v2.0

Generate sitemaps #8

Closed (slifty closed 5 years ago)

slifty commented 5 years ago

This PR generates sitemaps, which is the first part of the CPC process (issue #10).

Changes

Some key changes from the way this works in v1:

  1. Instead of promise chains, this uses async functions, which makes the code more readable.
  2. Some of the fields (sourcePath and dateUploaded) no longer seem to be represented in the model.
  3. I forgot to re-implement company title lookup, whoops; doing that now, but please review anyway.
  4. AWS locations are now driven by config files.
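
To illustrate the first point, here is a minimal sketch of the same logic written both ways. It is not the code in this PR: `fetchDocuments`, `toSitemapLine`, and the tab-separated line format are hypothetical stand-ins chosen only for the comparison.

```javascript
// Hypothetical stand-in for a database query; resolves with a list of documents.
function fetchDocuments() {
  return Promise.resolve([
    { id: "doc-1", title: "First" },
    { id: "doc-2", title: "Second" },
  ]);
}

// Hypothetical formatter: one tab-separated line per document.
function toSitemapLine(doc) {
  return `${doc.id}\t${doc.title}`;
}

// Promise-chain style: each step is a callback passed to .then().
function buildSitemapWithPromises() {
  return fetchDocuments()
    .then((docs) => docs.map(toSitemapLine))
    .then((lines) => lines.join("\n"));
}

// Async-function style: same behavior, but the steps read top to bottom.
async function buildSitemapAsync() {
  const docs = await fetchDocuments();
  const lines = docs.map(toSitemapLine);
  return lines.join("\n");
}
```

Both functions return a promise for the same string, so callers do not need to change when a function is converted from one style to the other.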

New Config

Two new lines need to be added to your config file:

The first is the S3 bucket (e.g. prior-art-archive-sftp). The second is the path within the S3 bucket (e.g. _priorArtArchive/sitemap_dev.txt).
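
As a rough sketch of how two such values could drive the upload destination: the key names `sitemapBucket` and `sitemapPath` below are placeholders, since the actual config keys are not shown in this thread, and `getSitemapLocation` is a hypothetical helper. The returned object is shaped like the `Bucket` and `Key` parameters that the AWS SDK's S3 calls accept.

```javascript
// Hypothetical config shape; the real key names are not shown in this thread.
const exampleConfig = {
  sitemapBucket: "prior-art-archive-sftp",
  sitemapPath: "_priorArtArchive/sitemap_dev.txt",
};

// Build the upload destination from config, failing loudly if a value is missing.
function getSitemapLocation(config) {
  const { sitemapBucket, sitemapPath } = config;
  if (!sitemapBucket || !sitemapPath) {
    throw new Error("Sitemap bucket and path must both be set in config");
  }
  return { Bucket: sitemapBucket, Key: sitemapPath };
}
```

Checking both values up front means a missing config line surfaces as a clear error rather than as a failed upload later on.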

This PR also sneaks in an editor config file, which should be its own PR but don't tell anybody and they might not notice.

slifty commented 5 years ago

Some questions for @isTravis:

isTravis commented 5 years ago

Good questions. This would be a really useful thing to have the Cisco folks formalize. The sitemap (as best I can remember) mirrors the structure they had agreed to send to Google back before MIT was involved. I vaguely remember needing to map one of the column names we use in our database to a key that the Google side was expecting.

I hate to pass the answer along here, but I think a call with Shikha Mahajan is needed to get you a concrete answer. @metasj, we have a call with her coming up, right? (Even better, it'd be great if she were involved with this repo.)

To clarify - there is a loop here I never had full insight into. I think it was something like:

Some of the fields in the sitemap were used (I believe) to dedupe and prevent unnecessary Elasticsearch index updates. If we are using hashes or UUIDs, I'm not sure we need all the fields in the sitemap, but I'm guessing here...
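
For what it's worth, a minimal sketch of the hash-based dedupe idea, assuming flat document records with an `id` field. `contentHash`, `selectChanged`, and the in-memory `Map` of previous hashes are hypothetical names for illustration, not anything in this repo or in the existing pipeline.

```javascript
const crypto = require("crypto");

// Hash a flat record. Keys are sorted so property order does not change the hash.
// Note: the array replacer also filters nested objects, so this assumes flat records.
function contentHash(doc) {
  const stable = JSON.stringify(doc, Object.keys(doc).sort());
  return crypto.createHash("sha256").update(stable).digest("hex");
}

// seenHashes maps document id -> content hash from the previous run.
// Returns only documents that are new or whose content has changed,
// and records their new hashes so the next run skips them.
function selectChanged(docs, seenHashes) {
  const changed = [];
  for (const doc of docs) {
    const hash = contentHash(doc);
    if (seenHashes.get(doc.id) !== hash) {
      changed.push(doc);
      seenHashes.set(doc.id, hash);
    }
  }
  return changed;
}
```

With something like this, only the documents returned by `selectChanged` would need to be re-indexed, and the sitemap itself would need to carry just an id and a hash rather than every field.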