Closed by slifty 5 years ago
Some questions for @isTravis:
Good questions. This would be a really useful thing to have the Cisco folks formalize. The sitemap (as best I can remember) mirrors the structure they had agreed to send to Google back before MIT was involved. I vaguely remember needing to map one of the column names we use in our database to a key that the Google side was expecting.
I hate to pass the buck here, but I think we need a call with Shikha Mahajan to get you a concrete answer. @metasj, we have a call with her coming up, right? (Even better, it'd be great if she were involved with this repo.)
To clarify - there is a loop here I never had full insight into. I think it was something like:
Some of the fields in the sitemap were used (I believe) to dedupe and prevent unnecessary Elasticsearch index updates. If we are using hashes or UUIDs, I'm not sure we need all the fields in the sitemap, but I'm guessing here...
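To make the hash idea concrete, here is a minimal sketch of how a content hash could stand in for the extra sitemap fields when deciding whether to touch the index. Everything in it is an assumption for illustration: the helper names contentHash and needsReindex, the doc.id / doc.content shape, and the in-memory Map of previously indexed hashes are hypothetical and not taken from this repo or from the v1 pipeline.

```javascript
const crypto = require('crypto');

// Hypothetical helper: fingerprint a document's content as a SHA-256 hex digest.
function contentHash(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Hypothetical helper: decide whether a document needs to be (re)indexed.
// `indexedHashes` is an assumed Map of document id -> hash recorded the
// last time that document was indexed.
function needsReindex(doc, indexedHashes) {
  const hash = contentHash(doc.content);
  if (indexedHashes.get(doc.id) === hash) {
    return false; // unchanged since the last indexing run, skip the update
  }
  indexedHashes.set(doc.id, hash); // remember the new fingerprint
  return true;
}
```

If something like this holds, the sitemap would only need an identifier and a hash per document, though whether the Google side expects more than that is exactly the open question.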
This PR generates sitemaps, which is the first part of the CPC process (issue #10).
Changes
Some key changes from the way this works in v1:
Some fields (sourcePath and dateUploaded) don't seem to be represented any longer in the model.
New Config
Two new lines need to be added to your config file:
process.env.SITEMAP_DESTINATION_S3_BUCKET = ''
process.env.SITEMAP_DESTINATION_PATH = ''
The first is the S3 bucket (e.g. prior-art-archive-sftp). The second is the path within the S3 bucket (e.g. _priorArtArchive/sitemap_dev.txt).
This PR also sneaks in an editor config file, which should be its own PR but don't tell anybody and they might not notice.
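For anyone wiring up the two new config values, here is a rough sketch of how they might be read and validated before an upload. The helper name getSitemapDestination is hypothetical, and the Bucket/Key shape is just an assumption about what an S3 client would want; this is not the code in the PR.

```javascript
// Hypothetical helper: read the two new settings and turn them into
// the Bucket/Key pair an S3 upload would need.
function getSitemapDestination(env = process.env) {
  const bucket = env.SITEMAP_DESTINATION_S3_BUCKET;
  const key = env.SITEMAP_DESTINATION_PATH;
  // Both values default to '' in the config file, so treat empty as "not set".
  if (!bucket || !key) {
    throw new Error(
      'SITEMAP_DESTINATION_S3_BUCKET and SITEMAP_DESTINATION_PATH must both be set'
    );
  }
  return { Bucket: bucket, Key: key };
}
```

The returned object could be passed, together with the sitemap body, to whatever S3 client the project uses. Failing early on empty values should make a missing config entry obvious instead of producing a confusing upload error.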