newfs / gobotany-app

Deployable code for the Go Botany application

Checklist of items before we make the repo public #309

Closed sidkoul closed 11 years ago

sidkoul commented 11 years ago

Before we make the gobotany-app GitHub repo public, we should put together a checklist of files that should be removed from the repo or its commit history, e.g.

[ ] facebook keys
[ ] csv data
[ ] images
[ ] ssh keys

If you think of any additional items, add them to this list.

brandon-rhodes commented 11 years ago

Let's start with the Facebook keys:

brandon-rhodes commented 11 years ago

Next, SSH keys:

brandon-rhodes commented 11 years ago

Next, CSV data. There never seem to have been any files whose names end in CSV, since the following command produces no output:

git log --name-status --all -- '*.CSV'

All of our data files use a lowercase .csv extension. Many of them have lived in several different places over their time in the repository, but we can get a list of their bare file names, each listed only once, by running the following command line. (The \t trick works because git log with a --name-status argument puts a literal tab character in front of each filename, which lets us distinguish filename lines from the text of each commit message.)

git log --name-status -- '*.csv' | grep -P '\t' | awk -F/ '{print $NF}' | sort -u
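For anyone who would rather do this filtering in a script, here is a rough Python sketch of the same tab trick. It assumes the output of git log --name-status has already been captured as a string; the function name `changed_basenames` and the sample log are made up for illustration and are not taken from the repository.

```python
import posixpath


def changed_basenames(log_text):
    """Return sorted, de-duplicated base names from name-status log text."""
    names = set()
    for line in log_text.splitlines():
        # Status lines look like "M<TAB>path" (or "R100<TAB>old<TAB>new"),
        # while commit messages are indented with spaces, so a tab marks
        # a filename line.
        if "\t" not in line:
            continue
        fields = line.split("\t")
        for path in fields[1:]:
            names.add(posixpath.basename(path))
    return sorted(names)


# Made-up sample resembling `git log --name-status -- '*.csv'` output.
SAMPLE_LOG = (
    "commit 1111111\n"
    "Author: Someone <someone@example.com>\n"
    "\n"
    "    Move test data\n"
    "\n"
    "R100\tgobotany/core/taxa.csv\tgobotany/core/testdata/taxa.csv\n"
    "M\tgobotany/core/testdata/characters.csv\n"
)

print(changed_basenames(SAMPLE_LOG))  # ['characters.csv', 'taxa.csv']
```

Unlike the awk step, this sketch records both the old and the new name on a rename line. Like the grep, it would be fooled by a commit message that happens to contain a tab.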

The resulting list of CSV files that have ever been in the repository is:

character_values.csv
characters.csv
image_categories.csv
pile_group_info.csv
pile_info.csv
pile_lycophytes.csv
pile_non_orchid_monocots_1.csv
pile_non_orchid_monocots_2.csv
pile_non_orchid_monocots_3.csv
taxa.csv
test_characters.csv
test_taxons.csv
wetland_indicators.csv

This list has 13 files in it, which is very nearly the number (11) of CSV files that currently exist in the tree:

$ git ls-tree --name-only -r HEAD | grep '\.csv$'
gobotany/core/image_categories.csv
gobotany/core/testdata/character_values.csv
gobotany/core/testdata/characters.csv
gobotany/core/testdata/pile_group_info.csv
gobotany/core/testdata/pile_info.csv
gobotany/core/testdata/pile_lycophytes.csv
gobotany/core/testdata/pile_non_orchid_monocots_1.csv
gobotany/core/testdata/pile_non_orchid_monocots_2.csv
gobotany/core/testdata/pile_non_orchid_monocots_3.csv
gobotany/core/testdata/taxa.csv
gobotany/core/testdata/wetland_indicators.csv

Sid, this means that you can go into the most recent checkout and review these CSV files to decide what to do about them. I would recommend that, if possible, we leave a rudimentary data set in the public repository with the app, so that potential partners in the future can download and try out the app locally without needing one of our big .csv bundles that we keep password-protected on Amazon S3. Are these testdata files old enough and small enough that they can stick around as a test data set? Or should we look into axing them?

The two files that you cannot examine by looking at that list of 11 extant files are:

test_characters.csv
test_taxons.csv
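Those two names are just the set difference between the two listings above. A small Python sketch, using the file names exactly as they appear in this thread (the function name `deleted_names` is only for illustration):

```python
import posixpath

# The 13 names that have ever appeared in the history.
EVER = [
    "character_values.csv", "characters.csv", "image_categories.csv",
    "pile_group_info.csv", "pile_info.csv", "pile_lycophytes.csv",
    "pile_non_orchid_monocots_1.csv", "pile_non_orchid_monocots_2.csv",
    "pile_non_orchid_monocots_3.csv", "taxa.csv", "test_characters.csv",
    "test_taxons.csv", "wetland_indicators.csv",
]

# The 11 paths that exist in the current tree.
CURRENT = [
    "gobotany/core/image_categories.csv",
    "gobotany/core/testdata/character_values.csv",
    "gobotany/core/testdata/characters.csv",
    "gobotany/core/testdata/pile_group_info.csv",
    "gobotany/core/testdata/pile_info.csv",
    "gobotany/core/testdata/pile_lycophytes.csv",
    "gobotany/core/testdata/pile_non_orchid_monocots_1.csv",
    "gobotany/core/testdata/pile_non_orchid_monocots_2.csv",
    "gobotany/core/testdata/pile_non_orchid_monocots_3.csv",
    "gobotany/core/testdata/taxa.csv",
    "gobotany/core/testdata/wetland_indicators.csv",
]


def deleted_names(ever, current_paths):
    """Names seen in history whose base name is absent from the current tree."""
    current = {posixpath.basename(path) for path in current_paths}
    return sorted(set(ever) - current)


print(deleted_names(EVER, CURRENT))
# ['test_characters.csv', 'test_taxons.csv']
```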

Here is how they looked at their moment of deletion:

https://github.com/newfs/gobotany-app/blob/5da36102d0da77983f52e7d83d4331421d52ab2c/gobotany/core/test_characters.csv
https://github.com/newfs/gobotany-app/blob/5da36102d0da77983f52e7d83d4331421d52ab2c/gobotany/core/test_taxons.csv

To me, they do not appear to contain any information that we would need to worry about: less, in fact, than someone could glean by screen-scraping even a single public page of our online species app. So the real decision, as I see it, is about the CSV files that have survived to the present in the repository and that are part of our test suite (I believe?).

brandon-rhodes commented 11 years ago

Finally, what about images? There are several kinds. (And I did check in each case for images with capitalized extensions, but found none.)

git log --name-status -- '*.gif' | grep -P '\t' | awk '{print $NF}' | sort -u

This shows 714 .gif paths, nearly all of them in dojox and dijit directories and therefore already distributed publicly with Dojo and friends. The dozen or so remaining images are all in our static directory and are various decorations that are easy enough for someone to grab from the web site today with wget.

git log --name-status -- '*.png' | grep -P '\t' | egrep -v 'dojo|dijit' | awk '{print $NF}' | sort -u

Here I was more careful, and filtered out Dojo images right in the command, to produce 109 .png images that live or have lived in the repository. They all live under /static/ and, unless I am mistaken, are served up publicly today by the app, and are therefore not private — anyone with wget could grab them right off of the site.

There are no files named .jpeg — O tempora, O mores!

git log --name-status -- '*.jpg' | grep -P '\t' | egrep -v 'dojo|dijit' | awk '{print $NF}' | sort -u

Finally, the above command lists 40 .jpg files, nearly all of which live beneath /static/graphics or /static/images and appear to be served up as part of the site design. The two exceptions are these:

gobotany/api/testdata/huperzia-appressa-ha-dkausen-1.jpg
gobotany/api/testdata/huperzia-appressa-sc-dkausen-2.jpg

These two images are loaded and saved as part of our tests for the Go Botany API, and it would be helpful if they could stay in the repository. From what I can see, they are publicly available on the web site to anyone browsing the Lycophytes, and so we are not exposing any extra information to the public by leaving these in the repository.

Unless, therefore, anyone can think of any files that I am missing (did we use any extensions beyond these few for images?), the images in the repository history should remain there as the repository is made public.
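One way to settle the question about other extensions would be to tally every extension that shows up in the history, rather than checking the image types one at a time. A minimal Python sketch, assuming the paths have already been collected into a list; the function name `extension_counts` and two of the sample paths are made up for illustration:

```python
import posixpath
from collections import Counter


def extension_counts(paths):
    """Count files per lowercased extension, so .JPG and .jpg land together."""
    counts = Counter()
    for path in paths:
        ext = posixpath.splitext(path)[1].lower()
        counts[ext or "(none)"] += 1
    return counts


paths = [
    # These two paths come from this thread.
    "gobotany/api/testdata/huperzia-appressa-ha-dkausen-1.jpg",
    "gobotany/api/testdata/huperzia-appressa-sc-dkausen-2.jpg",
    # These two are hypothetical, only to show the grouping.
    "gobotany/static/images/example-logo.PNG",
    "gobotany/static/images/example-icon.svg",
]

for ext, count in extension_counts(paths).most_common():
    print(ext, count)
```

Any unexpected image extension would then stand out in the printed tally.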

brandon-rhodes commented 11 years ago

tl;dr

Sid and JR, please disable all RECAPTCHA keys that lived in settings.py and then replace them with code like the code I have written for Facebook settings that tries to grab them from the environment.
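To make the request concrete, here is a minimal sketch of reading the keys from the environment. It is not the actual Facebook-settings code (which is not shown in this thread); the helper name `env_setting` is hypothetical, and the variable names are the ones mentioned further down this thread.

```python
import os


def env_setting(name, default=""):
    """Read a setting from the environment, falling back to a harmless default."""
    return os.environ.get(name, default)


# In settings.py the secrets are then looked up at import time
# instead of being hard-coded and committed.
RECAPTCHA_PUBLIC_KEY = env_setting("RECAPTCHA_PUBLIC_KEY")
RECAPTCHA_PRIVATE_KEY = env_setting("RECAPTCHA_PRIVATE_KEY")
```

With an empty default, a developer without the variables set can still import the settings; whether the app should instead fail loudly when a key is missing is a design choice left open here.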

Sid, I suspect that all files should stay in the repository. I have just checked, and every CSV data file still in the repository is still used today by our tests or code — none of them are inadvertent leftovers. Furthermore, most of them are quite small, and obviously are tiny samples of our much larger data sets.

The one exception that I can see is taxa.csv, which lists 3,917 taxa and appears to have been our entire database when the file was generated back in March 2012. It includes 36 fields, among them conservation statuses and synonyms, and getting hold of the file would save someone maybe an hour or two of screen-scraping our species pages to get the same information. Take a look, Sid, and if you don't want the whole species file to be public then we can destroy it and perhaps use a tiny fake species file, with only a dozen or so species drawn from the real file, to drive the tests that depend on it.
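If we go the tiny-file route, trimming could be as simple as keeping the header plus the first dozen rows. A rough Python sketch under that assumption; the function name `sample_rows` and the column names in the stand-in data are hypothetical, and the tests may well need specific species rather than just the first twelve:

```python
import csv
import io


def sample_rows(source, limit=12):
    """Return CSV text holding the header plus the first `limit` data rows."""
    reader = csv.reader(source)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(reader):
        if index > limit:  # row 0 is the header, so this keeps `limit` data rows
            break
        writer.writerow(row)
    return out.getvalue()


# Made-up stand-in for the real file; these are not the real columns.
fake = io.StringIO(
    "scientific_name,family\n"
    + "".join("Species {0},Family {0}\n".format(n) for n in range(1, 31))
)
small = sample_rows(fake, limit=12)
print(len(small.splitlines()))  # 13: the header plus 12 data rows
```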

What do you think?

jrrickerson commented 11 years ago

Moved django-registration and reCAPTCHA config to the environment, regenerated "localhost" reCAPTCHA keys, and added a dev/environment script for convenience.

a1ab2eeb4efa3d95ca354ca561e72ad01dc9f6d0
e173f694fb4b29f59132a268674e692fd2c1b2bb
f135f82305d6f9cb4cdc7515475ad5a4e215d45e
500fef3a9bb4428210ac59c88b89dd75fe8ee60c

Once Sid recreates the reCAPTCHA keys for the public dev and production environments, and we add them to the Heroku config, this task should be complete.

brandon-rhodes commented 11 years ago

Sid, where do things stand with respect to the reCAPTCHA key re-creation?

sidkoul commented 11 years ago

I have deleted the old reCAPTCHA keys, created new keys for the dev server domain, and stuck them in the Heroku application's RECAPTCHA_PRIVATE_KEY and RECAPTCHA_PUBLIC_KEY environment variables.

The PlantShare registration page uses reCAPTCHA and you can see it in action there -- http://gobotany-dev.herokuapp.com/ps/accounts/register/

brandon-rhodes commented 11 years ago

Great! Let me know if there are any final preparations to make before you press the “Make public” button.

(Or whatever it's called.)

sidkoul commented 11 years ago

I have checked with Bill--we are keeping our sample test data set in the repo. It looks like we've covered all the items in our checklist. Have I missed any?

sidkoul commented 11 years ago

Repo is public.