Open tlongers opened 2 years ago
Thanks for your feedback and questions @tlongers. I'm noticing a theme in these questions — it seems like you want to do some extra processing. This is possible, we just need to decide on what to process so we can manage all of that without making the code too complex. cc @hancush
explain the mechanic please.
It's quite simple. It downloads the raw data files from Google Drive and packages them in a zip archive. The zip archive is uploaded to AWS S3, which is then retrievable via an S3 URL whenever anybody visits the download page. There is no filtering — we just download the same files that we input into the importer and zip those files up. This process is a part of the existing import process. We added some extra steps to the existing import process, and input the same files into the importer and zip archive.
is there an opportunity to process the files further
We could do this, though it could become quite involved of a task depending on what all you want to process. Removing columns wouldn't be too much work. But since the archive is just the raw files, there isn't a primary key in those files (at least not one created by the database. There might some key in the spreadsheets?). If you want a primary key from the database, then we'd have to change the process so that we dump out what's in the database rather than using the raw data files. As for validation, the zip archive creation is a part of the import process and won't be uploaded to S3 unless the importer has passed the validation and imported all of the data.
can we generate an variable for archive filesize + number of folders/files inside that can be passed to some additional explanatory text about the archive?
It would be easy to get the archive size, but counting the number of folders/files inside could be tricky. Once the archive is uploaded to S3, the files are deleted from the server and we'd have no way to count without storing this information in the database or decompressing the archive (decompressing would slow down the download page load). However we could share how many countries are contained within the archive without having to do too much.
possible to compile it each country data into a single spreadsheet to spare the user some time doing that?
Yes. It would require some extra processing but not too involved.
possible to expand the filenaming conventiont to "[version]_[two-letter-iso]_entity.csv"?
Yes, this would be quite simple.
what's stopping spamming/misuse of the download URL?
Good question. That URL will expire after an hour. We can do a number of things to prevent spam downloads. We could make the URL expire in a shorter time period, which in theory would shorten the amount of time for spam. We can setup that page so that there is a cache and it doesn't recreate a link on every page load. We could obfuscate this link and not download directly from S3 — this would be a bit involved but might be worth it if there's a higher risk of spam downloading. S3's biggest attack vulnerability is pricing, because a spammer can essentially send out X amount of requests per second on that S3 object and this could become expensive. It costs money per request to retrieve data from S3, though there is throttling which can limit this problem. GET requests cost $0.0004 per 1,000 requests.
is it wired into the web analytics so we can see its use?
We'd be able to see the visits to the download page, but we haven't wired up any event tracking for when somebody presses the "download" button. We have tracking configured only for the production site, so we'd be able to set up download event tracking once we release this change to production.
behind the scenes, is it keeping the old downloads anywhere?
No. It replaces the old data when the latest data is uploaded. We discussed whether we should add a way to download old data, but decided that we'd keep it simple with this first version.
My thoughts ...
We can perform these steps our side, but what would be the work WWIC-side to retain but blank out values from the following attributes in the download package
unit:owner:admin
unit:comments:admin
person:comments:admin
person:owner:admin
incident:comments:admin
incident:owner:admin
source:owner:admin
source:comments:admin
Removing values from these attributes means that we keep some of our internal commentary private and don't have worry about ensuring that is legible to an outside audience.
.csv
into a single .xslx
file for each country? This is more user-friendly. We use sheets.js
What can be added into the download package (or each country spreadsheet) etc that states what the download is (full filelist, generation date), the author and licensing terms, along with our contact details? We tend to include a notes
sheet that contains all this detail. At the least, a copy of the CC4.0 unported license will need to be included.
Heading:
Download data from WhoWasInCommand.com
Paragraph/sub-heading
WhoWasInCommand.com is a free, public database of police, military and other security and defence forces. Click the "Download" button below to get a copy of all the data in the WhoWasInCommand.com database.
Single large button:
"Download(.zip, x MB)"
Sub-heading / para (See what looks good)
What's in the download file?
The download file includes spreadsheets containing data for each country in the WhoWasInCommand.com database. Each spreadsheet includes:
The download file also contains:
The data on published WhoWasInCommand are created by Security Force Monitor, an investigative team based at the Human Rights Institute of Columbia Law School. We provide this data to assist journalists, human rights workers, litigators and others seeking accountability for human rights abuses perpetrated by security and defense forces.
@tlongers
what would be the work WWIC-side to retain but blank out values from the following attributes in the download package
We can do this within the download packaging process. It's not a heavy lift to filter out columns completely or change the values to be empty. We're already working on filtering out any rows that aren't publishable (where < data model >:status:admin != 3
), so we'll just have to filter out those columns you listed, too. Would removing the columns suffice? It would be simplest to remove the columns completely instead of changing the values to be empty.
Is it possible to package all the .csv into a single .xslx file for each country?
Yes we can do this, though shepherding csv into xlsx files isn't the cleanest process. Sometimes the processors can cast floats into weird values — I've had this happen before with sheets.js and it can be frustrating. These sort of issues are possible to work around, but it takes some extra time. I estimate we could do it in 4-6 hours in the best case, though it could go longer if we run into any problems. Plus y'all would need to do extra qa to check that the data right. With that in mind, do you think the value of the user friendliness is worth the time and effort?
Instead of the two letter iso code ("bf") for each country, could a slugfied full name be used ("burkina_faso") for the filename and directory?
The iso code is coming from the import docket, so we use those values to name the file. If y'all could add the slugified name as a new column in the import docket, then we could use that for the names.
Can sources specific to a country's data be filtered out and pulled into a sources sheet just for that country?
Ideally y'all could provide each country's sources within the country's spreadsheet, otherwise we'd need to do some extra processing. The existing import process is knowledgeable about the sources relationship with the different models, but this new process to package the data for download is ignorant of the data relationships. We can make it smarter, but that would be a good bit of work and possibly redundant since we're already managing that in the well-tested import process.
Can we provide a full copy of the documentation inside the zip, to accompany the data? Yes, we can do that. We'll just have to download it directly from the link you provided. It would be best if y'all included a license that we could package up. As for the metadata-like pieces of information (filelist etc), we think it would be best if y'all created a readme that includes that information.
Would removing the columns suffice? It would be simplest to remove the columns completely instead of changing the values to be empty.
We should keep the columns, but empty them.
With that in mind, do you think the value of the user friendliness is worth the time and effort?
Yes, we've come across this specific issue with sheets.js too, and it's provide quite painful to resolve. Forget this for now - it's a tar pit.
The iso code is coming from the import docket, so we use those values to name the file. If y'all could add the slugified name as a new column in the import docket, then we could use that for the names.
That's great; I can do that.
Ideally y'all could provide each country's sources within the country's spreadsheet, otherwise we'd need to do some extra processing. [...]
So you'd be looking for a sources
tab in Google Sheet used to import? Is the importer fussy about tabs it doesn't recognise?
Yes, we can do that. We'll just have to download it directly from the link you provided. It would be best if y'all included a license that we could package up. As for the metadata-like pieces of information (filelist etc), we think it would be best if y'all created a readme that includes that information.
Yep, we can do that. Where should it live; in this repo?
@tlongers No, the importer wouldn't be fussy about an unknown tab.
Good question about where the readme should live. Do you think y'all would want to make edits over time, or would the content be more static? We could source that from a Google Drive folder if you think y'all will make a lot of random changes, otherwise it can live in the repo. Living the in the repo would be simplest from a implementation perspective.
We could source that from a Google Drive folder if you think y'all will make a lot of random changes, otherwise it can live in the repo. Living the in the repo would be simplest from a implementation perspective.
Likely we would update it with each import, indicating the data versions (which are just timestamps), and noting any unfamiliar additions (like new countries). We're happy contributing to a plaintext file in the repo, or in a separate one that the deploy process can call on. The issue I can see here is how to deal with deployments that are non-WWIC.
@tlongers Gotcha. Let us know what y'all decide to do and we can add it to the packaging.
We're close to having these changes ready for your review. We just need these things:
Feedback thread on implementation of #807 (via PR #810 to #815)
Impressions:
Questions:
Next up: