Add CSV download url for all cases, methods and orgs

ascott commented 5 years ago

There is an older issue to track this, but it was based on the old model, so I'm creating this new issue to track our progress and decisions.

enable urls to download csv data dumps of each article type: cases, orgs, methods.

/?selectedCategory=case&returns=csv
/?selectedCategory=method&returns=csv
/?selectedCategory=organizations&returns=csv

the csv will include all articles and will not be a subset based on search
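A minimal sketch of how a client might build these download URLs. The host `https://participedia.net/` is the production host given later in this thread; the helper name `csv_download_url` is hypothetical and not part of the API.

```python
from urllib.parse import urlencode

# Production host as given later in this thread.
BASE_URL = "https://participedia.net/"

# Category values exactly as they appear in the URLs above.
CATEGORIES = ("case", "method", "organizations")


def csv_download_url(category: str) -> str:
    """Return the CSV dump URL for one article type (hypothetical helper)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    query = urlencode({"selectedCategory": category, "returns": "csv"})
    return f"{BASE_URL}?{query}"


for category in CATEGORIES:
    print(csv_download_url(category))
```

Because the dump always contains every article of that type, no search parameters are passed along.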

ascott commented 5 years ago

for reference here is the csv i originally created: https://docs.google.com/spreadsheets/d/1VtyjbjfBYpDiI9v9SSYAEjCWejYx-wzTlL_8JgnwNk4/edit#gid=893083033 and scott's feedback notes: https://docs.google.com/document/d/1JzlhPvDiBU9l3kn5tqh5bJxDBaJf8aQXzCGDhUL85e0/edit#

ascott commented 5 years ago

@scottofletcher @jesicarson @plscully here is the latest csv file with incorporated changes from scott's notes and our meeting: https://docs.google.com/spreadsheets/d/1MZDvZRWCWJDkZqEKW0NFgcG4CNNNZq4-M0ubWgTScDQ/edit?usp=sharing (this is just placeholder data for now from my local database while we get the format nailed down)

changes:

added url to article
added creator_id, creator_name, creator_profile_url fields (and same for last_updated_by)
converted file and link type fields to counts of how many of each type were uploaded (ie: files_count: 1)
converted is_component_of and primary_organizer fields to 3 new fields with id, title and url
for fields like has_components and specific_methods_tools_techniques i added new fields with the count for each of these fields. since they can contain many items with id, type, title, it makes it difficult to share that in the csv file since it's a list of nested data. i thought counts for these fields might be useful.
i formatted all date fields into ISO 8601 format (eg: 2019-07-17T22:00:05.498Z), which can be sorted by date. we need to use a date/time format and not just a date format so they can be sorted not only by date but also by time updated.
removed square brackets from list fields (eg: []), and converted to a comma separated list
stripped html from the body field so we are including text only
reordered the fields
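To illustrate a few of the changes above (ISO 8601 dates, counts instead of nested file lists, comma separated lists, text-only body), here is a rough sketch in Python. It is not the code used in the API; the record layout and the key names `updated_date`, `general_issues` and `files` are assumptions made up for the example.

```python
import re
from datetime import datetime, timezone
from html import unescape


def iso_timestamp(dt: datetime) -> str:
    """Format a datetime as ISO 8601 in UTC with millisecond precision."""
    utc = dt.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def strip_html(body: str) -> str:
    """Crude tag removal for illustration; a real HTML parser is more robust."""
    text = re.sub(r"<[^>]+>", " ", body)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def flatten(article: dict) -> dict:
    """Turn one nested article record (assumed shape) into a flat CSV row."""
    return {
        "id": article["id"],
        "updated_date": iso_timestamp(article["updated_date"]),
        # list field: no square brackets, just a comma separated string
        "general_issues": ", ".join(article["general_issues"]),
        # nested uploads: report how many there are rather than the items
        "files_count": len(article["files"]),
        "body": strip_html(article["body"]),
    }


# Placeholder record, invented for this example.
example = {
    "id": 1,
    "updated_date": datetime(2019, 7, 17, 22, 0, 5, 498000, tzinfo=timezone.utc),
    "general_issues": ["environment", "health"],
    "files": [{"url": "https://example.org/report.pdf"}],
    "body": "<p>Placeholder <strong>body</strong> text.</p>",
}
row = flatten(example)
print(row)
```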

plscully commented 5 years ago

@ascott This looks great! Thank you! I downloaded the file as Excel. Here's a screenshot of the msg I saw when I tried to open it. https://www.dropbox.com/s/6ml1trq5g8ugfem/Screenshot%202019-07-18%2009.32.20.png?dl=0 .... Once the download was complete, I saw this msg https://www.dropbox.com/s/0cpei0f7lszkuf8/Screenshot%202019-07-18%2009.28.49.png?dl=0 . At first glance, the converted Excel file looks OK, but you and @scottofletcher will be able to judge that.

plscully commented 5 years ago

Here's the Excel file
participedia-data-cases-(local placeholder data)-july17.xlsx

scottofletcher commented 5 years ago

couple things:

primary organizer (id, title, url) should be moved after insights_outcomes
implement the same id/title/url format for has_components (although the count is useful, we need a full list)
implement the same id/title/url format for specific_methods_tools_techniques and move those columns after tools_techniques_types (again, the count is useful, but our partners want specific entries)

ascott commented 5 years ago

@plscully thanks for the excel screenshots. could you send me the log file it links to in the last screenshot you sent?

@scottofletcher for has_components and specific_methods_tools_techniques, these fields can have multiple items so they are not like is_component_of where it represents a single article where we can convert to 3 fields with id, url and title. for each item in has_components and specific_methods_tools_techniques we would need to create the three fields and label them with numbers as well. could we limit these to the first 3 items for each? that would look like adding the following columns: has_components_1_id, has_components_1_title, has_components_1_url, has_components_2_id, has_components_2_title, has_components_2_url, has_components_3_id, has_components_3_title, has_components_3_url (and the same for specific_methods_tools_techniques.) what do you think?
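A rough sketch of the numbered-column idea, to make the trade-off concrete. The function name `numbered_columns` and the sample items are hypothetical; only the column naming pattern comes from the comment above.

```python
def numbered_columns(prefix, items, limit=3):
    """Spread up to `limit` linked articles across numbered id/title/url columns."""
    row = {}
    for n in range(1, limit + 1):
        # Pad with empty cells when an article links to fewer than `limit` items.
        item = items[n - 1] if n <= len(items) else {}
        for key in ("id", "title", "url"):
            row[f"{prefix}_{n}_{key}"] = item.get(key, "")
    return row


# Invented sample data.
components = [
    {"id": 101, "title": "Component A", "url": "https://example.org/case/101"},
    {"id": 102, "title": "Component B", "url": "https://example.org/case/102"},
]
row = numbered_columns("has_components", components)
print(list(row))
```

Every row always gets the same nine columns, which keeps the CSV rectangular, but any item past the limit is silently dropped.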

scottofletcher commented 5 years ago

yes, that's what I thought we'd have to engineer, but we can't limit it to 3 b/c the whole point of those fields is to encourage users to link to as many (relevant) cases and methods/tools as possible (thereby creating more robust, interlinking datasets). I'm not sure what to do about that - perhaps Matt and Kate will be able to weigh in. the old .net csv simply listed their titles separated by commas (like we currently do for other fields like gen issues, spec topics, etc.). is it possible to do that for now?

ascott commented 5 years ago

@scottofletcher yes we could list their titles or the urls separated by commas. would title or url be more useful?

plscully commented 5 years ago

@ascott I'm not sure if this is what you are looking for, but I downloaded the file again and then opened the link when the same msg appeared. I then pasted what appeared in my browser into this Word doc
case_Excel data download log 18 July 2019.docx

scottofletcher commented 5 years ago

@ascott probably just title at this stage. the nice thing about methods/tools/techniques is that their titles are pretty short and to-the-point (unlike some of our cases....). components are obviously a bit more lengthy, but I think title is still better than url (at least you can get a sense of what kinds of components the case has). I'll flag these fields as something to bring up with Matt & Kate unless you, @dethe and/or @jesicarson have any ideas?
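For reference, a small sketch of the titles-only approach using Python's standard `csv` module. The helper `titles_cell` and the sample titles are made up for illustration.

```python
import csv
import io


def titles_cell(items):
    """Join the titles of linked articles into one comma separated cell."""
    return ", ".join(item["title"] for item in items)


# Invented sample data.
components = [{"title": "Component A"}, {"title": "Component B"}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "has_components"])
writer.writeheader()
writer.writerow({"id": 1, "has_components": titles_cell(components)})
csv_text = buffer.getvalue()
print(csv_text)
```

The writer quotes the cell automatically because it contains commas, so the list stays in one column. One caveat: a title that itself contains a comma cannot be told apart from two titles once joined this way.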

ascott commented 5 years ago

@plscully thanks, i think that error has something to do with a cell having too much text in it. i will look into it.

@scottofletcher i'll make that change, and then we can adjust as needed after matt and kate have had a chance to review.
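If the Excel message really is about cell length (that is still a guess at this point), one way to check is to scan the CSV for cells longer than Excel's documented limit of 32,767 characters per cell. A sketch, with the function name `oversized_cells` and the sample data being hypothetical:

```python
import csv
import io

# Excel's documented maximum number of characters in a single cell.
EXCEL_CELL_CHAR_LIMIT = 32_767


def oversized_cells(csv_text, limit=EXCEL_CELL_CHAR_LIMIT):
    """Yield (record_number, column, length) for each cell longer than `limit`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for record_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if isinstance(value, str) and len(value) > limit:
                yield record_number, column, len(value)


# Invented sample: the second record has a very long body.
sample = "id,body\r\n1,short\r\n2," + "x" * 40000 + "\r\n"
print(list(oversized_cells(sample)))
```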

scottofletcher commented 5 years ago

it might be that we have to implement what Dethe first suggested: a system for 'hard-core quants' to request a 'hard-core' dataset with all the bells and whistles. I think this works fine for the average folk :)

jesicarson commented 5 years ago

Go team 👏👏

plscully commented 5 years ago

This is great!! Thank you!!

On Thu, Jul 18, 2019 at 9:11 PM jesicarson notifications@github.com wrote:

Go team 👏👏


ascott commented 5 years ago

@scottofletcher these are ready to test with production data now. you can download the csv's with these urls:

https://participedia.net/?selectedCategory=case&returns=csv
https://participedia.net/?selectedCategory=method&returns=csv
https://participedia.net/?selectedCategory=organizations&returns=csv

scottofletcher commented 5 years ago

Awesome!!! I'll try it out first thing tomorrow morning!

scottofletcher commented 5 years ago

Changes needed:

CASES

[ ] columns after original language (column O) should be in the following order:
general issues
specific topics
location (8 columns: address1, address2, city, province, country, lat, long)
scope of influence
is component of (3 columns: id, title, url)
has components title
start date
end date
ongoing
time limited
purpose
approach
public spectrum
number of participants
open or limited
recruitment method
targeted participants
method types
tool/technique types
specific methods/tool/techniques titles
legality
facilitators
facilitator training
face-to-face/online
participant interaction
learning resources
decision methods
if voting
primary organizer (3 columns: id, title, url)
organizer type
funder
funder types
staff
volunteers
evidence of impact
types of change
implementers of change
formal evaluation
body
photos count
files count
videos count
audio count
evaluation report count
evaluation links count
[ ] time_limited values are currently 'repeated' or 'a'. please change 'a' to 'limited'
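The two requests above (a fixed column order and replacing 'a' with 'limited') could be handled in one pass over the CSV. This is only a sketch under assumptions: the function name `reorder` is hypothetical, the column keys in the sample are guesses at how the headings above are spelled in the file, and the order list is abbreviated.

```python
import csv
import io


def reorder(csv_text, column_order):
    """Rewrite a CSV with `column_order` first and fix the time_limited value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Columns not named in column_order keep their relative order at the end.
    rest = [c for c in reader.fieldnames if c not in column_order]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(column_order) + rest)
    writer.writeheader()
    for row in reader:
        if row.get("time_limited") == "a":
            row["time_limited"] = "limited"
        writer.writerow(row)
    return out.getvalue()


# Invented sample with the columns in the "wrong" order.
sample = "time_limited,id,general_issues\r\na,1,health\r\nrepeated,2,environment\r\n"
result = reorder(sample, ["id", "general_issues", "time_limited"])
print(result)
```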

METHODS

[ ] columns after original language (column O) should be in the following order:
face-to-face
method type
typical purpose
public spectrum
open limited
recruitment method
number of participants
types of interaction
facilitation
decision methods
if voting
scope of implementation
level of polarization
level of complexity
body
photos count
files count
videos count
links count
audio count

ORGANIZATIONS

[ ] columns after original language (column O) should be in the following order:
location (8 columns: address1, address2, city, province, country, lat, long)
scope of influence
sector
gen issues
specific topics
type method
type tool
specific method/tool/technique titles
body
photos count
files count
videos count
links count
audio count

QUESTION: the field 'General Types of Methods' is reported differently in the spreadsheets (method_types for cases and methods, type_method for orgs). can we confirm that this is the same field across entry types? (ie. when we have all our sidebar data hyperlinked, if I click on a general type of method in, say, a case, it will return all cases, methods, and orgs that have that type entered in the field relevant to that entry type)

ascott commented 5 years ago

@scottofletcher yes, method_types and type_method are the same across all article types but just have different names

scottofletcher commented 5 years ago

ah, ok good. is it possible to use the same name across datasets?

scottofletcher commented 5 years ago

@ascott is it possible to have the dates in a more succinct format? we just need dd/mm/yyyy. otherwise it makes it difficult for editors to track when edits were made

ascott commented 5 years ago

@scottofletcher the reason for using the current date format of 2019-07-24T14:15:11.206Z is that it's an international standard (ISO 8601) for displaying date & time. A couple of issues with dd/mm/yyyy are that it doesn't include a timestamp and that it isn't read the same way internationally, which could cause confusion:

https://en.wikipedia.org/wiki/Date_format_by_country

Writers have traditionally written abbreviated dates according to their local custom, creating all-numeric equivalents to dates such as '26 July 2019' (26/07/19) and 'July 26, 2019' (07/26/19). This can result in dates that are impossible to understand correctly without knowing the writer's origin and/or other contextual details, as dates such as "10/11/06" can be interpreted as "10 November 2006" in the DMY format, "October 11, 2006" in MDY, and "2010 November 6" in YMD.

i would recommend sticking with the current international format. it can be sorted chronologically, so i'm unclear on how it makes it difficult for editors to track when changes have been made. what issue are you seeing?
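A short sketch of the sorting point, plus one way a spreadsheet user could still derive a dd/mm/yyyy display value from the full timestamp. The first two timestamps appear in this thread; the third is made up, and the helper names are hypothetical.

```python
from datetime import datetime

stamps = [
    "2019-07-24T14:15:11.206Z",
    "2019-07-17T22:00:05.498Z",
    "2019-07-24T09:00:00.000Z",  # invented, same day as the first one
]

# All values share one fixed-width UTC format, so a plain string sort
# is also a chronological sort, including time of day.
ordered = sorted(stamps)


def parse_iso(stamp):
    """Parse the timestamp format used in the CSV (requires Python 3.7+)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


def short_date(stamp):
    """Derive a dd/mm/yyyy display string; the time of day is discarded."""
    return parse_iso(stamp).strftime("%d/%m/%Y")


print(ordered)
print(short_date(stamps[0]))
```

Going from the ISO timestamp to a short date is a one-way conversion: the two edits on 24 July would become indistinguishable as dd/mm/yyyy.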

ascott commented 5 years ago

> ah, ok good. is it possible to use the same name across datasets?

@scottofletcher i will update the csv's so this field uses the same key across csv's
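Something along these lines would do it at export time. Which of the two names ends up being the shared key is not decided in this thread; `method_types` is used here purely as an assumption, and `rename_keys` is a hypothetical helper.

```python
# Assumption: orgs adopt the key already used by cases and methods.
ORG_RENAMES = {"type_method": "method_types"}


def rename_keys(row, renames):
    """Return a copy of `row` with selected keys renamed, order preserved."""
    return {renames.get(key, key): value for key, value in row.items()}


# Invented sample row.
org_row = {"id": 7, "type_method": "deliberative", "sector": "non-profit"}
unified = rename_keys(org_row, ORG_RENAMES)
print(unified)
```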

scottofletcher commented 5 years ago

@ascott RE date format: when we edit an entry, we manually plug in the day we edited it (see column G in this spreadsheet: https://docs.google.com/spreadsheets/d/1uiSNHVzTWByC9ZgawKLcE7WTVyiKELQFxn_lthQDgFI/edit?usp=sharing). that messes up the ability to sort by edit date. the only alternative I can think of is downloading the CSV to get the edit date to plug in, but that would be crazy labour intensive...

scottofletcher commented 5 years ago

Is it possible to get a 'count' for all data fields? the media field counts are really helpful, but it would be great if I could see which entries need the most work (ie. have the fewest fields completed)
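One possible reading of this request, sketched in Python: count the non-blank cells per row and sort ascending so the entries needing the most work come first. The function name, the `ignore` parameter and the sample rows are all hypothetical.

```python
def completed_count(row, ignore=("id", "title")):
    """Count cells that are not blank, skipping columns that are always filled."""
    return sum(
        1
        for key, value in row.items()
        if key not in ignore and str(value).strip() != ""
    )


# Invented sample rows.
rows = [
    {"id": 1, "title": "Entry A", "purpose": "x", "approach": "y", "funder": "z"},
    {"id": 2, "title": "Entry B", "purpose": "", "approach": " ", "funder": ""},
    {"id": 3, "title": "Entry C", "purpose": "x", "approach": "", "funder": ""},
]

# Fewest completed fields first.
needs_work = sorted(rows, key=completed_count)
print([(r["id"], completed_count(r)) for r in needs_work])
```

Whether a count column of 0 (for example, no photos) should count as completed is a judgment call; this sketch treats any non-blank value as completed.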

participedia / api

Add CSV download url for all cases, methods and orgs #681