participedia / api

Website and API for Participedia V3
https://participedia.net
MIT License

[DESIGN] Participedia DATA Design Jam #895

Open · jesicarson opened this issue 4 years ago

jesicarson commented 4 years ago

This issue is the jumping-off point for collaboration with anyone interested in using Participedia data (either on its own or in combination with any other data, using any analysis methods).

Our goal is to experiment with the data on Participedia.net using concrete research questions, and in the process to expose and document potential opportunities as well as roadblocks or challenges with using the data in order to improve it. We aim to inform future iterations of the Participedia data model, codebook, data export systems or any other area to support our research community.

Resources:

Research page: https://participedia.net/research

Clicking any of the links below will automatically download a CSV file containing the latest data from Participedia.net to your device:

Draft Codebook (comment-only Google Doc) available for review.

Collaborator Accessible Data Forms (comment-only Google Docs) contain all of the text and values for the entry forms on Participedia.net:

(To see the entry forms live on Participedia.net, click the white "+ Quick Submit" button at the top of the screen on desktop, or the red "+" button on mobile, and select one of the entry types.)

jesicarson commented 4 years ago

To Do:

jesicarson commented 4 years ago

@Mattygryan welcome to Github!

jesicarson commented 4 years ago

Linking up this issue to #896 - using Miro to mock up a visual overview of the data model. Here are a couple of screenshots, and feel free to use this Miro link to collaborate on the board in real time (sign in required).

[Screenshots of the data model overview on the Miro board]

Mattygryan commented 4 years ago

@Mattygryan welcome to Github!

Cheers! I am getting used to the jargon! @jesicarson, your first post is ace - in the spirit of taking risks, can we just put that on the research page and then update as necessary? I think we need to be more agile.

The diagram is also ace. I will take a closer look, but my intuition is that it should go up on our research page too!

Another way to put it, related to #896: what is a good argument for not having any of this on the research page now?

jesicarson commented 4 years ago

@Mattygryan yes we can definitely make some changes to the research page in the near future (just have to put them in the queue for the dev team, as there's already some other work underway).

Also, great question from Sandy via Slack:

I'm curious who your main audience for the zoom call would be? Researchers involved in public participation? Or techies in PP? Or perhaps both. One thing that might help is to include a few quick examples of who might use Participedia data so people can see themselves as potential users and want to join the call.

My initial thoughts below, but would be great if @Mattygryan could weigh in on the audience question!

AUDIENCES (for this Jam/zoom call, and to detail on the research page): As of now, it seems like it will be for teams of researchers/data scientists in political science who use quantitative data and methods, but it might also include some qualitative political science researchers, as they could be tasked with fleshing out subsets of entries for the data folks to experiment with.

ZOOM CALL THOUGHTS: We could aim for a core group of confirmed participants (partners on the Participedia project, and their students) and then open the call up to anyone else who may want to hop on, and see what happens. There'd be an agenda, but the call would also be somewhat open for participants to self-organize into potential working groups.

Mattygryan commented 4 years ago

Agreed on the audience @jesicarson. I am quite keen not to have a call that is only about what we might do, though. I agree with Sandy that we need to show something to people before any meeting so they have a vision of what we are talking about. We may be able to do this with some existing work, e.g. the summaries @plscully is working on, or the work presented and questions raised in Sussex - do we have a list somewhere?

Mattygryan commented 4 years ago

Here's a summary of some thinking I wrote in an email discussion among some Participedians about a student project, which @jesicarson suggested I share:

The question of validity is key.

My thinking has been that what is relatively unique about Participedia is that we have crowdsourced a relatively large non-random sample of structured and unstructured data over a long period of time. So we are in this exciting or awkward place, depending on how you look at it: somewhere in between the methodological considerations that big data researchers deal with, those faced by large-N qualitative archival research, the small-N comparative case-selection logic, and the questions that more traditional forms of secondary data analysis think about.

We have some almost 'boilerplate' things we can say about the quality of the data. The data we have collected emanated from the work of a network of leading scholars in the field who have been reflecting on these issues, and who, though biased, are likely to have some authority about where to look for relevant cases, which is one strength we can claim. We have also improved the quality of the data through review, and refined our data models and methods of collection.

I think then, when we come to doing research, there are different strategies for 'validating' any claims from the data, depending on the research question and what we are trying to speak to. It might be that we can make logical inferences within the scope of what we've got, or that we have to do some systematic imputation of missing values. It might be that we link to existing secondary data (administrative data in countries we work in, or other datasets collected by research NGOs), either as a complement or a robustness test or both. Or, for the same reasons, some questions might benefit from supplementary primary data collection – surveys, or expert surveys (we have been thinking this might be a good way to use the network we have built as extra validation).

I think I've recognised that it is quite difficult to know where to start with answering the question of what we need to do to really enable or facilitate cashing out what we have got. So my proposal is that we take an agile approach: start trying to answer RQs that interest us, but share the difficulties and our thoughts on overcoming them as a network and team. We can then better know whether, for example, development should build in a tool to link, analyse, or structure data, or whether we should build up the dataverse or try an expert survey to see what that can do for us.

jesicarson commented 4 years ago

Moving into a phase of visualizing Participedia data in experimental ways... Check out this experimental visualization of Participedia data that our team made in Madrid during the Medialab Prado Collective Intelligence for Democracy Workshop in 2017. The size of the bubbles represents how much media is entered (zoom into Madrid - the biggest bubble). Here's the site: http://madrid.surge.sh/ and a screenshot below.

I was thinking we could build on this, but the size of the bubbles could represent the number of files/documents/links included - showing visually which cases have the most additional data attached. Could this correlate with completeness, maybe? It's also just cool to see the entries represented in a different way - as an ecosystem of connected bubbles. Might we be able to use this kind of visualization to make linkages between cases, components, and the methods they use by clustering bubbles together? @ascott @Mattygryan (ps: @davidascher made this Madrid site so maybe he has some ideas too!)

[Screenshot of the Madrid bubble visualization]
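To make the bubble-size idea concrete, here is a minimal Python sketch (not how the Madrid site was built) that sizes each case's bubble by how many files/links are attached. The file name cases.csv and the latitude, longitude, files, and links columns are assumptions about a hypothetical export, not the actual schema:

```python
# Sketch: size case bubbles by how many files/links each entry has attached.
# Assumes a hypothetical export "cases.csv" with latitude, longitude, files,
# and links columns; the real Participedia export schema may differ.
import pandas as pd
import matplotlib.pyplot as plt

cases = pd.read_csv("cases.csv")

# Count attached media per case (files/links assumed to be comma-separated strings).
def count_items(value):
    if pd.isna(value) or not str(value).strip():
        return 0
    return len(str(value).split(","))

cases["attachments"] = cases["files"].apply(count_items) + cases["links"].apply(count_items)

plt.figure(figsize=(12, 6))
plt.scatter(
    cases["longitude"],
    cases["latitude"],
    s=10 + cases["attachments"] * 20,  # bubble grows with attachment count
    alpha=0.4,
)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Cases sized by number of attached files/links")
plt.savefig("case_bubbles.png", dpi=150)
```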

Mattygryan commented 4 years ago

That's very cool. It would be good to map the network of elements we have on the site. Loads of good stuff in the newsletter!

jesicarson commented 4 years ago

@plscully @Mattygryan @ascott Here's a wild idea... can we structure a design jam (perhaps with a few scheduled sessions over the summer) using some or all of the SSHRC 2 research cluster themes? i.e. get together and see what data we can mine/visualize from ppedia (and/or other data sources) on each of the themes, with the intention of identifying gaps that will be useful for phase 2 research clusters? Obviously it would be high level and a bit hacky, but it could be a fun springboard into ppedia phase 2... maybe a few team members/students would be into it as a fun summer collaboration.

[Screenshot of the SSHRC phase 2 research cluster themes]

Mattygryan commented 4 years ago

I think it is a good idea. It could replace an all-team meeting in some ways. I suppose the format would depend on whether those directly involved in preparing the phase 2 application would find it useful, or whether it would be something separate...

amberfj commented 4 years ago

This sounds like a great long term approach. We would have to organize cases/methods by these deficits if they aren't already. My guess is that this might take a good amount of upfront work by political science researchers prior to visualization. We can help to design this process as a prompt, which might help with the SSHRC application process.

First, we might focus on lower hanging fruit like doing some basic analysis by geographic region using structured factors, e.g. general issues, general types of methods, or scope of influence. This might tell us that certain issues and methods are more common in some regions of Participedia's dataset, providing gist-level insights and identifying gaps in our dataset.
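As a rough sketch of that first pass, something like the following could tally general issues by country from the CSV export. The column names (country, general_issues) are assumptions that would need to be checked against the actual export and codebook:

```python
# Sketch: tally general issues per country from the case export to spot
# regional patterns and gaps. Column names ("country", "general_issues")
# are assumptions; check them against the actual CSV export / codebook.
import pandas as pd

cases = pd.read_csv("cases.csv")

# general_issues is assumed to hold comma-separated values; explode to one row per issue.
issues = (
    cases.assign(general_issues=cases["general_issues"].fillna("").str.split(","))
    .explode("general_issues")
)
issues["general_issues"] = issues["general_issues"].str.strip()
issues = issues[issues["general_issues"] != ""]

# Cross-tabulate country vs. issue, then show the countries with the most cases.
counts = pd.crosstab(issues["country"], issues["general_issues"])
print(counts.loc[counts.sum(axis=1).sort_values(ascending=False).head(20).index])
```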

We could also extend this to a more general semantic analysis of case data, for example through recurrent neural networks (RNNs) in TensorFlow with word embeddings, and then grouping the resulting representations with k-means clustering or something similar. This sounds complicated, but we might be able to find an open-source library that does the trick.
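As a much lighter stand-in for the RNN/embedding pipeline, here is a sketch that clusters case descriptions with TF-IDF features and scikit-learn's k-means. The description column name is an assumption about the export:

```python
# Sketch: cluster case descriptions with TF-IDF + k-means as a lightweight
# first pass (a stand-in for the heavier embedding/RNN approach discussed).
# The "description" column name is an assumption about the CSV export.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

cases = pd.read_csv("cases.csv")
texts = cases["description"].fillna("").astype(str)

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(texts)

kmeans = KMeans(n_clusters=10, random_state=0, n_init=10)
cases["cluster"] = kmeans.fit_predict(X)

# Show the top terms per cluster to get a rough sense of the themes.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top = [terms[j] for j in center.argsort()[-8:][::-1]]
    print(f"cluster {i}: {', '.join(top)}")
```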


jesicarson commented 4 years ago

Thanks @Mattygryan and @amberfj! Agreed about the up front work, and it may not be possible with every theme in the proposal (some of them may not have as many pre-existing entries in the database yet, and it sounds like there may be new data models required for them also). However, for the human rights theme, for example, the proposal states that there are at least 160 entries within this theme already on the site - I'm guessing @plscully used search filters to pull this number, is that right Pat? Could we start with this set of results for the human rights theme and do some of the basic analysis Amber suggested? We can then see if we can pull sets of results for the other themes using search filters (or not - which will also help identify gaps). But I agree it would be ideal to have political scientists be part of this up front work.

amberfj commented 4 years ago

Good idea, Jesi. I like the idea of focusing on human rights. Once we have an analysis method set up, we can zoom in to look at small groups of cases based on the human rights filter.

jesicarson commented 4 years ago

The human rights & civil rights filter brings up 130 results (https://participedia.net/?selectedCategory=case&general_issues=human). Should we see what @ascott can do with this set to get the ball rolling?

amberfj commented 4 years ago

Sure, but I would start with the entire dataset of cases first.


ascott commented 4 years ago

Hey all, lots of interesting ideas here!

I can start with some basic charts/maps looking at the structured data we have, e.g. general issues, general types of methods, or scope of influence, to get a sense of where some gaps in the data might be. I'm thinking of a map with checkboxes to show/hide entries with different filter values, to quickly get a sense of location density for a given field/value.
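As a rough sketch of that checkbox map, something like the following could be done with folium (one toggleable layer per issue value). The latitude, longitude, general_issues, and title columns are assumptions about a hypothetical export:

```python
# Sketch: a leaflet map (via folium) with one toggleable layer per
# general-issue value, so entries can be shown/hidden to eyeball density.
# Column names (latitude, longitude, general_issues, title) are assumptions,
# and multi-valued general_issues fields would need splitting first.
import folium
import pandas as pd

cases = pd.read_csv("cases.csv").dropna(subset=["latitude", "longitude"])

m = folium.Map(location=[20, 0], zoom_start=2)

for issue, group in cases.groupby("general_issues"):
    layer = folium.FeatureGroup(name=str(issue))
    for _, row in group.iterrows():
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=4,
            popup=str(row["title"]),
        ).add_to(layer)
    layer.add_to(m)

folium.LayerControl(collapsed=False).add_to(m)  # checkboxes to show/hide layers
m.save("cases_map.html")
```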

Looking more specifically at entries related to human rights, we could use the general_issues:human field:value on cases, which gives us 130 cases. We can also use the keyword search with "human rights", which gives us 179 entries. There are of course many other terms and ways of classifying entries as related to human rights. Some other terms we could search with might be:

I agree that there could be some work to be done by editors and researchers to search through our dataset as-is using the keyword search (and other filters), to make sure that the structured data fields are completed for entries related to human rights.

After this, I could create a subset of entries based on our structured data and keyword searches, to visualize geographic data or NLP results on the set. I took a closer look into k-means clustering and there are many examples using open-source algorithms that we could experiment with, e.g.: https://observablehq.com/@romaklimenko/finding-palettes-of-dominant-colors-with-the-k-means-cluste
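As a small sketch of how that subset could be assembled from a CSV export, combining the structured filter with a keyword match (the column names here are assumptions):

```python
# Sketch: build a human-rights subset by combining the structured
# general_issues filter with a keyword search over titles/descriptions.
# Column names are assumptions about the CSV export.
import pandas as pd

cases = pd.read_csv("cases.csv")

structured = cases["general_issues"].fillna("").str.contains("human", case=False)
text = cases["title"].fillna("") + " " + cases["description"].fillna("")
keyword = text.str.contains("human rights", case=False)

subset = cases[structured | keyword]
print(f"{structured.sum()} by structured filter, {keyword.sum()} by keyword, "
      f"{len(subset)} combined")
subset.to_csv("human_rights_subset.csv", index=False)
```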