Project roadmap? - GitHub Issues

dhenderson commented 8 years ago

Hey there - thanks for starting this project. I'm interested in hearing where you see this going and how you see it differing from something like ProPublica's Nonprofit API.

I've built some tools for scraping nonprofit data for a side project called GaveTo and might be interested in collaborating on this project if you're looking for help.

chadokruse commented 8 years ago

Hey David - Thanks for starting the conversation!

First, taking a step back, my primary motivation for launching the project was two-fold:

1. I had a treasure trove of structured data on nonprofits that others might find useful
2. Gathering certain profile information on nonprofits (logo, Bitcoin address, plain English summary, blog RSS feed, etc.) seemed unnecessarily difficult

The intent of the project is to provide the profile information that developers like you and me need but that can't be found via IRS-based databases (e.g. the ProPublica API). Guidestar has their Exchange API of course, but at $5k per year it's simply out of reach for most solo developers and small startups.

Tentative roadmap:

1. Gather any/all existing data
2. Test assumption that developers find it useful
3. Decide on a schema
4. Test methods for gathering missing data

Would love your thoughts/critiques on any of the above.

PS - love your GaveTo project...great work!

dhenderson commented 8 years ago

Nice roadmap, and this is certainly a data source I wish were available, so I'm happy to help contribute. At the very least I'm happy to contribute any GaveTo data and to open source/run my scrapers to contribute data back once there's a schema set up.

Perhaps it would be helpful to have a set of ideal indicators, which will likely also guide the schema. For example:

{
  "name" : "Family Independence Initiative",
  "ein" : "020784790",
  "media": {
    "websites" : ["http://www.fii.org/"],
    "blogs" : ["http://www.fii.org/blog/"],
    "facebook" : "https://www.facebook.com/FamilyIndependenceInitiative"
  },
  "target_geographies" : {
    "countries" : ["USA"]
  },
  "ntee" : ["P20"],
  ...
}

Obviously not exhaustive, but perhaps a starting point. The biggest pain points for me developing GaveTo were:

- Having to retrieve websites
- Building a website scraper to retrieve social media links and find RSS/Atom feeds
- Figuring out where nonprofits actually do work (not necessarily where their headquarters are)
- Finding a clean NTEE code to description mapping, which is why I published this JSON file.

Thanks for kicking this off, look forward to helping out!

chadokruse commented 8 years ago

Sounds great David!

You bring up a number of great points and I just updated the README to (hopefully) capture some of them.

The first is the schema decision. The reason I've pushed that out until later in the roadmap is that I want to make it as easy as possible for people to contribute their data during these early stages. I've seen a number of great schema efforts over the years (e.g. Schema.org's NGO schema), but none of them covered my specific use cases, so I would have had to do all sorts of backflips to get my data to mesh with those efforts. For example, we were working with many grassroots and international organizations, so our schema had to handle fiscally sponsored nonprofits, international projects run by individuals, and other atypical use cases.

That said, I could see where having an example schema might actually be helpful for some just to provide a starting point. Any chance you have time to put something together?

Looking forward to working with you on this!

PS - sounds like you've built some great scrapers. Did you build these from scratch or use a third-party tool like Kimono?

dhenderson commented 8 years ago

I certainly hear you on not wanting a schema to be a barrier. Perhaps the path forward is to set up a place to just dump a bunch of raw data and then check out what we have? I'm not sure what the best approach is, but agree we don't want conforming to a particular schema to be a barrier to data sharing.

Regarding the scraper, I wrote that myself using Python. Basically I grabbed about 100k nonprofit websites from a nameless source and wrote a scraper that, given a website, hits the homepage and scrapes it for social media contacts and RSS/Atom feeds.
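
For anyone following along, here is a minimal sketch of that kind of homepage scraper. To be clear, this is not the GaveTo code, just an illustration using only the Python standard library. The names `SOCIAL_HOSTS`, `HomepageLinkParser`, and `extract_links` are made up for this example, the list of social hosts is an assumption, and fetching the page over the network is left out so the parsing step can be run on any HTML string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed, non-exhaustive mapping of hostnames to social network labels.
SOCIAL_HOSTS = {"facebook.com": "facebook", "twitter.com": "twitter"}
# MIME types that conventionally mark RSS/Atom feeds in <link> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class HomepageLinkParser(HTMLParser):
    """Collects social media links and feed URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.social = {}   # label -> first matching URL
        self.feeds = []    # feed URLs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        url = urljoin(self.base_url, href)
        if tag == "link":
            if (attrs.get("type") or "").lower() in FEED_TYPES:
                if url not in self.feeds:
                    self.feeds.append(url)
        elif tag == "a":
            host = (urlparse(url).hostname or "").lower()
            if host.startswith("www."):
                host = host[4:]
            label = SOCIAL_HOSTS.get(host)
            if label:
                # Keep only the first link seen for each network.
                self.social.setdefault(label, url)


def extract_links(html, base_url):
    """Return the social links and feeds found in an HTML string."""
    parser = HomepageLinkParser(base_url)
    parser.feed(html)
    return {"social": parser.social, "feeds": parser.feeds}
```

In practice the HTML would come from a request to the homepage (for example via `urllib.request.urlopen`), and a real crawl over ~100k sites would also need timeouts, error handling, and rate limiting, none of which is shown here.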

I'm certainly happy to take an initial stab at a schema for a starting point. Alternatively, I'm good with just sharing what I have in terms of raw data and going from there.

chadokruse commented 8 years ago

Cool, yeah, I like the idea of just dumping all the raw data in the repo and re-assessing the schemas once we know what we're working with. Will be interesting to see the various schemas people have come up with, and even more interesting to see how widely they vary from the two main charitable data providers (Foundation Center and Guidestar).
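
One rough way to "know what we're working with" once raw data lands in the repo is to count how often each field shows up across contributed records. The sketch below assumes the contributions can be loaded as Python dicts (e.g. from JSON); `flatten_keys` and `field_coverage` are hypothetical helper names, not anything that exists in the repo.

```python
from collections import Counter


def flatten_keys(record, prefix=""):
    """Yield dotted key paths for a nested dict, e.g. 'media.websites'."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects and extend the dotted path.
            yield from flatten_keys(value, prefix=f"{path}.")
        else:
            yield path


def field_coverage(records):
    """Count how many records contain each (dotted) field path."""
    counts = Counter()
    for record in records:
        # Use a set so a field is counted at most once per record.
        counts.update(set(flatten_keys(record)))
    return counts
```

Fields that appear in most contributions are natural candidates for a shared schema, while rare ones point to the atypical use cases a schema would need to leave room for.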

Sounds like you've got some great data to add...great work on the scraper!

smartergiving / open-data

Project roadmap? #2