openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Show users who downloaded data / used the api for a scraper #595

Closed mlandauer closed 9 years ago

mlandauer commented 9 years ago

Related to #122

This gives users a sense of who and how many people are using the data they've scraped. Also by adding visibility we potentially foster more interaction between users and hopefully more sense of community and more scraping...

equivalentideas commented 9 years ago

As morph.io is the scraper platform for civic hackers, are there privacy concerns here for activists or journalists using morph.io having their research activity made public and searchable?

I feel like this is an issue, but haven't seen any cases on point. Just feels :S

If there are privacy issues, we don't have to make this completely public to get the benefits above. We could just show it to scraper owners or logged-in users. We could also phase this in and see how it goes.

equivalentideas commented 9 years ago

Here's a few examples of how this is done on some other platforms:

Freesound.org

Repo of open field recordings. Shows the download count and the people who've downloaded on a recording page

screen shot 2015-04-29 at 11 10 54 am

and the user names of people who've downloaded on a recording/downloaders page, separated by day:

screen shot 2015-04-29 at 12 00 36 pm

GitHub

Similar concept, GitHub shows people who've forked a repo:

screen shot 2015-04-29 at 11 18 00 am

Their display of the contributor count prominently beneath the description is also worth noting:

screen shot 2015-04-29 at 11 17 11 am

mlandauer commented 9 years ago

@equivalentideas I think it's important to think through the privacy implications

I'm keen on trying, as an experiment, to make public who downloads/uses data because it has the potential to slightly redress the imbalance between the people that create/scrape data and publish it for free on morph.io and those that potentially use/abuse it.

More concretely, if I put in the effort to scrape some data and give it away for free, shouldn't I be able to see who downloads it and potentially see what they do with it as well? It seems like a fair bargain to me.

That's the reason I would like to see who uses the data but there are plenty of reasons why it could be a privacy issue.

The most important thing is to make it very clear to users that this information might get shown.

If people want privacy they can create a fake identity on GitHub and use that to download data.

We could potentially offer an "opt-out" to hide this kind of public activity.

However, given the nature of what morph.io is I think it's important to maintain an "open by default" approach. It enables serendipity and fosters community.

mlandauer commented 9 years ago

I like the way that freesound shows that information with the summary and more detailed view. Maybe the summary could say "Downloaded 627 times by 34 users"?

equivalentideas commented 9 years ago

I'm with you, I think it would be great for the community to show this information to connect people scraping and using data.

However, given the nature of what morph.io is I think it's important to maintain an "open by default" approach. It enables serendipity and fosters community.

I agree with this. Maybe when we implement this we do it with an admin-only feature flag at first so we can have a bit of a look through to see if there are any obvious issues?

The most important thing is to make it very clear to users that this information might get shown.

:+1: :+1: I agree this is crucial.

Thinking about the privacy risks, I guess a scenario is someone scraping data about a company or powerful institution that then attacks them in some way as a response. That could be very serious, and we definitely don't want people to expose themselves accidentally.

mlandauer commented 9 years ago

Thinking about the privacy risks, I guess a scenario is someone scraping data about a company or powerful institution that then attacks them in some way as a response. That could be very serious, and we definitely don't want people to expose themselves accidentally.

@equivalentideas I think it's interesting that you bring this one up as an example. That is an issue already. If I write a scraper then I'm already the biggest target. See for instance https://morph.io/mlandauer/australian_food_products

equivalentideas commented 9 years ago

That is an issue already. If I write a scraper then I'm already the biggest target.

Good point. Scraper authors are definitely exposed to risk here. You could also make a case that people using data in reporting and activism could be more threatening to powerful institutions than the scraper authors themselves in some cases.

I think we should consider the privacy risks of writing scrapers and downloading data separately.

People downloading data don't necessarily have the skills to write scrapers and, we might assume, aren't quite as familiar with online privacy/information leaks.

equivalentideas commented 9 years ago

Downloaded 627 times by 34 users

@mlandauer is the data for this available? Like scraper.downloads.count ? scraper.downloads.users.count?

equivalentideas commented 9 years ago

There are a few different ideas going on in this thread so I want to bring it back to getting us to a first iteration.

For the first pass at this, what's the most basic information we can add to a scraper?

Data we could show includes:

There are a lot of API queries happening so that seems initially a good thing to add to get an idea of how scrapers are being used. We could show a summary as "7235 API queries by 3 users" at the top of the 'Data' section:

2015-04-30 12 33 14

I think we should initially add this to the scraper show page, and then the index as an iteration.

We could show the names/avatars of users underneath the heading as a closure (maybe second iteration).

@mlandauer am I on the right track with this?

mlandauer commented 9 years ago

That sounds good and the idea of the first step seems good to me. I also like your idea of doing it in small steps.

One thing to bear in mind is that what's stored in the "api queries" table is actually a bunch of different things. It's downloads of a whole table in the UI (appears as type sql and format csv), downloads of the whole sqlite database (appears as type sqlite) and queries through the api (which appear as type sql and any format including csv).

So maybe we can use a word that covers all of those, for example "Downloaded 7843 times by 1 user".

Also, we should use the word user to be consistent with the language in the rest of the site.
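To make the counting concrete, here is a minimal Ruby sketch of that summary line. It is not morph.io's code: the `ApiQuery` struct, its fields and the `download_summary` helper are made up for illustration, and the `nil` format on the sqlite row is a guess. Every row counts as a download regardless of its type, and users are counted as distinct owners.

```ruby
# Hypothetical stand-in for one row of the "api queries" table described above.
ApiQuery = Struct.new(:owner, :type, :format)

# Naive pluralisation, enough for "time" and "user".
def pluralize(count, word)
  count == 1 ? "1 #{word}" : "#{count} #{word}s"
end

# Treats every row as a download, whatever its type, and counts distinct owners.
def download_summary(queries)
  owners = queries.map(&:owner).uniq
  "Downloaded #{pluralize(queries.size, 'time')} by #{pluralize(owners.size, 'user')}"
end

queries = [
  ApiQuery.new("alice", "sql", "csv"),   # whole-table download in the UI
  ApiQuery.new("alice", "sqlite", nil),  # whole-database download (format assumed empty)
  ApiQuery.new("alice", "sql", "json")   # query through the API
]

puts download_summary(queries) # prints "Downloaded 3 times by 1 user"
```

In a Rails app the same two numbers would more likely come from database queries than from an in-memory array; the array just keeps the sketch self-contained.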

equivalentideas commented 9 years ago

One thing to bear in mind is that what's stored in the "api queries" table is actually a bunch of different things. It's downloads of a whole table in the UI (appears as type sql and format csv), downloads of the whole sqlite database (appears as type sqlite) and queries through the api (which appear as type sql and any format including csv).

That's why I was so confused by the admin thing :)

Thanks @mlandauer , I think we're on the same track :+1:

I'll do a static version as a first pass in a PR. Should I leave the dynamic bit up to you or will it be a simple APIQueries.where(scraper: scraper.id).size type thing?

mlandauer commented 9 years ago

@equivalentideas when I say a static template don't put in any dynamic elements. Just put in some random numbers. That way you don't have to worry or think about that and you can focus on the design stuff.

After that we can decide whether you want to implement the dynamic stuff or whether I'll do it.

The important thing is don't think about the implementation yet! :-)

mlandauer commented 9 years ago

One more thing to consider: the downloaders can also be organisations. This will happen in the case where an API key is used which belongs to an organisation. This is how PlanningAlerts applications are downloaded, for example. So, the design will need to incorporate that in some way.
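One way to picture this, as a rough sketch only: treat whoever owns the API key as the downloader, whether that is a user or an organisation. The `User`, `Organization` and `Download` structs and the nicknames below are hypothetical stand-ins, not morph.io's models.

```ruby
# Hypothetical stand-ins; a real app would use its own user/organisation models.
User = Struct.new(:nickname)
Organization = Struct.new(:nickname)

# A download is attributed to the owner of the API key that was used.
Download = Struct.new(:key_owner)

# Distinct downloaders, users and organisations alike, in first-seen order.
def distinct_downloaders(downloads)
  downloads.map(&:key_owner).uniq
end

alice = User.new("alice")
some_org = Organization.new("some_org")
downloads = [Download.new(some_org), Download.new(alice), Download.new(some_org)]

distinct_downloaders(downloads).each do |owner|
  kind = owner.is_a?(Organization) ? "organisation" : "user"
  puts "#{owner.nickname} (#{kind})"
end
```

Because both kinds of owner end up in the same list, a summary line can count them together while a detailed view can still tell them apart.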

equivalentideas commented 9 years ago

Here's one layout option. I think on wide screens it needs a line or something to make the downloads info less floaty.

screen shot 2015-04-30 at 2 22 59 pm screen shot 2015-04-30 at 2 22 54 pm

Here's a version with the info on the left next to the heading—I think this is the most simple path initially.

screen shot 2015-04-30 at 2 29 50 pm

equivalentideas commented 9 years ago

If there's a small number of users you could just show their avatar instead of the count.
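A small sketch of that rule, assuming a cut-off of five downloaders. The threshold, the `AVATAR_LIMIT` name and the returned hash shape are all assumptions for illustration.

```ruby
# Hypothetical threshold: at or below this many downloaders, show avatars.
AVATAR_LIMIT = 5

# Decides what the view should render for a list of downloader names.
def downloaders_display(names)
  if names.size <= AVATAR_LIMIT
    { mode: :avatars, names: names }
  else
    { mode: :count, text: "#{names.size} users" }
  end
end

p downloaders_display(%w[alice bob])
p downloaders_display(%w[a b c d e f g])
```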

equivalentideas commented 9 years ago

The static version of this is on https://github.com/openaustralia/morph/tree/show_scraper_usage

equivalentideas commented 9 years ago

This is for a later iteration, but I was just looking at this idea and thought it's also interesting to know how something has been used over time, particularly if you haven't looked at one of your scrapers in a while and you come back and suddenly the usage has gone up loads. A subtle sparkline next to the figure could show this nicely:

2015-04-30 14 46 30
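A sparkline needs one number per time bucket. As a sketch, assuming each download carries a date, the per-day series could be built like this; the `daily_download_counts` helper and the sample dates are illustrative only.

```ruby
require "date"

# Number of downloads on each day from `from` to `to` inclusive.
# Days without any download appear as 0 so the line has no gaps.
def daily_download_counts(download_dates, from:, to:)
  counts = download_dates.each_with_object(Hash.new(0)) { |day, tally| tally[day] += 1 }
  (from..to).map { |day| counts[day] }
end

dates = [Date.new(2015, 4, 28), Date.new(2015, 4, 28), Date.new(2015, 4, 30)]
series = daily_download_counts(dates, from: Date.new(2015, 4, 27), to: Date.new(2015, 4, 30))
p series # prints [0, 2, 0, 1]
```

The resulting array can be handed to whatever draws the sparkline; weekly buckets would work the same way with a coarser key.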

mlandauer commented 9 years ago

Please incorporate this:

One more thing to consider: the downloaders can also be organisations. This will happen in the case where an API key is used which belongs to an organisation. This is how PlanningAlerts applications are downloaded, for example. So, the design will need to incorporate that in some way.

equivalentideas commented 9 years ago

One more thing to consider: the downloaders can also be organisations. This will happen in the case where an API key is used which belongs to an organisation. This is how PlanningAlerts applications are downloaded, for example. So, the design will need to incorporate that in some way.

In the first iteration it seems simplest not to distinguish users and orgs (also, what's the difference to the citizen?), but if we show the avatars of downloaders (with name on tooltip) then it's clear whether it's a person or organisation. I'll add that to the static prototype #679 .

equivalentideas commented 9 years ago

@mlandauer we should resolve the privacy discussion here in light of the current implementation before deploying this I think.

equivalentideas commented 9 years ago

The most important thing is to make it very clear to users that this information might get shown.

Users haven't had any hint that their activity is about to get shown. Is this a problem?

One idea: only show the downloaders list for people who've downloaded after we deploy this, and do a little blog post about it and email it round/put a notice on the site or something, so that people using the API will know they're getting revealed. We could even hold back the reveal or something and give people a chance to stop using it or give us some feedback before it's shown.

Not sure.

mlandauer commented 9 years ago

I think we should deploy it hidden behind a feature flag. So that we can switch the feature on for us and just get a sense of what's going on before we decide what to do next.
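In its simplest form a flag like that is a per-user check made before rendering. A minimal sketch, where the `Viewer` struct, the `:see_downloads` flag name and the helper are hypothetical:

```ruby
# Hypothetical viewer with a list of enabled feature flags.
Viewer = Struct.new(:nickname, :feature_flags)

# Only viewers with the flag set see the downloaders; anonymous visitors never do.
def can_see_downloaders?(viewer)
  return false if viewer.nil?
  viewer.feature_flags.include?(:see_downloads)
end

flagged = Viewer.new("flagged_user", [:see_downloads])
regular = Viewer.new("regular_user", [])

puts can_see_downloaders?(flagged) # prints true
puts can_see_downloaders?(regular) # prints false
puts can_see_downloaders?(nil)     # prints false
```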

equivalentideas commented 9 years ago

That sounds perfect :+1:

mlandauer commented 9 years ago

I'll make the change for that on the same branch of the PR

mlandauer commented 9 years ago

We have a version of this deployed now that only enables the viewing of this info for users with the feature flag set. Currently this is only @mlandauer, @equivalentideas and @henare.

We're doing this to figure out if and how we're going to resolve any privacy issues with this feature.

Whatever we do we should add some documentation / notice explaining that this information is recorded and shown.

Let me try to articulate some options beyond that:

  1. We just enable it wholesale for everyone including with retrospective information.
  2. We only publicly show information for downloads after the date we added the disclaimer.
  3. We allow users to opt-out of this information being shown for their downloads

Gut feeling is that there are other ways to do this too. Please suggest any that come to mind.

Out of these my preference would be option 1 or 2. Option 1 is the simpler one and goes along with the ask-for-forgiveness approach, but you could strongly argue that you shouldn't apply that to privacy! So, that for me tips the balance in favour of option 2.

equivalentideas commented 9 years ago

This is a good summary of where this is at.

I think 2 is the best of these options, as it has the fewest potentially nasty surprises for our lovely citizens.

equivalentideas commented 9 years ago

Is there a way we could iterate from 2 to 1? See if anyone has an issue for a while and then extend it back? Would that just be really confusing?

mlandauer commented 9 years ago

Based on feedback from @henare and @equivalentideas I think we'll go with option 2.
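Option 2 boils down to a date filter on what gets displayed. A sketch under stated assumptions: the cut-off date below is a placeholder rather than the real one, the `Download` struct is hypothetical, and "after the date" is read strictly, so downloads on the disclaimer day itself stay hidden.

```ruby
require "date"

# Placeholder only; the real date is whenever the disclaimer went live.
DISCLAIMER_ADDED_ON = Date.new(2015, 5, 8)

# Hypothetical record of who downloaded and on which day.
Download = Struct.new(:owner, :on)

# Keeps only downloads made strictly after the cut-off date.
def publicly_visible(downloads, cutoff: DISCLAIMER_ADDED_ON)
  downloads.select { |download| download.on > cutoff }
end

downloads = [
  Download.new("alice", Date.new(2015, 4, 30)), # before the notice: stays hidden
  Download.new("bob", Date.new(2015, 5, 8)),    # on the day itself: stays hidden
  Download.new("carol", Date.new(2015, 5, 9))   # after the notice: shown
]

puts publicly_visible(downloads).map(&:owner).inspect # prints ["carol"]
```

Older rows stay in the table, so moving from option 2 to option 1 later would only mean dropping the filter.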

equivalentideas commented 9 years ago

On the display of downloaders here, it would be nice to order them by downloads, or most recent or something meaningful.

equivalentideas commented 9 years ago

I'm interested in whether there are any observations we can make about download habits from this so far. Are there ways we could improve this display?

henare commented 9 years ago

On the display of downloaders here, it would be nice to order them by downloads, or most recent or something meaningful.

I initially was surprised it wasn't sorted by number of downloads but then assumed it was by most recent access.

mlandauer commented 9 years ago

@henare I think it would be best sorted by number of downloads. That was actually something I was intending but stupidly didn't write down.
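Sorting by number of downloads could look something like the sketch below. The helper name and the alphabetical tie-break are assumptions, and the input is simply one owner name per download.

```ruby
# Returns owner names ordered by how many downloads each made, most first.
# Ties are broken alphabetically so the order is stable between page loads.
def downloaders_by_count(owners)
  counts = owners.each_with_object(Hash.new(0)) { |owner, tally| tally[owner] += 1 }
  counts.sort_by { |name, count| [-count, name] }.map(&:first)
end

owners = %w[bob alice alice carol bob alice]
p downloaders_by_count(owners) # prints ["alice", "bob", "carol"]
```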

equivalentideas commented 9 years ago

So our plan is "We only publicly show information for downloads after the date we added the disclaimer". We'll need to notify people somehow.

What's the best way to get notified that your formerly hidden activity is going to become public? A blog post, an alert banner on the top of the site, an email, a popup when you click download?

We should also remember we think this will be an awesome new feature for most users, so we should spruik it :)

Some users are downloading via the API and may not visit the site. I think we need an email out to reach these people to let them know that in x number of days their name will be shown. It may be worth emailing all users with this as it's also a nice reminder of morph and that new things are being added.

I think we should write a blog post explaining why we're doing this, which we can tweet out.

I think a banner on the site to let people know about the new feature with a link to the blog post or 'What's new' page is also a good idea, for users who aren't on Twitter and miss the email.

So my initial ideas are an email, a banner on the site, a blog post, update What's new (drawing from the blog post), tweets.

What do you guys think about an alert the first time you go to download a scraper?

equivalentideas commented 9 years ago

There's also the ongoing need to alert new users. When a user goes to download a scraper they should see the list of downloaders and put two and two together. But many scrapers have no downloads so they potentially won't know their name will be added.

@henare came up with a nice solution for this situation the other day. Here's what I remember:

I go to download a scraper. Next to the 'Downloaded 0 times' text is a light/transparent version of your avatar, with small text explaining that you'll be added.

With the kinks worked out, something like that could be a simple and very clear way to communicate what will be shown.

@henare might be able to explain this better.

mlandauer commented 9 years ago

I think we should do this with a light touch. I think a clearly worded message next to the download buttons and an explanation on the API page would suffice. It is more than likely that anyone downloading stuff after those messages are there would become aware of what the new plan is.

I don't think it's necessary to email anybody. It won't necessarily make more people understand or know what is happening (because they likely won't pay attention until it's relevant to them) but it might needlessly stir people up.

@henare It would be great if you could explain the idea a little more

equivalentideas commented 9 years ago

I think we should do this with a light touch. I think a clearly worded message next to the download buttons and an explanation on the API page would suffice. It is more than likely that anyone downloading stuff after those messages are there would become aware of what the new plan is.

I don't think it's necessary to email anybody. It won't necessarily make more people understand or know what is happening (because they likely won't pay attention until it's relevant to them) but it might needlessly stir people up.

In the clear light of morning, I think you're right :) A message next to the information is the simplest approach for sure.

I still think that maybe in a little while, when we see how this feature gets used, a blog post about it could be really nice, as it's quite an interesting feature for other civic tech developers.

So immediate todos:

mlandauer commented 9 years ago

:+1:

I still think that maybe in a little while, when we see how this feature gets used, a blog post about it could be really nice, as it's quite an interesting feature for other civic tech developers.

equivalentideas commented 9 years ago

Is this kind of what you were thinking for the scraper page notice @mlandauer ?

screen shot 2015-05-07 at 12 45 52 pm

equivalentideas commented 9 years ago

For the notice on the API page we could use a panel with a warning style:

screen shot 2015-05-07 at 12 53 55 pm

Privacy

To promote collaboration on morph.io we show the list of people who have downloaded a scraper. This includes people downloading the data through API requests. We hope you can use this feature to connect with other people with similar interests to you.

If it is important for you to protect your identity in downloading data from morph.io, you could make an alternate GitHub account and email that you use for this work. If this is not an option, please contact us. We'll do our best to help if you explain your circumstances. Morph is the scraping platform for civic hackers, activists and researchers. We understand there are times when this work must be protected.

This is probably too long, but I just wanted to get down the main points for the API page.

mlandauer commented 9 years ago

@equivalentideas On the scraper page I think it would be nice if the notice was closer to the download button because it's when looking at those that the information is most useful

equivalentideas commented 9 years ago

@equivalentideas On the scraper page I think it would be nice if the notice was closer to the download button because it's when looking at those that the information is most useful

:+1: I'll do a PR with something simple and we can go from there.

equivalentideas commented 9 years ago

screen shot 2015-05-07 at 2 39 24 pm screen shot 2015-05-07 at 2 39 37 pm

Ideas for text for the notice, trying to say "Your avatar will be shown on this page once you have downloaded this data" but short, simple, and from their perspective:

equivalentideas commented 9 years ago

@henare suggested I make it a bit more obvious with a colour or alert style of some kind. Gonna give that a go.

mlandauer commented 9 years ago

I think I like "Your downloads will be shown above" the best. Just replacing the "displayed" with the simpler and shorter "shown".

I think there should also be a link to a short paragraph explaining things in a little more detail - more specifically why we're doing this (to encourage serendipity and allow people to understand who is using the data and how).

The link could be something as simple as "Why?"

Your downloads will be shown above. Why?