openaq / openaq-upload

Batch uploader for OpenAQ
MIT License

Scope uploader v1 #1

Open olafveerman opened 7 years ago

olafveerman commented 7 years ago

Ticket to track and discuss the overall scope of openaq-upload v1 (still looking for witty project name). This uploader will live on something like http://upload.openaq.org and provide researchers and scientists with a way to bulk upload measurements in CSV format.

Authentication

Data can be uploaded by people with a Bulk Upload Key. These keys are managed by the OpenAQ moderators. We will implement a lightweight system to manage keys and associated metadata like email address. This will be heavily borrowed from OAM.

@danielfdsilva Do we need to set up a separate project for this?

Upload workflow

The main upload workflow will be:

  1. Go to the form
    1. Provide the Bulk Upload Key (BUK)
    2. Attach the CSV file for verification
    3. Hit the 'Verify Data' button. This will trigger a loading indicator
  2. Transform the CSV into JSON in the browser
  3. Verify the JSON in the browser, using a JSON schema (see the verification sketch after this list). This checks:
    1. that all the required headers are there
    2. that values are numbers, dates are in the correct format, lon/lat are valid, etc.
    3. whether there is already a location/city in the system. If so, present a warning with a nice help text saying that source X has data for that city/location. This shouldn't fail, since it can be a legitimate upload.
  4. On the next page, an overview of the verification is presented to the user. This includes a summary (# of rows), the warnings and the errors
  5. When there are no errors and the user confirms the upload (see the upload sketch after this list):
    1. the form makes a request to the OpenAQ API
    2. the API authenticates the token and returns a signed URL (@jflasher implements this endpoint)
    3. the form makes a PUT request to that URL
    4. the CSV is stored in an S3 bucket
  6. The form gives a nice feedback message: you should be receiving an email in x minutes
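
To make steps 2 to 4 concrete, here is a minimal sketch of the in-browser verification, assuming PapaParse for the CSV-to-JSON step and Ajv for JSON Schema validation; the schema and column names are illustrative placeholders, not the agreed upload format:

```typescript
// Minimal verification sketch (assumes PapaParse + Ajv; the schema and the
// column names below are placeholders, not the final upload format).
import Papa from 'papaparse';
import Ajv from 'ajv';

const measurementSchema = {
  type: 'object',
  required: ['location', 'parameter', 'value', 'unit', 'date'],
  properties: {
    location: { type: 'string' },
    parameter: { type: 'string' },
    value: { type: 'number' },
    unit: { type: 'string' },
    date: { type: 'string' },
    latitude: { type: 'number', minimum: -90, maximum: 90 },
    longitude: { type: 'number', minimum: -180, maximum: 180 }
  }
};

interface VerificationResult {
  rowCount: number;
  errors: string[];
  warnings: string[];
}

function verifyCsv(file: File): Promise<VerificationResult> {
  return new Promise((resolve) => {
    Papa.parse<Record<string, unknown>>(file, {
      header: true,         // first line of the CSV holds the column names
      dynamicTyping: true,  // coerce numeric strings to numbers
      skipEmptyLines: true,
      complete: (results) => {
        const ajv = new Ajv({ allErrors: true });
        const validate = ajv.compile(measurementSchema);
        const errors: string[] = [];
        results.data.forEach((row, i) => {
          if (!validate(row)) {
            errors.push(`Row ${i + 1}: ${ajv.errorsText(validate.errors)}`);
          }
        });
        // Warnings (e.g. "this location/city already exists") would be added
        // here after checking the OpenAQ API for existing locations.
        resolve({ rowCount: results.data.length, errors, warnings: [] });
      }
    });
  });
}
```

The required-headers check falls out of the `required` list in the schema: a row missing one of those columns fails validation with a message that can be surfaced on the overview page.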
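
And for step 5, a sketch of the signed-URL upload; the endpoint path and the response shape are assumptions, since the API endpoint @jflasher mentions isn't implemented yet:

```typescript
// Upload sketch (the /upload/sign path and the { url } response shape are
// assumptions; the real endpoint still needs to be added to the OpenAQ API).
async function uploadCsv(file: File, bulkUploadKey: string): Promise<void> {
  // Exchange the Bulk Upload Key for a pre-signed S3 PUT URL.
  const signRes = await fetch('https://api.openaq.org/upload/sign', {
    method: 'POST',
    headers: { Authorization: `Bearer ${bulkUploadKey}` }
  });
  if (!signRes.ok) throw new Error('Bulk Upload Key was not accepted');
  const { url } = await signRes.json();

  // PUT the CSV straight to the S3 bucket via the signed URL.
  const putRes = await fetch(url, {
    method: 'PUT',
    headers: { 'Content-Type': 'text/csv' },
    body: file
  });
  if (!putRes.ok) throw new Error('Upload to S3 failed');
}
```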

Behind the scenes

The CSV file will be treated like any other source. A dedicated batch_upload adapter checks the S3 bucket every 10 minutes. If a file is detected, the adapter processes and stores the data. On success, it sends an email to the associated person.
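
A rough sketch of what that adapter loop could look like, assuming the AWS SDK for JavaScript; the bucket name and the process/notify helpers are placeholders:

```typescript
// batch_upload adapter sketch (bucket name and helper bodies are placeholders).
import { S3 } from 'aws-sdk';

const s3 = new S3();
const BUCKET = 'openaq-bulk-uploads';  // placeholder bucket name

async function processCsv(body: Buffer): Promise<void> {
  // Placeholder: parse the CSV and store the measurements like any other source.
}

async function notifyUploader(key: string): Promise<void> {
  // Placeholder: email the address associated with the upload key.
}

async function checkBucket(): Promise<void> {
  const listing = await s3.listObjectsV2({ Bucket: BUCKET }).promise();
  for (const obj of listing.Contents ?? []) {
    if (!obj.Key) continue;
    const file = await s3.getObject({ Bucket: BUCKET, Key: obj.Key }).promise();
    await processCsv(file.Body as Buffer);
    await notifyUploader(obj.Key);
    // Delete (or archive) the file so it isn't picked up again on the next pass.
    await s3.deleteObject({ Bucket: BUCKET, Key: obj.Key }).promise();
  }
}

// Run every 10 minutes (cron, or a simple interval inside the fetch process).
setInterval(() => { checkBucket().catch(console.error); }, 10 * 60 * 1000);
```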


Comments, questions? @jflasher @ascalamogna @danielfdsilva @nbumbarger @RocketD0g

RocketD0g commented 7 years ago

For reference, my guess is that researchers will share location-specific data (perhaps in batches of different locations). I think most typically they will have a txt or csv file spit out from a single instrument (that may measure one or more pollutants) at an individual geographic location. They will modify that single-instrument, single-location file to fit the CSV criteria we specify for upload. Does that make sense? Anyone have other thoughts?

I don't know if we will have a form that captures data common to all data points from the same location (e.g. geographic coordinates, attribution, etc.) or if it is best to have that information repeated in the file for each data point? (See the example below.)
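
For illustration only, the "repeat everything per row" option could look like this (the column names and values are made up, not the agreed format):

```csv
location,city,parameter,value,unit,date_utc,latitude,longitude,attribution
Station A,Kathmandu,pm25,57.0,µg/m³,2016-10-18T10:00:00Z,27.717,85.324,Example University
Station A,Kathmandu,pm25,61.5,µg/m³,2016-10-18T11:00:00Z,27.717,85.324,Example University
```

The alternative would be to capture location, coordinates and attribution once in the form and keep only the parameter/value/date columns in the file.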

danielfdsilva commented 7 years ago

@olafveerman server-side logic can live with the uploader, but we'll need a different project for the admin panel.

jflasher commented 7 years ago

@danielfdsilva what sort of server-side logic do you think the uploader will need? The piece to get a signed URL for the S3 PUT should live in the API, I think.

Just talked with @RocketD0g about the admin panel and I think we can skip implementing it at this point. I don't think the time/difficulty needed to implement it will be worth the number of people expected to use it. The other tricky bit here is that the database is now shared publicly, so I'm not sure how we'll store the private upload keys (maybe encrypted?) anyway. Then I thought maybe we could use a separate DB just for this, but that doesn't seem great either.

My recommended approach is to have an email link on the form where they can request a token (sent to info@openaq.org). I'll pregen 10 tokens or so that can be given out to people. This seems like an easier approach for the expected amount of usage and lets us focus on some of the other things.

Sound good to everyone?

danielfdsilva commented 7 years ago

@jflasher true that the token management system has a couple of endpoints, but we'd be reusing the OAM one, so the only work involved would be changing the style to fit OpenAQ's. And yes, we'd need a MongoDB instance to store the keys.

Pregen tokens are a simpler approach, but not as easy to manage.

jflasher commented 7 years ago

I think for now let's stick with pregen. Definitely agree it's not as easy to manage, but I don't want to have to manage an admin instance and a MongoDB instance just for this.
