Resolve storage location of input/output data files

terryf82 commented 6 years ago

The current path to the data files

/boston-crash-modeling/osm-data

may be problematic from a Docker perspective. The image (which the container runs) is built by copying in the entire project repo at the point the image is created. This lets us build the image with a specific state of /boston-crash-modeling, which is good, but we of course want to vary the data we're running against (for different cities). It also becomes problematic when you want to use the container to develop locally, by running it with a local project code directory mounted over the copy in the image. The current setup doesn't allow for overriding just the code; you'd be overriding both the code and the data.
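One possible way to decouple the two, sketched here under assumptions: the code looks up its data directory from an environment variable and only falls back to the in-repo folder. The variable name `CRASH_MODEL_DATA_DIR` and the helper `resolve_data_dir` are hypothetical and not part of the project.

```python
import os
from pathlib import Path

# Hypothetical variable name; nothing in the project defines it today.
DATA_DIR_ENV = "CRASH_MODEL_DATA_DIR"


def resolve_data_dir(default: str = "osm-data") -> Path:
    """Return the data directory, preferring the environment override.

    An unset or empty variable falls back to `default`, which is
    resolved relative to the current working directory.
    """
    configured = os.environ.get(DATA_DIR_ENV) or default
    return Path(configured).resolve()
```

With something like this in place, the data directory could be mounted into the container at its own path, independently of any code mount, and pointed to with the variable.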

More broadly though, should we not be looking to automatically pull input data into the project separately from the code anyway, e.g. from Google Drive, S3, etc.? If this is to be developed as a pipeline, I think we want a certain degree of 'self-serve' to be included (get your city-specific files stored online, specify the URLs and execute the app to generate predictions).
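As a rough sketch of that 'self-serve' step, assuming the input files are reachable by plain URLs: a small helper that downloads each one into a data directory kept apart from the code. The function name `fetch_inputs` and the name-to-URL mapping are hypothetical; services that need authentication (Google Drive, private S3 buckets) would need their own client instead of `urllib`.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_inputs(sources: dict[str, str], data_dir: str) -> list[Path]:
    """Download each URL to data_dir/<name>, skipping files already present.

    `sources` maps a local file name to the URL it should be fetched from.
    Returns the local paths in the same order as `sources`.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name, url in sources.items():
        dest = target / name
        if not dest.exists():
            # Stream the response to disk rather than reading it into memory.
            with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
                shutil.copyfileobj(resp, out)
        fetched.append(dest)
    return fetched
```

Skipping files that already exist keeps repeated runs cheap, at the cost of never refreshing a file that changed upstream; a checksum or a force flag would be the obvious next step.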

Please add any thoughts / opinions / questions, thanks.

andhint commented 6 years ago

@terryf82 We had talked about this before you joined. I'm personally really not a fan of data.world. To me it seems not well suited for larger data collections that require multiple folders. I think having something where we could pull the input data in separately would be ideal, and I'm not sure if there's a way to do that with data.world.

terryf82 commented 6 years ago

@andhint Thanks, this is a good time to think about this while focusing on data standards and the data transformation phase of the project (basically everything that needs to happen around data prior to the generation of predictions starting). I think all we really need at the moment is reliable online storage that can be accessed in an automated way. Data.world does have an API (https://apidocs.data.world/v0/data-world-for-developers/intro) so I'll look into that further and come back with some ideas on usability.

j-t-t commented 6 years ago

Happy to switch. I know GitHub has large file support, so that might be worth looking at.

The one thing we just want to be aware of is that at the moment I believe some of the data that Boston gave us isn't something they were yet willing to have publicly available. However, if we're moving away from using Boston's data anyway, maybe that matters less.

bpben commented 6 years ago

@terryf82 @andhint @j-t-t Where are we at with this? Seems like for now things live on data.world? Should we move this to 1.2?

j-t-t commented 6 years ago

Pushing it back for now makes sense.

terryf82 commented 6 years ago

@bpben I can't see any value in changing the setup for 1.1.

We should just ensure that the documentation includes a link to a complete copy of the 3 cities' runnable input data, which I uploaded the other day:

https://query.data.world/s/xtrjz2h2yar74ugtgkxi6yi4etwc23

@j-t-t can you just double check this zip has everything we need from your perspective?
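A check like this could also be scripted. The sketch below reports which expected entries are absent from a downloaded zip; the helper name `missing_from_zip` is hypothetical, and the list of expected paths would have to come from whoever knows what each city needs.

```python
import zipfile


def missing_from_zip(zip_path: str, expected: list[str]) -> list[str]:
    """Return the expected archive entries that are not in the zip, sorted."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    return sorted(path for path in expected if path not in names)
```

An empty list means every expected entry was found; names are compared exactly as stored in the archive, including any folder prefix.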

insight-lane / crash-model

Resolve storage location of input/output data files #65