hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Cloud-based Data Science #223

Open DingoEatingFuzz opened 5 years ago

DingoEatingFuzz commented 5 years ago

For large datasets, improved collaboration, and link sharing, data science should be done ✨ In The Cloud ✨

Since most of our data science is done in the Python ecosystem, Jupyter Notebook is the most obvious technology choice. R and RStudio come in as a close second.

Ideally, we self-host this so that the notebooks run in the same data center as the data and get lower latency, compared to open tools where data would have to be transferred over arbitrary distances and unknown network conditions.

Ideal solution

Other tools to look at

TODO

znmeb commented 5 years ago

My only concern with SageMaker is that it seems to be geared towards a machine learning workflow / mindset. I think that's a great strategic goal (TensorFlow is eating the world), but I'm not sure how well that fits the tactical situation. It's definitely accessible from R / RStudio, so it wouldn't lock R programmers out.

danieldn commented 5 years ago

Holding off until the needs for SageMaker are clarified.

karenng-civicsoftware commented 5 years ago

@danieldn @DingoEatingFuzz did we decide whether to use SageMaker for people needing cloud access? The other cloud resource we recommended was Google Colaboratory notebooks, which are free.

DingoEatingFuzz commented 5 years ago

There is an open PR to introduce the SageMaker infrastructure: https://github.com/hackoregon/hackoregon-aws-infrastructure/pull/61

This will make it easy for us to provision notebook instances, but I still want to be conservative about when we do that, since it can be costly.

If someone is working with private data or large datasets and cannot do the work locally, they should request a notebook instance.