quiltdata / quilt

Quilt is a data mesh for connecting people with actionable data
https://quiltdata.com
Apache License 2.0

System-wide local storage #272

Closed ellisonbg closed 6 years ago

ellisonbg commented 6 years ago

I have a multi-user Linux server that I will be using for a Data Science class. I would like to use quilt to distribute datasets with Jupyter Notebooks to students. But I don't want all students to download their own copies of the data when using quilt. I see that quilt uses appdirs.user_data_dir() to get the directory to use, and that I can set XDG_DATA_HOME to override that location:

https://github.com/ActiveState/appdirs/blob/master/appdirs.py#L92
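For readers who want to see what that override does, here is a minimal standard-library sketch that approximates the Linux branch of appdirs.user_data_dir(). It is not Quilt's or appdirs' actual code: the helper name linux_user_data_dir and the app name "ExampleApp" are placeholders made up for illustration.

```python
import os


def linux_user_data_dir(appname, environ=None):
    """Approximate the Linux behavior of appdirs.user_data_dir().

    Hypothetical helper for illustration only: the base directory comes
    from XDG_DATA_HOME when it is set, and defaults to ~/.local/share.
    """
    env = os.environ if environ is None else environ
    base = env.get("XDG_DATA_HOME", os.path.expanduser("~/.local/share"))
    return os.path.join(base, appname)


# "ExampleApp" is a placeholder, not the name Quilt passes to appdirs.
print(linux_user_data_dir("ExampleApp", {}))
print(linux_user_data_dir("ExampleApp", {"XDG_DATA_HOME": "/srv/shared-data"}))
```

Because the variable is read by every program that follows the XDG convention, exporting it changes the data directory for all of them, not only for Quilt.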

Thanks!

dimaryaz commented 6 years ago

Hi @ellisonbg,

XDG_DATA_HOME is used by lots of other programs, not just Quilt, so changing it could break things. Also, that directory is meant to be private, so sharing it between different users is a bad idea. (E.g., Quilt stores login credentials there, not just packages.)

Moving just the quilt_packages directory would be possible with simple code changes - though even that may be a bad idea: 1) malicious users could easily mess with others' data; 2) if two users started downloading the same package simultaneously, they'd be writing to the same files and likely corrupt the data.

The only "safe" way to do this would be a global directory that's readable by everyone, but only writeable by an admin who pre-installs a few useful packages there. Quilt almost supports reading packages from multiple directories; we've discussed using an environment variable like QUILT_PACKAGE_DIRS, but haven't actually implemented it yet. This would be the place to do it: https://github.com/quiltdata/quilt/blob/master/compiler/quilt/tools/store.py#L85
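The "readable by everyone, writeable only by an admin" requirement can be checked mechanically. The sketch below is an assumption-laden illustration, not part of Quilt: the function name is_safe_shared_dir is hypothetical, and it treats group write access as unsafe, which may be stricter than some deployments need.

```python
import os
import stat
import tempfile


def is_safe_shared_dir(path):
    """Return True if `path` is a directory that other users can read and
    traverse, but that neither the group nor other users can write to.

    Hypothetical check for illustration; it inspects permission bits only
    and does not verify who owns the directory.
    """
    mode = os.stat(path).st_mode
    if not stat.S_ISDIR(mode):
        return False
    world_readable = bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)
    others_can_write = bool(mode & (stat.S_IWGRP | stat.S_IWOTH))
    return world_readable and not others_can_write


# Demo on a throwaway directory: rwxr-xr-x is the layout described above.
demo_dir = tempfile.mkdtemp()
os.chmod(demo_dir, 0o755)
print(is_safe_shared_dir(demo_dir))
```

A directory with mode 0o777 would fail this check, because any user could then modify or replace the pre-installed packages.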

ellisonbg commented 6 years ago

Thanks for the update!


— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/quiltdata/quilt/issues/272#issuecomment-355906139, or mute the thread https://github.com/notifications/unsubscribe-auth/AABr0ItTRkWvn_BWf8qeIwIs1csgfT1Jks5tIdEfgaJpZM4RV9ZQ .

-- Brian E. Granger Associate Professor of Physics and Data Science Cal Poly State University, San Luis Obispo @ellisonbg on Twitter and GitHub bgranger@calpoly.edu and ellisonbg@gmail.com

asah commented 6 years ago

hi Brian--

We're on it! and excited to add this to Quilt right away - this is an obvious and killer feature. In the meantime, can I ask a few "requirements" type questions:

  1. what are the goal(s) of sharing the storage? to save storage space? pre-install packages for users? something else?

  2. are all of your students sharing a single Linux server/instance? if so, how are they isolated from one another (e.g. Linux user accounts?)? if not (separate instances), how do you plan to share the filesystem (e.g. NFS mounts?)? Can students run code (and install data) on their own computers/laptops? If so, is it OK that they have a copy of the data locally (e.g. disconnected operation)?

  3. if you're trying to save space, what are the numbers? # of files/dataframes? typical size of each? how many students? how much storage space do you have? etc.

  4. do you need to run your own registry? or can you have students use the normal public Quilt registry (which is free)? https://quiltdata.com/search/?q= If you're running your own registry instance, can we assume there's lots of bandwidth between the students' instance(s) and the registry (e.g. same AWS 'region')?

thanks! adam

ellisonbg commented 6 years ago

On Tue, Jan 9, 2018 at 1:25 PM, Adam Sah notifications@github.com wrote:

  1. what are the goal(s) of sharing the storage? to save storage space? pre-install packages for users? something else?

Great questions! Goals:

So mostly performance optimizations.

  2. are all of your students sharing a single Linux server/instance? if so, how are they isolated from one another (e.g. Linux user accounts?)? if not (separate instances), how do you plan to share the filesystem (e.g. NFS mounts?)? Can students run code (and install data) on their own computers/laptops? If so, is it OK that they have a copy of the data locally (e.g. disconnected operation)?

Yes, in this case, the students are on a shared Ubuntu 16.04 server. They have standard shell accounts, and we run their single-user Jupyter notebook servers outside of Docker containers. However, we have other configurations that are similar, but with users running in containers.

  3. if you're trying to save space, what are the numbers? # of files/dataframes? typical size of each? how many students? how much storage space do you have? etc.

This particular instance has a 1 TB EBS SSD volume, but it varies by deployment.

The rough number of datasets might be a few dozen over the course of a quarter. Most are small (1-10 MB), but towards the end of the quarter, students start to do projects with a few multi-GB datasets (CSV files of that size).

  4. do you need to run your own registry? or can you have students use the normal public Quilt registry (which is free)? https://quiltdata.com/search/?q= If you're running your own registry instance, can we assume there's lots of bandwidth between the students' instance(s) and the registry (e.g. same AWS 'region')?

I don't think we need our own registry, unless we start to have data large enough that using the public registry becomes a problem. In a classroom setting like this, there is a rather nice pressure to keep datasets small; otherwise it is hard for students to finish things on tight deadlines.

Thanks!

Cheers,

Brian


asah commented 6 years ago

In that case, I think we have good news! A preview: https://github.com/quiltdata/quilt/pull/286

Obviously, there will be docs and examples showing how to set this up and manage it... but basically, an admin can designate one or more shared directories, which clients access (read-only, via import) by setting an environment variable (QUILT_PACKAGE_DIRS). If a package isn't available in the user's local directory, Quilt checks the shared directories. A shared directory is simply the "local" quilt_packages directory of the admin account on the same (network) file system.
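The lookup order described above (local directory first, then each shared directory) can be sketched as follows. This is an illustration under stated assumptions, not Quilt's implementation: the function name find_package is hypothetical, the variable is assumed to hold directories joined by the platform path separator, and each package is assumed to live in its own subdirectory.

```python
import os


def find_package(name, local_dir, environ=None):
    """Return the first directory containing package `name`, or None.

    Hypothetical sketch: searches the user's local package directory
    first, then every directory listed in QUILT_PACKAGE_DIRS.
    """
    env = os.environ if environ is None else environ
    # Assumption: entries are separated by os.pathsep (":" on Linux).
    shared_dirs = [d for d in env.get("QUILT_PACKAGE_DIRS", "").split(os.pathsep) if d]
    for base in [local_dir] + shared_dirs:
        candidate = os.path.join(base, name)
        # Assumption: one subdirectory per package.
        if os.path.isdir(candidate):
            return candidate
    return None
```

With this ordering, a package a student installs locally shadows the admin's shared copy of the same name, and nothing is ever written to the shared directories during lookup.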

We have a few other features/changes queued for master & release (to pypi aka pip install) over the next few days, but if you're feeling adventurous, you're welcome to try this right now. We'd love the feedback.

ellisonbg commented 6 years ago

Wow, thanks! I will definitely give this a test and report back.

On Wed, Jan 10, 2018 at 11:38 AM, Adam Sah notifications@github.com wrote:

In that case, I think we have very good news coming your way. A preview:

#283 https://github.com/quiltdata/quilt/pull/283

Obviously, there will be docs and examples showing how to set this up and manage it... but basically, an admin can designate one or more shared director(ies) which clients access read-only by setting an environment variable (QUILT_PACKAGE_DIRS). If a package isn't available in the user's local directory, it checks the shared directories.


akarve commented 6 years ago

@ellisonbg We've moved the work to a new PR, #286

akarve commented 6 years ago

Resolved in #286 and added to the docs here. Let us know if you run into any hiccups.

ellisonbg commented 6 years ago

Many thanks!

On Tue, Jan 30, 2018 at 9:27 AM, Aneesh Karve notifications@github.com wrote:

Closed #272 https://github.com/quiltdata/quilt/issues/272.


akarve commented 6 years ago

@ellisonbg: @dimaryaz will get back to you with which hash to use on master; these changes aren't on pip yet

akarve commented 6 years ago

Keeping issue open until release hits PyPI (better experience for students who install Quilt).

akarve commented 6 years ago

Installing Quilt from top of tree master will work for @ellisonbg and students:

[redacted :)]

akarve commented 6 years ago

Whoops. The above does not work with the new project structure. Will circle back with installation instructions.

ellisonbg commented 6 years ago

ok!



dimaryaz commented 6 years ago

@ellisonbg: here's the correct command to install from git master:

pip install --user 'git+https://github.com/quiltdata/quilt#subdirectory=compiler'

(The --user flag is there for installs without root access; you can drop it if you're installing as root.)

ellisonbg commented 6 years ago

Ahh, thanks!



akarve commented 6 years ago

Docs now sync'd with gitbook at https://docs.quiltdata.com/shared-store.html

akarve commented 6 years ago

Merged and released to pip: https://github.com/quiltdata/quilt/releases/tag/2.9.0 https://pypi.python.org/pypi/quilt

Now it's just pip install quilt for system-wide local storage.