uwescience / myria-web

Web frontend for Myria
https://demo.myria.cs.washington.edu
Other
11 stars 14 forks source link

deleting tables from myria? #261

Closed sophieclayton closed 9 years ago

sophieclayton commented 9 years ago

is this something that is currently possible? will it be implemented in future? do different tables have different permissions depending on who the owner is? (i.e. the user/project defined for the table when it is uploaded or created)

dhalperi commented 9 years ago

Excellent questions!

is this something that is currently possible?

No, although datasets in Myria are never responsible for disk space issues -- Myria takes a small fraction of the available disk space.

will it be implemented in future?

Absolutely!

do different tables have different permissions depending on who the owner is? (i.e. the user/project defined for the table when it is uploaded or created)

Not currently, but they will! And this questions gets at the reason it's not currently implemented... :)

dhalperi commented 9 years ago

found at least 1 relevant issue here: https://github.com/uwescience/myria/issues/537

jortiz16 commented 9 years ago

I know this has already been marked as important and there are reasons as to why it is not implemented yet. Just want to add this info to this post. Disks are getting full (algol is at 100% at the moment) and I already have a few tables I would like to delete (~300GB) on some nodes:

                      relation                                |  size   
--------------------------------------------------------------+---------
 public.public:adhoc:snapshot1818                             | 15 GB
 public.public:adhoc:snapshot1818Hash                         | 15 GB
 public.public:adhoc:snapshot1745                             | 15 GB
 public.public:adhoc:snapshot1745Hash                         | 15 GB
 public.public:adhoc:snapshot1648                             | 15 GB
 public.public:adhoc:snapshot1648Hash                         | 15 GB
 public.public:adhoc:snapshot1556                             | 15 GB
 public.public:adhoc:snapshot1556Hash                         | 15 GB
 public.public:adhoc:snapshot1536                             | 15 GB
 public.public:adhoc:snapshot1536Hash                         | 15 GB
 public.public:adhoc:snapshot1469                             | 15 GB
 public.public:adhoc:snapshot1469Hash                         | 15 GB
 public.public:adhoc:snapshot1387                             | 15 GB
 public.public:adhoc:snapshot1380Hash                         | 15 GB
 public.public:adhoc:snapshot1387Hash                         | 15 GB
 public.public:adhoc:snapshot1236Hash                         | 15 GB
 public.public:adhoc:snapshot1167Hash                         | 15 GB

I mean, I could delete these relations from postgres directly. Would that be an okay workaround for now? I'm afraid the catalog will keep information about tables that do not exist anymore. I still have about 1.5TB I want to ingest into myria soon.

Another thing I could do is delete raw data from HDFS? Or move it to S3?

billhowe commented 9 years ago

What are the reasons it hasn't been implemented?

I'd say dive in and add the feature!!

On Wed, Mar 25, 2015 at 10:24 AM, jortiz16 notifications@github.com wrote:

I know this has already been marked as important and there are reasons as to why it is not implemented yet. Just want to add this info to this post. Disks are getting full (algol is at 100% at the moment) and I already have a few tables I would like to delete (~300GB) on some nodes:

              relation                           |  size

--------------------------------------------------------------+--------- public.public:adhoc:snapshot1818 | 15 GB public.public:adhoc:snapshot1818Hash | 15 GB public.public:adhoc:snapshot1745 | 15 GB public.public:adhoc:snapshot1745Hash | 15 GB public.public:adhoc:snapshot1648 | 15 GB public.public:adhoc:snapshot1648Hash | 15 GB public.public:adhoc:snapshot1556 | 15 GB public.public:adhoc:snapshot1556Hash | 15 GB public.public:adhoc:snapshot1536 | 15 GB public.public:adhoc:snapshot1536Hash | 15 GB public.public:adhoc:snapshot1469 | 15 GB public.public:adhoc:snapshot1469Hash | 15 GB public.public:adhoc:snapshot1387 | 15 GB public.public:adhoc:snapshot1380Hash | 15 GB public.public:adhoc:snapshot1387Hash | 15 GB public.public:adhoc:snapshot1236Hash | 15 GB public.public:adhoc:snapshot1167Hash | 15 GB

I mean, I could delete these relations from postgres directly. Would that be an okay workaround for now? I'm afraid the catalog will keep information about tables that do not exist anymore. I still have about 1.5TB I want to ingest into myria soon.

Another thing I could do is delete raw data from HDFS? Or move it to S3?

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86131017 .

Bill Howe Associate Director and Senior Data Science Fellow, UW eScience Institute Affiliate Faculty, Computer Science & Engineering University of Washington To acknowledge eScience: "Supported in part by the University of Washington eScience Institute"

dhalperi commented 9 years ago

Many of the students wanted to put this behind user authentication, so that "you can't delete my tables".

mbalazin commented 9 years ago

One approach would be to add a button to the web front-end that enables table deletion BUT is clearly marked as "danger zone". Could we split this issue into two: One person could implement the python side of table deletion (update catalogs and drop from PostgreSQL) and another person changes the GUI? Anyone comfortable with making either change?

BrandonHaynes commented 9 years ago

Random workaround idea: ingest a single tuple using the name of an existing relation, and set argOverwriteTable=True on the DbInsert operator. You could do this via Myria-Python using my branch.

Also -- would it be helpful to have a program name "temporary" where relations are just automatically reclaimed after some period of time? This might be a workaround for explicit deletion, since it wouldn't really require authorization.

mbalazin commented 9 years ago

Clever workaround and I like the idea of explicit temp relations. magda

Magdalena Balazinska Jean Loup Baer Professor of Computer Science and Engineering Associate Professor, Dept. of Computer Science & Engineering Director of the IGERT PhD Program in Big Data / Data Science Senior Data Science Fellow of the eScience Institute University of Washington

On Wed, Mar 25, 2015 at 10:49 AM, Brandon Haynes notifications@github.com wrote:

Random workaround idea: ingest a single tuple using the name of an existing relation, and set argOverwriteTable=True on the DbInsert operator. You could do this via Myria-Python using my branch.

Also -- would it be helpful to have a program name "temporary" where relations are just automatically reclaimed after some period of time? This might be a workaround for explicit deletion, since it wouldn't really require authorization.

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86144326 .

billhowe commented 9 years ago

On Wed, Mar 25, 2015 at 10:48 AM, Magdalena Balazinska < notifications@github.com> wrote:

One approach would be to add a button to the web front-end that enables table deletion BUT is clearly marked as "danger zone". Could we split this issue into two: One person could implement the python side of table deletion (update catalogs and drop from PostgreSQL) and another person changes the GUI? Anyone comfortable with making either change?

This seems right. (Although the table deletion and catalog update is in the java rather than python, since python doesn't interact with Postgres.)

(I was typing the following when Magda's message came in)

Here's what I'd recommend:

[] Write the appropriate java method to delete all partitions from all postgres instances and update the catalog. To avoid half-deleted datasets, use TRUNCATE inside of a transaction, and rollback if anything goes wrong. Once successfully TRUNCATE'd and catalog is updated, commit all the transactions and drop the tables.

[] Expose this method as a REST call, and expose it in myria-web.

[] Don't expose a button yet; just provide a secret url that allows deleting.

[] OPTIONAL: Add an "undeleteable" flag to certain datasets in the catalog so they can't be removed accidentally. We have a few "precious" datasets that would be a pain to reload, and we never want to get rid of.

At this point, deploy. I believe accidental deletions won't happen.

Then:

[] Merge master to the myria-web branch that has authentication, test and deploy it: https://github.com/uwescience/myria-web/blob/myria_auth

This allows people to authenticate with google, but does nothing with the credentials. Need to test/fix that the generated javascript interface is always using authenticated HTTP calls. I think that was true or mostly true.

[] In the myria-web delete call, check that the user is authenticated, and optionally verify that the table they are trying to delete matches their username.

[] Clean up the catalog to assign the table names to the appropriate users.

dhalperi commented 9 years ago

There is enough complexity in this change that one person should really do all of it.

billhowe commented 9 years ago

I'm in favor of both of these suggestions!!

Open a new issue for the "temporary" reclaim job?

And, can we capture an example program for inserting a single tuple via MyriaL?

On Wed, Mar 25, 2015 at 10:49 AM, Brandon Haynes notifications@github.com wrote:

Random workaround idea: ingest a single tuple using the name of an existing relation, and set argOverwriteTable=True on the DbInsert operator. You could do this via Myria-Python using my branch.

Also -- would it be helpful to have a program name "temporary" where relations are just automatically reclaimed after some period of time? This might be a workaround for explicit deletion, since it wouldn't really require authorization.

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86144326 .

Bill Howe Associate Director and Senior Data Science Fellow, UW eScience Institute Affiliate Faculty, Computer Science & Engineering University of Washington To acknowledge eScience: "Supported in part by the University of Washington eScience Institute"

dhalperi commented 9 years ago

Why 1 tuple?

a = empty(x:int);
store(a, public:adhoc:snapshot1818);

This will automatically be run on every worker.

jortiz16 commented 9 years ago

I can work on the deletion for the postgres/myria catalog later starting next week since I really need it. Would this be a bad first job to pick up given I'm not familiar with developing in Myria? Or perhaps I can partner up with someone else too?

billhowe commented 9 years ago

On Wed, Mar 25, 2015 at 11:17 AM, jortiz16 notifications@github.com wrote:

I can work on the deletion for the postgres/myria catalog later starting next week since I really need it. Would this be a bad first job to pick up given I'm not familiar with developing in Myria? Or perhaps I can partner up with someone else too?

THANKS JEN!!! You're a champ. :)

My blindly optimistic view is that there is no bad first job. :)

dhalperi commented 9 years ago

Just a minor reiteration that, while there is a lot of storage in Postgres, the majority of the data in postgres is in fact not in the production database.

If you look at the 4 hard drives on Algol: /disk4 has ~630GB more free than /disk1. HDFS and Myria-production use disk space evenly across the entire cluster, so that extra space is not attributable to either of these.

In other words, the data you really need to delete is probably in individual students' experiments, which are not balanced properly across the cluster.

dhalperi commented 9 years ago

Interestingly, this implementation is of course going to be significantly complicated by views. It seems that someone has created a view on snapshot1818Hash, so it cannot be deleted without the addition of the CASCADE command. If that view is in the Catalog, deleting the table, especially cascading, would be a breaking change.

What should be done in this case? See: https://github.com/uwescience/myria/issues/728

I would say ignore views for now, but then whoever adds view support for real needs to handle this case appropriately.

billhowe commented 9 years ago

Postgres-level views are not part of our model, AFAIK. How did they get there? How does one create a view?

On Wed, Mar 25, 2015 at 11:34 AM, Daniel Halperin notifications@github.com wrote:

Interestingly, this implementation is of course going to be significantly complicated by views. It seems that someone has created a view on snapshot1818Hash, so it cannot be deleted without the addition of the CASCADE command. If that view is in the Catalog, deleting the table, especially cascading, would be a breaking change.

What should be done in this case? See: uwescience/myria#728 https://github.com/uwescience/myria/issues/728

I would say ignore views for now, but then whoever adds view support for real needs to handle this case appropriately.

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86162491 .

Bill Howe Associate Director and Senior Data Science Fellow, UW eScience Institute Affiliate Faculty, Computer Science & Engineering University of Washington To acknowledge eScience: "Supported in part by the University of Washington eScience Institute"

dhalperi commented 9 years ago

I assume someone did it manually. However, the non-Postgres version of a view makes this even more complicated because Postgres won't catch it for you ;)

— Sent from my phone

On Wed, Mar 25, 2015 at 11:45 AM, billhowe notifications@github.com wrote:

Postgres-level views are not part of our model, AFAIK. How did they get there? How does one create a view? On Wed, Mar 25, 2015 at 11:34 AM, Daniel Halperin notifications@github.com wrote:

Interestingly, this implementation is of course going to be significantly complicated by views. It seems that someone has created a view on snapshot1818Hash, so it cannot be deleted without the addition of the CASCADE command. If that view is in the Catalog, deleting the table, especially cascading, would be a breaking change.

What should be done in this case? See: uwescience/myria#728 https://github.com/uwescience/myria/issues/728

I would say ignore views for now, but then whoever adds view support for real needs to handle this case appropriately.

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86162491 .

Bill Howe Associate Director and Senior Data Science Fellow, UW eScience Institute Affiliate Faculty, Computer Science & Engineering University of Washington To acknowledge eScience: "Supported in part by the University of Washington

eScience Institute"

Reply to this email directly or view it on GitHub: https://github.com/uwescience/myria-web/issues/261#issuecomment-86167965

jortiz16 commented 9 years ago

I don't remember why a view would be on that relation, but I probably did it. The only thing I do remember creating manually on those relations are indexes.

billhowe commented 9 years ago

On Wed, Mar 25, 2015 at 12:24 PM, jortiz16 notifications@github.com wrote:

I don't remember why a view would be on that relation, but I probably did it. The only thing I do remember creating manually on those relations are indexes.

Totally harmless, and obviously useful to do!

DELETE CASCADE will take care of it.

(In fact, it would be good to generate these indexes automatically, maybe as a nightly job by inspecting the query log and seeing what would be useful...)

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86181380 .

Bill Howe Associate Director and Senior Data Science Fellow, UW eScience Institute Affiliate Faculty, Computer Science & Engineering University of Washington To acknowledge eScience: "Supported in part by the University of Washington eScience Institute"

dhalperi commented 9 years ago

@jortiz16 look familiar?

myria1=> \d+ "test";
                        View "public.test"
  Column  |       Type       | Modifiers | Storage  | Description 
----------+------------------+-----------+----------+-------------
 timestep | integer          |           | plain    | 
 iord     | integer          |           | plain    | 
 mass     | double precision |           | plain    | 
 grp      | integer          |           | plain    | 
 type     | text             |           | extended | 
View definition:
         SELECT DISTINCT 1 AS timestep, "public adhoc snapshot1818Hash".iord, "public adhoc snapshot1818Hash".mass, "public adhoc snapshot1818Hash".grp, "public adhoc snapshot1818Hash".type
           FROM "public:adhoc:snapshot1818Hash" "public adhoc snapshot1818Hash"
          WHERE ("public adhoc snapshot1818Hash".type = 'dark'::text OR "public adhoc snapshot1818Hash".type = 'gas'::text) AND "public adhoc snapshot1818Hash".grp = 4383
UNION 
         SELECT DISTINCT 2 AS timestep, p2.iord, p2.mass, p2.grp, p2.type
           FROM "public:adhoc:snapshot1818Hash" p1, "public:adhoc:snapshot1745Hash" p2
          WHERE p1.iord = p2.iord AND p2.grp <> (-1) AND (p1.type = 'dark'::text OR p1.type = 'gas'::text) AND p1.grp = 4383;
dhalperi commented 9 years ago

@jortiz16 dropped that view manually, and ran the 19 queries to overwrite all the snapshots you listed: https://demo.myria.cs.washington.edu/queries?max=69128&limit=19

dhalperi commented 9 years ago

FYI everyone:

I figured out where the problematic manually-created view was using this part of the error log:

Worker #12 failed..

Remembering the mapping from worker ID to machine, hard drive is:

So worker 12 is machine 14, disk 1.

jortiz16 commented 9 years ago

Okay, looking at my email history, this was back when Lee Lee and I were creating views to add as an optimization. Magda suggested this back when we were struggling to build the merger tree queries:

That view definition is us trying to get all the relevant particles from the present day snapshots.

And THANKS!! @dhalperi for deleting those relations for now. I can keep ingesting data for now.

mbalazin commented 9 years ago

Thanks Jenny for volunteering to implement this feature. It should be a good first task with Myria and you can ask others if you run into trouble.

Magdalena Balazinska Jean Loup Baer Professor of Computer Science and Engineering Associate Professor, Dept. of Computer Science & Engineering Director of the IGERT PhD Program in Big Data / Data Science Senior Data Science Fellow of the eScience Institute University of Washington

On Wed, Mar 25, 2015 at 12:42 PM, jortiz16 notifications@github.com wrote:

Okay, looking at my email history, this was back when Lee Lee and I were creating views to add as an optimization. Magda suggested this back when we were struggling to build the merger tree queries:

  • Add optimizations such as materialized views and study what types of queries can be interactive and which ones need to be computed in the background.

That view definition is us trying to get all the relevant particles from the present day snapshots.

And THANKS!! @dhalperi https://github.com/dhalperi for deleting those relations for now. I can keep ingesting data for now.

— Reply to this email directly or view it on GitHub https://github.com/uwescience/myria-web/issues/261#issuecomment-86188063 .

ljorr1 commented 9 years ago

I'll help with this. @jortiz16 and I are setting up MergerTree meetings and getting this implemented can be one of our goals.