tumblr / collins

groovy kind of love
tumblr.github.com/collins
Apache License 2.0

Delete Multiple attributes #271

Closed: solomongifford closed this issue 9 years ago

solomongifford commented 9 years ago

Is it possible to delete multiple attributes at the same time?

curl --basic -u blake:admin:first \
  -X DELETE \
  http://localhost:9000/api/asset/tumblrtag30/attribute/NODECLASS

This appears to only be able to delete a single attribute.

Something like the following makes sense:

curl --basic -u blake:admin:first \
  -X DELETE \
  http://localhost:9000/api/asset/tumblrtag30/attribute/NODECLASS,ATTR2,ATTR3

william-richard commented 9 years ago

Hi @solomongifford

That isn't currently possible, and I don't see us adding that functionality. I think that deleting one attribute at a time is sufficient, and it keeps error handling clearer and easier to understand. For example, if deleting one of the attributes fails, what error code should the request return? How would it indicate which deletes failed and which succeeded? If one delete fails, should they all fail, should it attempt as many as possible, or should it stop partway through?

Is there a specific case that you're running into where you need to be able to delete multiple attributes atomically?

solomongifford commented 9 years ago

Thanks for the quick reply.

We are looking to use collins in a scenario where we have 600-2000 custom tags (attributes) per machine. Specifically, think of collins storing which packages/versions are installed on a machine, in addition to other similar information. When automatic updates run, the versions of many packages change across the entire network, and those changes need to be reflected in collins.

If each machine has to be updated daily, batching the "diff" would be much quicker than looping through each. With thousands of machines in our network, this becomes a performance issue.
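
For reference, the workaround with the current API is to loop over the attributes and issue one DELETE per attribute, something like the sketch below (the asset tag, credentials, and attribute names are just the values from the example above):

# Delete several attributes from one asset, one request per attribute,
# using the existing single-attribute DELETE endpoint.
ASSET_TAG="tumblrtag30"
for attr in NODECLASS ATTR2 ATTR3; do
  curl --basic -u blake:admin:first \
    -X DELETE \
    "http://localhost:9000/api/asset/${ASSET_TAG}/attribute/${attr}"
done

With 600-2000 attributes per asset and thousands of assets, that is one HTTP request per attribute per machine, which is exactly the overhead batching would avoid.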

(I was going to ask the same question about adding tags - but assumed the answers would be similar.)

william-richard commented 9 years ago

No problem!

We do something similar for some of our systems, where the version of a package or a piece of software is stored as an attribute on the asset. We haven't seen many performance issues updating or deleting the attribute, though it isn't particularly fast. We have seen a lag with solr, where it takes time for solr to return those assets in relevant queries. For example, if you change SOFTWARE_VERSION from 1 to 2 on asset ABCD, it will take some time for solr to update its indexes, so a query for assets where SOFTWARE_VERSION = 2 may not return asset ABCD right away.
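
Roughly, the sequence looks like this; note that the attribute-update and asset-search endpoints in this sketch are assumptions about the Collins API rather than something shown earlier in the thread (only the single-attribute DELETE appears above):

# Set SOFTWARE_VERSION=2 on asset ABCD (assumed endpoint: POST /api/asset/:tag
# with attribute=NAME;VALUE).
curl --basic -u blake:admin:first \
  -X POST \
  --data-urlencode 'attribute=SOFTWARE_VERSION;2' \
  http://localhost:9000/api/asset/ABCD

# Immediately search for assets with SOFTWARE_VERSION=2 (assumed endpoint:
# GET /api/assets). ABCD may be missing until solr finishes reindexing.
curl --basic -u blake:admin:first \
  'http://localhost:9000/api/assets?attribute=SOFTWARE_VERSION%3B2'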

Also, keep in mind that collins is a dumb datastore. With that many attributes on each asset, it is going to be easy for things to get out of sync with what is actually installed. Collins is meant to drive automation, so for us, setting SOFTWARE_VERSION to 2 instructs something else to change the version of the software, rather than the other way around. The best source of truth for what is installed, for us, is always the machine itself.

If you are seeing significant issues updating that many tags on each asset, you might also want to consider combining the information into a JSON blob and storing that in a single attribute, instead of having a different attribute for each key-value pair. This reduces the number of queries you need to make, and it lets you update several key-value pairs atomically (that isn't something you asked for, but it does seem like it would be nice to have).
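
A rough sketch of the JSON-blob idea, using the same assumed attribute-update endpoint as in the sketch above (the PACKAGES attribute name is just an illustration):

# Store package-name -> version pairs as one JSON string in a single
# attribute, so a package update touches one attribute instead of hundreds.
curl --basic -u blake:admin:first \
  -X POST \
  --data-urlencode 'attribute=PACKAGES;{"openssl":"1.0.1k","bash":"4.3.30"}' \
  http://localhost:9000/api/asset/tumblrtag30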

Alternatively, we have also started playing with using Configuration assets to store a list of asset IDs matching certain criteria. Updates are then very fast (instead of changing an attribute on every asset, we change one attribute on one asset), and it bypasses the solr lag issue, since you do not need to go to solr to get the attributes associated with an asset. We aren't using this concept in any production systems yet, but results so far look promising.
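
A very rough sketch of that idea; the PUT-based asset creation, the CONFIGURATION type, and the MEMBER_ASSETS attribute name are assumptions for illustration, not an API documented in this thread:

# Create a configuration asset that represents "assets matching some criteria".
curl --basic -u blake:admin:first \
  -X PUT \
  -d type=CONFIGURATION \
  http://localhost:9000/api/asset/software-version-2

# Record the matching asset tags as one attribute on that one asset, so an
# update is a single write instead of a write per matching asset.
curl --basic -u blake:admin:first \
  -X POST \
  --data-urlencode 'attribute=MEMBER_ASSETS;tumblrtag30,tumblrtag31,tumblrtag32' \
  http://localhost:9000/api/asset/software-version-2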

> (I was going to ask the same question about adding tags - but assumed the answers would be similar.)

Yep, the answer is the same. :smile:

I hope that was helpful. If you do decide that you want this augmentation to the delete API, we always welcome PRs!

solomongifford commented 9 years ago

Thanks much for the quick reply. I've been running some Apache Benchmark (ab) tests today on a dataset of 20K items, each with 1600 tags. So far the results are promising.
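
For context, a sketch of the kind of ab run involved; the asset tag and request counts are placeholders, and ab's -A flag supplies the basic-auth credentials:

# 1000 GETs of a single asset's details, 10 at a time, with basic auth.
ab -n 1000 -c 10 -A blake:admin:first \
  http://localhost:9000/api/asset/tumblrtag30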

When you say there's a delay before solr updates, how much time are we talking about? Minutes? Hours?

william-richard commented 9 years ago

Glad to hear your benchmarks look good.

Solr takes seconds to minutes to update its indexes. It just isn't immediate, which can cause problems if you assume that it is immediate.
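
If a workflow really does need to read its own writes, something like the sketch below can wait out the lag; the asset-search endpoint and the grep pattern are assumptions for illustration, and the timeout is arbitrary:

# Poll the asset search until the updated asset shows up, or give up
# after roughly two minutes (24 attempts x 5 seconds).
for i in $(seq 1 24); do
  if curl -s --basic -u blake:admin:first \
       'http://localhost:9000/api/assets?attribute=SOFTWARE_VERSION%3B2' \
     | grep -q 'ABCD'; then
    echo "solr index caught up"
    break
  fi
  sleep 5
done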

solomongifford commented 9 years ago

Thanks.