nlplab / brat

brat rapid annotation tool (brat) - for all your textual annotation needs
http://brat.nlplab.org

robots.txt format not appropriate for ACL #990

Open amadanmath opened 11 years ago

amadanmath commented 11 years ago

There are two issues with appropriating the robots.txt format for our access control, one for each type of line allowed in robots.txt.

The first is that the Disallow line's content is not really relevant, since we define acl.conf per collection anyway. Thus it will always be Disallow: / (it could instead hold the path of the collection, but that is not really practical, since one would have to change it whenever the collection is renamed or moved). A line whose content carries no information should not be required; but it obviously is required if we use an off-the-shelf robots.txt parser.

The more serious problem concerns the User-agent line: it matches substrings, and it is case-insensitive. Thus, if we intend to allow only the user "ted" in a certain collection using

User-agent: ted
Disallow:

User-agent: *
Disallow: /

it will also allow users "stedman" and "PeteD".

(Given how our default debugging username is "e", I had a fun morning trying to find out why I couldn't block the un-logged-in user. I was at the point where I started thinking it might be a Python bug. Then I realised that if I write User-agent: e, it also allows "guest", on account of "guest" containing an "e".)
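For the record, here is a minimal demonstration of that behaviour, assuming Python's stock robotparser module (whether brat goes through exactly this code path or its own parser, the matching rules described above are the same):

# Demonstrates the substring-based, case-insensitive User-agent
# matching; assumes Python's standard robotparser, not brat's code.
try:
    from urllib import robotparser  # Python 3
except ImportError:
    import robotparser              # Python 2

ACL = """\
User-agent: ted
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ACL.splitlines())

for user in ("ted", "stedman", "PeteD", "e", "guest"):
    print(user, rp.can_fetch(user, "/"))
# "ted", "stedman" and "PeteD" are all allowed; with User-agent: e in
# place of "ted", "guest" would be allowed as well.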

Thus, I submit that a custom file format would be better. Maybe just a list of allowed usernames, one per line. If the file does not exist, then the collection is open for all.

I believe nothing more complex should be needed at the resolution we're looking at. If we disregard collection hierarchy (which is what the Disallow line handles in robots.txt, and which we handle by putting the file in the appropriate directory), we only have one dimension: user names. If we had user roles or user groups, we might need both "allow" and "disallow" functions, but given that we can only target users individually, listing them individually in a file is all that should be needed.
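To make the proposal concrete, here is a minimal sketch of what the check could look like (the file name and function name are illustrative, not existing brat code):

# Hypothetical sketch of the proposed per-collection whitelist check;
# the file name and function name are illustrative only.
import os

def user_allowed(collection_dir, username):
    acl_path = os.path.join(collection_dir, 'acl.conf')
    # No file: the collection is open for all.
    if not os.path.exists(acl_path):
        return True
    with open(acl_path) as f:
        allowed = set(line.strip() for line in f if line.strip())
    # Exact match only: no substring or case surprises.
    return username in allowed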

I know this suggestion breaks compatibility, but I don't think we've heard from many users utilising acl.conf; also, AFAIK it is not documented at all, apart from a comment in configurations/acl.conf, so we should be able to switch it out without too much fuss.

spyysalo commented 11 years ago

@amadanmath: good points; I hadn't realized those restrictions on how robots.txt parsing operates.

We actually have a simple pseudo-robots.txt parser/permission checker in the codebase already, in https://github.com/nlplab/brat/blob/master/standalone.py . I think this could be fairly trivially adapted for use in user management as well, if we want to keep (much of) the syntax.

ned2 commented 11 years ago

As a user who was looking for this functionality, just thought I'd throw my 2 cents in...

I agree with @amadanmath: a file listing usernames, one per line, as a whitelist for access to specific collections would be great. The file (if it exists) could just sit in the relevant collection directory itself. Then it would just be a file of usernames, with no parsing involved. Intuitive, simple to implement, super simple to use.

ghost commented 11 years ago

I agree entirely with @humblecoffee on this one. Only one minor addition: allow for both blacklisting and whitelisting. I can't settle on a suggested syntax, though.

spyysalo commented 11 years ago

I suggest retaining as much of the current (robots.txt) syntax as possible, to reduce disruption for users of this feature.

ned2 commented 11 years ago

The simplest way to do it would be to have two separate files, e.g. 'whitelist.conf' and 'blacklist.conf'.

Then you don't have to come up with a new syntax, and the two systems can live alongside each other; we could document only the new approach and, down the line, phase out the acl.conf approach.
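A rough sketch of how the two files could interact (the file names come from the suggestion above; the precedence rules are just one possible choice, not an agreed design):

# Sketch of the two-file scheme; whitelist.conf and blacklist.conf are
# the names suggested above, and the precedence is one possible choice.
import os

def read_names(path):
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def user_allowed(collection_dir, username):
    # A blacklisted user is rejected outright.
    blacklist = read_names(os.path.join(collection_dir, 'blacklist.conf'))
    if blacklist is not None and username in blacklist:
        return False
    # If a whitelist exists, only the users it names get in.
    whitelist = read_names(os.path.join(collection_dir, 'whitelist.conf'))
    if whitelist is not None:
        return username in whitelist
    # Neither file exists: the collection is open for all.
    return True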