scraperwiki / cobalt

QuickCode service to run code in a sandboxed Unix shell account
https://quickcode.io/
Other
9 stars 4 forks source link

No passwd entry for user #46

Closed drj11 closed 10 years ago

drj11 commented 10 years ago

Sometimes a box doesn't get its unix username in the passwd file. The result is bad for users, because they see lots of errors like:

My tweet scraping setups say "No passwd entry for user.." for a random username
pwaller commented 10 years ago

My best guess at the moment is that if boxes are created in quick succession, they collide with each other.

https://github.com/scraperwiki/cobalt/blob/bfb9fa9abdffea10a2512d5ad162a4df571a21f3/code/server.coffee#L317

If this is the problem, the solution will be to fix the locking mechanism.

I'll try and reproduce on cobalt dev.

drj11 commented 10 years ago

meanwhile, better logging on this endpoint would be good (time and user for example); as would making it fail noisily if create_user fails.

As for locking, I think the whole bit of editing the passwd file would be better handled with a single-purpose daemon that serialised access (in go or node.js). That would be my preferred fix for the locking mechanism.

pwaller commented 10 years ago

haha, I was dreaming of doing that - and then decided against it. By channeling my inner you. Because it's yet another daemon which can fail and our monitoring still sucks.

Would such a daemon listen to events remotely, or would cobalt communicate to it over localhost?

drj11 commented 10 years ago

I was thinking that it would live alongside cobalt, and that custard would communicate to it directly (or rather, with nginx in the way, as now).

pwaller commented 10 years ago

Do you mean as a separate project, or living in cobalt/go/daemons?

drj11 commented 10 years ago

living inside cobalt somewhere. Because I can't see it being useful outside of cobalt.

frabcus commented 10 years ago

An example:

image

https://scraperwiki.com/dataset/bckevyy https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/384398305

pwaller commented 10 years ago

Interesting observation, both the dataset itself and the view are missing users in the passwd file.

It could imply they collided with each other somehow.

frabcus commented 10 years ago

Another example:

image

And the Twitter search tool:

image

https://scraperwiki.com/dataset/f5v7fgy https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/380903700

frabcus commented 10 years ago

Note that Peter's observation applies to this case too. It seems quite likely!

Also note that there is lots of data in both datasets.

frabcus commented 10 years ago

This user is a curious example. He made a code in browser tool and wrote some code, then started getting the error.

This leads me to believe the entries are being deleted after having been successfully made.

"Hi. I got this massive error message: "Trouble loading code! No passwd entry for user 'bckevyy'" The tool stopped working and i lost my scraper code..." https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/384398305

image

Yeah, so they have their database and code in /var/lib/cobalt/home/bckevyy. But they have no entry in /var/lib/extrausers/passwd or mount point under /jails.

pwaller commented 10 years ago

the mount point under /jails is unnecessary these days, it lives inside the mount namespace.

It's created here, which is inside the mount namespace, and therefore not visible on the host system. If there are left over jails, they are from a time before we had mount namespaces, or the mkdir is leaking to the system.

https://github.com/scraperwiki/salt-data/blob/master/salt/cobalt/pam_script_ses_open#L44

If directories are visible on the host system ordinarily, then the lack of one for this user is explained by the fact they can't login because there is no passwd entry for them, and then pam_script_ses_open doesn't run.

pwaller commented 10 years ago

Are there mongo entries for these users? if so, we should just regenerate the passwd file to make these boxes work.

frabcus commented 10 years ago

I've regenerated the passwd files on free, to fix it for now for users.

Instructions for another time here:

https://github.com/scraperwiki/intranet/wiki/Regenerate-the-passwd-shadow-files

frabcus commented 10 years ago

I backed up the passwd file first. It turns out 8% of the entries were lost. As far as we can tell...

pwaller commented 10 years ago

Wow. FWIW, I've made a start on the daemon, though I don't think it is the only reasonable solution.

However, there are a few advantages to doing it this way:

frabcus commented 10 years ago

In a screen as frabcus (sudo inside the scren to root) on free server I'm running a script look-for-missing/do.sh once every minute.

It remakes the passwd file locally from mongo, and compares against the live one, logging any differences.

So we'll at least see when it changes and by how much....

frabcus commented 10 years ago

So, something from time to time is removing parts of the passwd file in /var/lib/extrausers.

Not many places event call generate-extrafiles, or refer to /var/lib/extrausers. See this search https://github.com/search?q=user%3Ascraperwiki+extrausers&ref=cmdform&type=Code

This is the most suspicious one, in salt/cobalt/init.sls. It is not clear when exactly it runs.

scraperwiki-generate-extrafiles -boxName {{pillar.boxName}}:
  cmd.wait:
    - cwd: /var/lib/extrausers
    - user: root
    - group: root
    - env:
        CU_DB: {{pillar.cobalt.CU_DB}}
    - watch:
       - file: /var/lib/extrausers/passwd
       - file: /var/lib/extrausers/shadow
       - file: /var/lib/extrausers/group
    - require:
      - file: /usr/local/bin/scraperwiki-generate-extrafiles
frabcus commented 10 years ago

Interestingly, this has picked up a missing row.

2014/04/30 11:02:56 Generated len(54350) passwd file (50268 deleted)
2014/04/30 11:04:01 Generated len(54352) passwd file (50268 deleted)
26954a26955
> bxfd7iq:x:260892495:10000::/home:/bin/bash
2014/04/30 11:05:05 Generated len(54352) passwd file (50268 deleted)
26954a26955
> bxfd7iq:x:260892495:10000::/home:/bin/bash
2014/04/30 11:06:10 Generated len(54352) passwd file (50268 deleted)
26954a26955

That box doesn't have a dataset in Mongo, just a box. The equivalent custard log segment is:

checkThisIsMyDataHub POST /api/zarrionwalker/datasets/b6lgrhy/views zarrionwalker zarrionwalker
server free-ec2.scraperwiki.com boxName bxfd7iq
err:  260892495 undefined null
BOX CREATE posting to https://free-ec2.scraperwiki.com/box/bxfd7iq
[SyntaxError: Unexpected token <]
Custard stopped
^[[37m^[[40mnpm^[[0m ^[[0m^[[30m^[[43mWARN^[[0m ^[[0m^[[35mpackage.json^[[0m backbone@1.0.0 No repository field.
^[[0m^[[37m^[[40mnpm^[[0m ^[[0m^[[30m^[[43mWARN^[[0m ^[[0m^[[35mpackage.json^[[0m connect-mongo@0.2.0 No repository field.
^[[0m
> custard@0.1.0 install /opt/custard
> node_modules/coffee-script/bin/cake build

So basically custard restarting left us a half-complete box.

This does not explain why we would lose passwd entries for already working boxes though, which is the subject of this bug.

frabcus commented 10 years ago

Running overnight this is output now:

2014/05/01 05:09:11 Generated len(54529) passwd file (50679 deleted)
27095a27096
> bxfd7iq:x:260892495:10000::/home:/bin/bash
79950d79950
< fphwvxq:x:269700062:10000::/home:/bin/bash

One of those is from when custard was stopped, and the other David and I thought was a deleted dataset. With a fresh head, that assumption was wrong!

However, it isn't deleted! https://app.mongohq.com/mongohq-scraperwiki-com/mongo/cu-live-eu/collections/datasets/documents/o:52fe9000b3bfff7b2b0001a3

The mystery then is, why is scraperwiki-generate-extrafiles not trying to write that entry from Mongo? It's a dataset from February, and it works completely fine right now in the user interface (because we haven't run scraperwiki-generate-extrafiles to make live)

frabcus commented 10 years ago

Ah! Interestingly the "deleted" box isn't deleted, it just has two entrys - WITH DIFFERENT UIDS - in the passwd file.

root@cobalt-f:~/look-for-missing# grep fphwvxq /var/lib/extrausers/passwd
fphwvxq:x:382163328:10000::/home:/bin/bash
fphwvxq:x:269700062:10000::/home:/bin/bash

They are on lines 67382 and 104834

frabcus commented 10 years ago

Continuing, fphwvxq in the boxes collection in Mongo has the earlier of the two uid.

uid: 382163328

Which, is also the uid that Unix thinks it has.

root@cobalt-f:~/look-for-missing# id fphwvxq
uid=382163328(fphwvxq) gid=10000(databox) groups=10000(databox)

And indeed the one that scraperwiki-generate-extrafiles is generating. Hmmm.

pwaller commented 10 years ago

From the birthday paradox, we expect 1e6 + 36**7*((36**7 - 1) / 36**7.)**1e6 - 36**7 ~= 9 collisions in 100k boxnames if we're not checking for duplicates. I'll check soon to see if multiple users are ending up with the same box name.

pwaller commented 10 years ago

I find no obvious evidence of checking for duplicates on custard side, where the box name is generated.

https://github.com/scraperwiki/custard/blob/99a812e84a99e249662fbc204dbdf10bb19ac1ae/server/code/model/box.coffee#L156

https://github.com/scraperwiki/custard/blob/99a812e84a99e249662fbc204dbdf10bb19ac1ae/server/code/model/box.coffee#L108

Maybe instead the box name should be generated cobalt side?

pwaller commented 10 years ago

Crap, it's worse than that, we don't pick from the full 36 character namespace, we only actually pick from 1 in a billion. So we expect an average of 500 collisions. Or one for every 2000 boxes created.

pwaller commented 10 years ago

Looking in the live mongo database, we have n box names == n unique names == 105224.

pwaller commented 10 years ago

Which seems to directly conflict with the fact that wc -l /var/lib/extrausers/passwd == 105224 and it does contain the duplicate francis pointed at. :confounded:

pwaller commented 10 years ago

Further observation, 269700062 doesn't exist at all.

Clarification edit: in the mongodb, that UID doesn't appear in the boxes list.

pwaller commented 10 years ago

So /var/lib/extrausers/passwd has exactly one duplicate that db.boxes doesn't have, fphwvxq. And db.boxes has one user passwd doesn't have: bxfd7iq.

pwaller commented 10 years ago

The duplicate could easily be explained by the birthday paradox. The missing dataset is harder.

Custard makes a request to cobalt:

https://github.com/scraperwiki/custard/blob/99a812e84a99e249662fbc204dbdf10bb19ac1ae/server/code/model/box.coffee#L142

In principle, custard doesn't write the dataset to mongo unless it receives a response from cobalt which doesn't contain the word "error". This is potentially wrong, instead it should be checking for 200 OK. What happens when there are gateway timeouts? I suppose the JSON.parse may fail in that case, and then custard blows up.

But cobalt injects an error string only if it sees any stdout/stderr in the shell script it executes to do box creation. So if we have a silent fail (for example if anything in the && chain returns a non-zero exit status), then it could return an "error" free response, explaining this [drj: not true, I think; see below].

So there are at least two bug fixes here:

I think we should revisit the whole mechanism.

drj11 commented 10 years ago

(I pretty much agree with everything already said).

(the design of the box database should be such that) the unix usernames should be unique across all cobalt instances, so that in principle we can teleport boxes from one instance to another. I think that means it's better if custard makes up the box names. Which means we keep the split where only custard talks to the database.

I quite like cobalt being pretty dumb and just doing what it's told.

pwaller commented 10 years ago

Yes, except that cobalt should do local checks and reject if something is out of order, rather than allowing custard to continue in a broken state. [drj: yep]

pwaller commented 10 years ago

(and custard should react to that correctly) [drj: yep]

drj11 commented 10 years ago

I think if the process that cobalt starts returns with a non-zero status code then err is set to something so this line of code will trigger:

return res.send {error: "Unable to create box"} if err? or any_stderr

I tested this yesterday, but here's a script I created just now to demonstrate what child_process.exec does:

node -e 'require("child_process").exec("exit 99", function(err, stdout){console.log("err", err)})'

is has output:

err { [Error: Command failed: ] killed: false, code: 99, signal: null }

I kinda vaguely remembered this behaviour before. But it surprises me everytime.

Bottom line: I don't think the && chain is failing.

pwaller commented 10 years ago

Requirement of fix:

drj11 commented 10 years ago

@drj11 @pwaller and @frabcus collectively came to the following hypothesis:

A MongoDB unique key constraint means that a new box with the same boxname will fail to be saved to the table (this error is not checked); custard will continue to create a box on cobalt anyway. So this will appear as a unix username with two different uids.

frabcus commented 10 years ago

A day later, and the live password file is still fine (just the two off-by-one dataset as noted above).

No sign of large numbers of them disappearing.

I still don't think we've any idea how they can go away - I think the output of https://trello.com/c/mSHdAtr0/25-fix-unix-user-missing-bug-1-point should probably be to tidy up logging. i.e. add it to error handling, make the messages tidier, maybe even use it as an excuse to deploy our general logging GUI.

pwaller commented 10 years ago

I agree. This will help a lot.

We must also fix the error handling custard side and ideally make it so we only do one database write. We should also at least enlarge the keyspace to make collisions much more rare, or better, check for duplicates.

frabcus commented 10 years ago

I think you should be one of the pair working on the Trello card :)

On Fri, May 02, 2014 at 01:09:35AM -0700, Peter Waller wrote:

I agree. This will help a lot.

We must also fix the error handling custard side and ideally make it so we only do one database write. We should also at least enlarge the keyspace to make collisions much more rare, or better, check for duplicates.


Reply to this email directly or view it on GitHub: https://github.com/scraperwiki/cobalt/issues/46#issuecomment-42001246

pwaller commented 10 years ago

To confuse things further, there are 70,519 users in /opt/basejail/etc/passwd, which should only contain system users

pwaller commented 10 years ago

I've copied it over to /home/frabcus/passwd-backups/opt-basejail-etc-passwd

frabcus@cobalt-f:~/passwd-backups$ wc -l passwd
104557 passwd
frabcus@cobalt-f:~/passwd-backups$ cat passwd opt-basejail-etc-passwd | sort -u | wc -l
104702

So there are 114 boxes in the basejail passwd that aren't in passwd. (The other 31 are system ones).

Interestingly, 42 of those belong to deleted datasets 3 of them have {"error":"Unable to create box"} as the boxJSON of the dataset and the rest (69 of them) don't obviously have anything wrong with them, so it is a strong mystery why they don't appear in the extrausers passwd file.

pwaller commented 10 years ago

The mystery deepens.

fxwlw5a is present in /opt/basejail/etc/passwd, but not /var/lib/extrausers/passwd. So ssh fxwlw5a@free fails with no such user in the auth log. It also shows HTTP 403 if you visit it, and you can't su to it. That's a view into the dataset di5lkqy, which has the same problem.

pwaller commented 10 years ago

So it turns out that these boxes are all missing from the boxes collection.

pwaller commented 10 years ago

(and hence don't find their way into the /var/lib/extrausers/passwd).

It's a bit strange that this happens. Reading the code addDataset and addView code paths in custard it looks as though the box is created and then the dataset is created, so it's hard to see how the dataset can be created and the box not.

pwaller commented 10 years ago

So in any case there exist datasets/boxes in an inconsistent state at the moment in mongodb. I wonder if

frabcus commented 10 years ago

This hasn't happened again over the last 5 days. I've slowed my script that polls the passwd regeneration on a screen in ~frabcus on free, to run once every 10 minutes rather than every minute.

drj11 commented 10 years ago

This bug is far too long, we have error logging, we are improving our 404, we are going to have sensu and we're going to add a check to sensu.

Closing.

frabcus commented 10 years ago

Going back to the very original bug here, the 10% of users missing...

I note that generate-extrafiles doesn't seem to make a file of a new name and atomically move it into place.

    passwd, err := os.Create("passwd")

So if something KILL'ed it (maybe salt can do that, possibly with a user interruption?) mid way through, it could leave a half written file.

frabcus commented 10 years ago

@pwaller points out that the cat command spawned by cobalt could see a partial copy. This would happen if a user made a new box during an in-place run of generate-extrafiles.

    flock -w 10 9 || exit 99
    { cat ${PASSWD_DIR}/passwd ; echo "$passwd_row" ; } > ${PASSWD_DIR}/passwd+
    mv ${PASSWD_DIR}/passwd+ ${PASSWD_DIR}/passwd
frabcus commented 10 years ago

To make extra files by hand, first export CU_DB then

scraperwiki-generate-extrafiles --boxName free-ec2.scraperwiki.com