Closed drj11 closed 10 years ago
My best guess at the moment is that if boxes are created in quick succession, they collide with each other.
If this is the problem, the solution will be to fix the locking mechanism.
I'll try and reproduce on cobalt dev.
meanwhile, better logging on this endpoint would be good (time and user for example); as would making it fail noisily if create_user
fails.
As for locking, I think the whole bit of editing the passwd file would be better handled with a single-purpose daemon that serialised access (in go or node.js). That would be my preferred fix for the locking mechanism.
haha, I was dreaming of doing that - and then decided against it. By channeling my inner you. Because it's yet another daemon which can fail and our monitoring still sucks.
Would such a daemon listen to events remotely, or would cobalt communicate to it over localhost?
I was thinking that it would live alongside cobalt, and that custard would communicate to it directly (or rather, with nginx in the way, as now).
Do you mean as a separate project, or living in cobalt/go/daemons
?
living inside cobalt somewhere. Because I can't see it being useful outside of cobalt.
Interesting observation, both the dataset itself and the view are missing users in the passwd file.
It could imply they collided with each other somehow.
Another example:
And the Twitter search tool:
https://scraperwiki.com/dataset/f5v7fgy https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/380903700
Note that Peter's observation applies to this case too. It seems quite likely!
Also note that there is lots of data in both datasets.
This user is a curious example. He made a code in browser tool and wrote some code, then started getting the error.
This leads me to believe the entries are being deleted after having been successfully made.
"Hi. I got this massive error message: "Trouble loading code! No passwd entry for user 'bckevyy'" The tool stopped working and i lost my scraper code..." https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/384398305
Yeah, so they have their database and code in /var/lib/cobalt/home/bckevyy
. But they have no entry in /var/lib/extrausers/passwd
or mount point under /jails
.
the mount point under /jails is unnecessary these days, it lives inside the mount namespace.
It's created here, which is inside the mount namespace, and therefore not visible on the host system. If there are left over jails, they are from a time before we had mount namespaces, or the mkdir is leaking to the system.
https://github.com/scraperwiki/salt-data/blob/master/salt/cobalt/pam_script_ses_open#L44
If directories are visible on the host system ordinarily, then the lack of one for this user is explained by the fact they can't login because there is no passwd entry for them, and then pam_script_ses_open doesn't run.
Are there mongo entries for these users? if so, we should just regenerate the passwd file to make these boxes work.
I've regenerated the passwd files on free, to fix it for now for users.
Instructions for another time here:
https://github.com/scraperwiki/intranet/wiki/Regenerate-the-passwd-shadow-files
I backed up the passwd file first. It turns out 8% of the entries were lost. As far as we can tell...
Wow. FWIW, I've made a start on the daemon, though I don't think it is the only reasonable solution.
However, there are a few advantages to doing it this way:
In a screen as frabcus (sudo inside the scren to root) on free server I'm running a script look-for-missing/do.sh once every minute.
It remakes the passwd file locally from mongo, and compares against the live one, logging any differences.
So we'll at least see when it changes and by how much....
So, something from time to time is removing parts of the passwd file in /var/lib/extrausers.
Not many places event call generate-extrafiles, or refer to /var/lib/extrausers. See this search https://github.com/search?q=user%3Ascraperwiki+extrausers&ref=cmdform&type=Code
This is the most suspicious one, in salt/cobalt/init.sls. It is not clear when exactly it runs.
scraperwiki-generate-extrafiles -boxName {{pillar.boxName}}:
cmd.wait:
- cwd: /var/lib/extrausers
- user: root
- group: root
- env:
CU_DB: {{pillar.cobalt.CU_DB}}
- watch:
- file: /var/lib/extrausers/passwd
- file: /var/lib/extrausers/shadow
- file: /var/lib/extrausers/group
- require:
- file: /usr/local/bin/scraperwiki-generate-extrafiles
Interestingly, this has picked up a missing row.
2014/04/30 11:02:56 Generated len(54350) passwd file (50268 deleted)
2014/04/30 11:04:01 Generated len(54352) passwd file (50268 deleted)
26954a26955
> bxfd7iq:x:260892495:10000::/home:/bin/bash
2014/04/30 11:05:05 Generated len(54352) passwd file (50268 deleted)
26954a26955
> bxfd7iq:x:260892495:10000::/home:/bin/bash
2014/04/30 11:06:10 Generated len(54352) passwd file (50268 deleted)
26954a26955
That box doesn't have a dataset in Mongo, just a box. The equivalent custard log segment is:
checkThisIsMyDataHub POST /api/zarrionwalker/datasets/b6lgrhy/views zarrionwalker zarrionwalker
server free-ec2.scraperwiki.com boxName bxfd7iq
err: 260892495 undefined null
BOX CREATE posting to https://free-ec2.scraperwiki.com/box/bxfd7iq
[SyntaxError: Unexpected token <]
Custard stopped
^[[37m^[[40mnpm^[[0m ^[[0m^[[30m^[[43mWARN^[[0m ^[[0m^[[35mpackage.json^[[0m backbone@1.0.0 No repository field.
^[[0m^[[37m^[[40mnpm^[[0m ^[[0m^[[30m^[[43mWARN^[[0m ^[[0m^[[35mpackage.json^[[0m connect-mongo@0.2.0 No repository field.
^[[0m
> custard@0.1.0 install /opt/custard
> node_modules/coffee-script/bin/cake build
So basically custard restarting left us a half-complete box.
This does not explain why we would lose passwd entries for already working boxes though, which is the subject of this bug.
Running overnight this is output now:
2014/05/01 05:09:11 Generated len(54529) passwd file (50679 deleted)
27095a27096
> bxfd7iq:x:260892495:10000::/home:/bin/bash
79950d79950
< fphwvxq:x:269700062:10000::/home:/bin/bash
One of those is from when custard was stopped, and the other David and I thought was a deleted dataset. With a fresh head, that assumption was wrong!
However, it isn't deleted! https://app.mongohq.com/mongohq-scraperwiki-com/mongo/cu-live-eu/collections/datasets/documents/o:52fe9000b3bfff7b2b0001a3
The mystery then is, why is scraperwiki-generate-extrafiles
not trying to write that entry from Mongo? It's a dataset from February, and it works completely fine right now in the user interface (because we haven't run scraperwiki-generate-extrafiles
to make live)
Ah! Interestingly the "deleted" box isn't deleted, it just has two entrys - WITH DIFFERENT UIDS - in the passwd file.
root@cobalt-f:~/look-for-missing# grep fphwvxq /var/lib/extrausers/passwd
fphwvxq:x:382163328:10000::/home:/bin/bash
fphwvxq:x:269700062:10000::/home:/bin/bash
They are on lines 67382 and 104834
Continuing, fphwvxq
in the boxes collection in Mongo has the earlier of the two uid.
uid: 382163328
Which, is also the uid that Unix thinks it has.
root@cobalt-f:~/look-for-missing# id fphwvxq
uid=382163328(fphwvxq) gid=10000(databox) groups=10000(databox)
And indeed the one that scraperwiki-generate-extrafiles
is generating. Hmmm.
From the birthday paradox, we expect 1e6 + 36**7*((36**7 - 1) / 36**7.)**1e6 - 36**7
~= 9 collisions in 100k boxnames if we're not checking for duplicates. I'll check soon to see if multiple users are ending up with the same box name.
I find no obvious evidence of checking for duplicates on custard side, where the box name is generated.
Maybe instead the box name should be generated cobalt side?
Crap, it's worse than that, we don't pick from the full 36 character namespace, we only actually pick from 1 in a billion. So we expect an average of 500 collisions. Or one for every 2000 boxes created.
Looking in the live mongo database, we have n box names == n unique names == 105224
.
Which seems to directly conflict with the fact that wc -l /var/lib/extrausers/passwd
== 105224 and it does contain the duplicate francis pointed at. :confounded:
Further observation, 269700062 doesn't exist at all.
Clarification edit: in the mongodb, that UID doesn't appear in the boxes list.
So /var/lib/extrausers/passwd
has exactly one duplicate that db.boxes
doesn't have, fphwvxq
. And db.boxes
has one user passwd
doesn't have: bxfd7iq
.
The duplicate could easily be explained by the birthday paradox. The missing dataset is harder.
Custard makes a request to cobalt:
In principle, custard doesn't write the dataset to mongo unless it receives a response from cobalt which doesn't contain the word "error". This is potentially wrong, instead it should be checking for 200 OK. What happens when there are gateway timeouts? I suppose the JSON.parse
may fail in that case, and then custard blows up.
But cobalt injects an error string only if it sees any stdout/stderr in the shell script it executes to do box creation. So if we have a silent fail (for example if anything in the &&
chain returns a non-zero exit status), then it could return an "error" free response, explaining this [drj: not true, I think; see below].
So there are at least two bug fixes here:
I think we should revisit the whole mechanism.
(I pretty much agree with everything already said).
(the design of the box database should be such that) the unix usernames should be unique across all cobalt instances, so that in principle we can teleport boxes from one instance to another. I think that means it's better if custard makes up the box names. Which means we keep the split where only custard talks to the database.
I quite like cobalt being pretty dumb and just doing what it's told.
Yes, except that cobalt should do local checks and reject if something is out of order, rather than allowing custard to continue in a broken state. [drj: yep]
(and custard should react to that correctly) [drj: yep]
I think if the process that cobalt starts returns with a non-zero status code then err
is set to something so this line of code will trigger:
return res.send {error: "Unable to create box"} if err? or any_stderr
I tested this yesterday, but here's a script I created just now to demonstrate what child_process.exec
does:
node -e 'require("child_process").exec("exit 99", function(err, stdout){console.log("err", err)})'
is has output:
err { [Error: Command failed: ] killed: false, code: 99, signal: null }
I kinda vaguely remembered this behaviour before. But it surprises me everytime.
Bottom line: I don't think the &&
chain is failing.
Requirement of fix:
@drj11 @pwaller and @frabcus collectively came to the following hypothesis:
A MongoDB unique key constraint means that a new box with the same boxname will fail to be saved to the table (this error is not checked); custard will continue to create a box on cobalt anyway. So this will appear as a unix username with two different uids.
A day later, and the live password file is still fine (just the two off-by-one dataset as noted above).
No sign of large numbers of them disappearing.
I still don't think we've any idea how they can go away - I think the output of https://trello.com/c/mSHdAtr0/25-fix-unix-user-missing-bug-1-point should probably be to tidy up logging. i.e. add it to error handling, make the messages tidier, maybe even use it as an excuse to deploy our general logging GUI.
I agree. This will help a lot.
We must also fix the error handling custard side and ideally make it so we only do one database write. We should also at least enlarge the keyspace to make collisions much more rare, or better, check for duplicates.
I think you should be one of the pair working on the Trello card :)
On Fri, May 02, 2014 at 01:09:35AM -0700, Peter Waller wrote:
I agree. This will help a lot.
We must also fix the error handling custard side and ideally make it so we only do one database write. We should also at least enlarge the keyspace to make collisions much more rare, or better, check for duplicates.
Reply to this email directly or view it on GitHub: https://github.com/scraperwiki/cobalt/issues/46#issuecomment-42001246
To confuse things further, there are 70,519
users in /opt/basejail/etc/passwd
, which should only contain system users
I've copied it over to /home/frabcus/passwd-backups/opt-basejail-etc-passwd
frabcus@cobalt-f:~/passwd-backups$ wc -l passwd
104557 passwd
frabcus@cobalt-f:~/passwd-backups$ cat passwd opt-basejail-etc-passwd | sort -u | wc -l
104702
So there are 114 boxes in the basejail passwd that aren't in passwd. (The other 31 are system ones).
Interestingly, 42 of those belong to deleted datasets 3 of them have {"error":"Unable to create box"}
as the boxJSON of the dataset and the rest (69 of them) don't obviously have anything wrong with them, so it is a strong mystery why they don't appear in the extrausers passwd file.
The mystery deepens.
fxwlw5a
is present in /opt/basejail/etc/passwd
, but not /var/lib/extrausers/passwd
. So ssh fxwlw5a@free
fails with no such user
in the auth log. It also shows HTTP 403 if you visit it, and you can't su to it. That's a view into the dataset di5lkqy
, which has the same problem.
So it turns out that these boxes are all missing from the boxes
collection.
(and hence don't find their way into the /var/lib/extrausers/passwd
).
It's a bit strange that this happens. Reading the code addDataset and addView code paths in custard it looks as though the box is created and then the dataset is created, so it's hard to see how the dataset can be created and the box not.
So in any case there exist datasets/boxes in an inconsistent state at the moment in mongodb. I wonder if
This hasn't happened again over the last 5 days. I've slowed my script that polls the passwd regeneration on a screen in ~frabcus on free, to run once every 10 minutes rather than every minute.
This bug is far too long, we have error logging, we are improving our 404, we are going to have sensu and we're going to add a check to sensu.
Closing.
Going back to the very original bug here, the 10% of users missing...
I note that generate-extrafiles doesn't seem to make a file of a new name and atomically move it into place.
passwd, err := os.Create("passwd")
So if something KILL'ed it (maybe salt can do that, possibly with a user interruption?) mid way through, it could leave a half written file.
@pwaller points out that the cat command spawned by cobalt could see a partial copy. This would happen if a user made a new box during an in-place run of generate-extrafiles.
flock -w 10 9 || exit 99
{ cat ${PASSWD_DIR}/passwd ; echo "$passwd_row" ; } > ${PASSWD_DIR}/passwd+
mv ${PASSWD_DIR}/passwd+ ${PASSWD_DIR}/passwd
To make extra files by hand, first export CU_DB then
scraperwiki-generate-extrafiles --boxName free-ec2.scraperwiki.com
Sometimes a box doesn't get its unix username in the passwd file. The result is bad for users, because they see lots of errors like: