jhuckaby / Cronicle

A simple, distributed task scheduler and runner with a web based UI.
http://cronicle.net

I thought I had messed up with S3 (via MinIO) and clobbered something in replication... #618

Open jhuckaby opened 1 year ago

jhuckaby commented 1 year ago

From @care2DavidDPD:

I thought I had messed up with S3 (via MinIO) and clobbered something in replication. This is also a cluster of 6 machines, across two sites: San Jose (SJC1) and Reston, VA (IAD1). For example, I have this job, which creates a static dump/backup/snapshot of my IPAM/DCIM/OpsDB into a file on the NFS server:

[Screenshot 2023-07-07 at 4:04:56 PM]

When I pull up its history ... it's blank ...

[Screenshot 2023-07-07 at 4:07:38 PM]

However, the job I manually ran (by Job ID) is there.

[Screenshot 2023-07-07 at 4:08:40 PM]

The log file is on both MinIO instances:

```
[ pts/4 iad1 nas1:/sr/minio/cronicle/jobs ]
[ dpd ] > md5 jljt65wn90g.json
MD5 (jljt65wn90g.json) = 161e70ac76d0cfc27c5abf4f95357f08

[ pts/0 sjc1 nas1:/sr/minio/cronicle/jobs ]
[ dpd ] > md5 jljt65wn90g.json
MD5 (jljt65wn90g.json) = 161e70ac76d0cfc27c5abf4f95357f08
```

... and the history for this event ID is actually on disk ... and looks to be mostly in sync ...

```
[ pts/4 iad1 nas1:/sr/minio/cronicle/jobs ]
[ dpd ] > find . -name "*.json" | xargs -I % grep -L elfw2dwqq53 % | wc
   52747   52747 1002193

[ pts/0 sjc1 nas1:/sr/minio/cronicle/jobs ]
[ dpd ] > find . -name "*.json" | xargs -I % grep -L elfw2dwqq53 % | wc
   52747   52747 1002193
```

It seems to me that some sort of index file is missing and not getting updated ... I just don't know where in the code to look. The "All Completed Jobs" view at https://cronicle/#History?sub=history is also empty ... however, jobs that run less frequently still seem to display their histories.

Originally posted by @care2DavidDPD in https://github.com/jhuckaby/Cronicle/issues/613#issuecomment-1626352903

jhuckaby commented 1 year ago

This looks like classic corruption of the "database" files on disk, which can happen if you use a non-AWS S3 provider. The S3 provider must offer "immediate consistency", which AWS S3 itself only added recently (in 2019, I think). I'm not sure if your MinIO installation has it, but that's a classic cause of issues like this.

The other cause could be a crash or sudden power loss of the master node during writes. Cronicle was designed years before I added full transaction support to my storage system, so it is still susceptible to this. It will be fixed in Cronicle v2.

jhuckaby commented 1 year ago

To fix your issue, try an export, followed by a wipe of all data, then an import. Note that this will lose your job history, as the exported data only includes vital essentials like users, schedule, categories, plugins, etc.

https://github.com/jhuckaby/Cronicle/blob/master/docs/CommandLine.md#data-import-and-export
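For reference, a rough sketch of that procedure, assuming a default `/opt/cronicle` install and a master node; the exact commands and flags for your setup are in the linked CommandLine.md doc:

```sh
# Stop the master server first, then export the vital data
# (users, schedule, categories, plugins -- NOT job history).
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/control.sh export /tmp/cronicle-backup.txt --verbose

# Wipe the storage between these steps (for S3/MinIO, empty the
# bucket), then re-initialize, import the backup, and restart.
/opt/cronicle/bin/control.sh setup
/opt/cronicle/bin/control.sh import /tmp/cronicle-backup.txt
/opt/cronicle/bin/control.sh start
```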

care2DavidDPD commented 1 year ago

> To fix your issue, try an export, followed by a wipe of all data, then an import. Note that this will lose your job history, as the exported data only includes vital essentials like users, schedule, categories, plugins, etc.

Well, since the job history is kind of messed up as it is, that's not a huge deal, and given the MinIO version I'm on, the next MinIO update looks like it might require an object-level migration anyway. However, is there any way to rebuild the job history, this "database"?

If not, can you point me to the code responsible for generating the job history? I want to see how hard it would be to make a fix-it tool. It's not really important for this incident, but I'm looking toward the future, when there will be job history I want to maintain.

jhuckaby commented 1 year ago

Sure, the job history is a "list" which is created and managed by my pixl-server-storage module. See this doc specifically which explains how lists work:

https://github.com/jhuckaby/pixl-server-storage/blob/master/docs/Lists.md

There's one list which is the global history (all completed jobs), and then each event has its own history (separate lists for each).

The global event history is located at logs/completed, and the event histories are at logs/events/EVENT_ID.

The Cronicle code which calls to the storage API to append to the lists is here: https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1328-L1334
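To make the layout concrete: per the Lists.md doc, a list is a small header record plus numbered page records stored under the list's path. A minimal sketch (field names assumed from that doc) of locating the page that holds a given item:

```python
def page_key_for_index(list_path, first_page, page_size, index):
    """Return the storage key of the page holding list item `index`.

    A pixl-server-storage list is a header record (e.g. 'logs/completed')
    plus numbered page records ('logs/completed/0', 'logs/completed/1', ...).
    `first_page` and `page_size` come from the list's header record.
    """
    page = first_page + (index // page_size)
    return f"{list_path}/{page}"

# With 50 items per page, item 120 of the global history
# lands on page 2:
key = page_key_for_index("logs/completed", 0, 50, 120)
```

This also explains the symptom above: if the header record is corrupted or missing, all of its pages are orphaned, so the job data can exist on disk while the UI shows an empty history.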

jhuckaby commented 1 year ago

I should add that enabling Transactions at the storage level will probably help with these issues, as it will rollback to a known good state for crashes and/or power loss. But I have never tested transactions with Cronicle (it predates them by many years).

That being said, the most important thing by far is immediate consistency in the underlying storage provider. A read after a write MUST return the latest data for each record. I have no idea whether these 3rd-party S3 clones (MinIO, etc.) support this. It is a hard requirement for Cronicle.

Update: According to this doc MinIO supports immediate consistency, so in that case I have absolutely no idea how your data became corrupted.
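The requirement can be expressed as a simple probe: write a fresh key, read it straight back, and demand the new value. A sketch using a hypothetical `put`/`get` client interface (a real check would go through an S3 SDK against your endpoint):

```python
import uuid

def check_read_after_write(client, bucket):
    """Probe a storage backend for read-after-write consistency.

    Writes a unique value to a fresh key, reads it straight back, and
    reports whether the latest data came back. `client` is any object
    with put(bucket, key, value) / get(bucket, key) -- a hypothetical
    stand-in for a real S3 client.
    """
    key = f"consistency-probe/{uuid.uuid4()}"
    value = uuid.uuid4().hex
    client.put(bucket, key, value)
    return client.get(bucket, key) == value

# Stand-in backend for illustration: a plain dict is trivially consistent.
class DictStore:
    def __init__(self):
        self.data = {}
    def put(self, bucket, key, value):
        self.data[(bucket, key)] = value
    def get(self, bucket, key):
        return self.data.get((bucket, key))

ok = check_read_after_write(DictStore(), "cronicle")
```

An eventually consistent backend can pass this probe sometimes and fail it under load, which is exactly what makes these corruption issues intermittent and hard to diagnose.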

jhuckaby commented 1 year ago

Cronicle v0.9.24 was just released, which ships with transactions enabled by default (however you'll need to manually enable them for existing installs), as well as a new storage repair script.

Check out the updated Troubleshooting Wiki for details and instructions.

mikeTWC1984 commented 1 year ago

@care2DavidDPD Are you still having this issue? Is it that you can see the logs in MinIO, but not in the UI?

mikeTWC1984 commented 1 year ago

It's likely unrelated, but I had a similar issue with the UI. For some weird reason, some Cronicle API endpoints were failing after a timeout. I was able to fix it by restarting my machine. It wasn't obvious in the browser initially. You can try to query your logs with the storage-cli tool and see what kind of error you get:

```
bin/storage-cli.js list_get logs/completed 0 5
bin/storage-cli.js list_get logs/events/ellu93fdo01 0 5
```

(0 is the offset, 5 is the limit)
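In other words, `list_get` behaves like a slice over the list; a tiny sketch of the assumed offset/limit semantics:

```python
def list_get(items, offset, limit):
    """Mimic the assumed offset/limit semantics of `list_get`:
    return up to `limit` items starting at `offset`."""
    return items[offset:offset + limit]

# Illustrative stand-in for a job-history list:
jobs = [f"job_{i}" for i in range(10)]
first_five = list_get(jobs, 0, 5)
```

If the list header is intact, this should print rows; if the index is corrupted, the command errors out even though the job JSON files still exist on disk.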

care2DavidDPD commented 1 year ago

> Cronicle v0.9.24 was just released, which ships with transactions enabled by default (however you'll need to manually enable them for existing installs), as well as a new storage repair script.
>
> Check out the updated Troubleshooting Wiki for details and instructions.

Sorry for the long delay; I'm finally getting around to trying this. However, is `Storage.AWS.hostPrefixEnabled = false` not being respected in storage-repair.js? Actually, all of Cronicle seems to be ignoring this setting (it's part of the aws npm library)?

```
[ pts/0 iad1 cron1:/opt/cronicle ]
[ dpd ] > sudo /opt/cronicle/bin/storage-repair.js --dryrun --echo

Cronicle Storage Repair Script v1.0.0 starting up
[1693893478.05][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][][debug][1][Cronicle Storage Repair Script v1.0.0 starting up][]
Starting storage engine
[1693893478.053][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][][debug][2][Starting storage engine][]
[1693893478.056][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][Storage][debug][2][Setting up storage system v3.1.15][]
[1693893478.807][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][debug][2][Setting up Amazon S3 (care2)][]
[1693893478.808][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][debug][3][S3 Bucket ID: cronicle][]
Storage engine is ready to go
[1693893478.819][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][debug][2][Storage engine is ready to go][]
[1693893478.82][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][debug][3][Testing storage engine][]
[1693893478.823][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][debug][9][Fetching S3 Object: global/users][]
[1693893478.906][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][error][s3][Failed to fetch key: global/users: getaddrinfo ENOTFOUND cronicle.nas1.iad1.care2.com][{"errno":-3008,"code":"ENOTFOUND","syscall":"getaddrinfo","hostname":"cronicle.nas1.iad1.care2.com","$metadata":{"attempts":1,"totalRetryDelay":0}}]
[1693893478.908][2023-09-05 05:57:58][cron1.iad1.care2.com][336856][S3][error][fatal][Storage test failure: Error: getaddrinfo ENOTFOUND cronicle.nas1.iad1.care2.com][]

ERROR: Storage test failure: Error: getaddrinfo ENOTFOUND cronicle.nas1.iad1.care2.com
```

I can't really work around this with a CNAME or /etc/hosts entry because of the SSL certificates ... though I could issue new certificates, certificate management is already messy enough. For reference, the certificate error:

```
ERROR: Storage test failure: Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: Host: cronicle.nas1.iad1.care2.com. is not in the cert's altnames: DNS:*.corp.care2.com, DNS:*.iad1.care2.com, DNS:*.qae1.care2.com, DNS:*.qaw1.care2.com, DNS:*.sjc1.care2.com, DNS:*.snv1.care2.com, DNS:corp.care2.com
```
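For anyone hitting the same wall: the errors above come from virtual-hosted-style addressing, where the SDK prefixes the bucket name onto the endpoint hostname. A sketch of the two addressing styles (hostnames taken from the logs above, purely for illustration):

```python
def s3_url(endpoint, bucket, key, path_style=False):
    """Build an S3 object URL in either addressing style.

    Virtual-hosted style prefixes the bucket onto the hostname --
    the behavior behind the ENOTFOUND and TLS-altname errors above,
    since 'cronicle.nas1...' then needs its own DNS record and a
    matching certificate.  Path-style keeps the bucket in the path.
    """
    if path_style:
        return f"https://{endpoint}/{bucket}/{key}"
    return f"https://{bucket}.{endpoint}/{key}"

vh = s3_url("nas1.iad1.care2.com", "cronicle", "global/users")
ps = s3_url("nas1.iad1.care2.com", "cronicle", "global/users", path_style=True)
```

In aws-sdk v3, the path-style form corresponds to the S3 client's `forcePathStyle: true` option, which avoids needing per-bucket DNS records and certificates.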
care2DavidDPD commented 1 year ago

> Sorry for the long delay; I'm finally getting around to trying this. However, is `Storage.AWS.hostPrefixEnabled = false` not being respected in storage-repair.js? Actually, all of Cronicle seems to be ignoring this setting (it's part of the aws npm library)?

I worked around this, issued new certificates, etc., and was able to run the storage repair, and it does seem to have fixed up the logs ... I'll have to look at them more during daylight hours, but it looks good.

But I really do need `hostPrefixEnabled = false` to be respected ... it's unclear why it isn't ... I even bumped aws-sdk.

jhuckaby commented 1 year ago

Hmmm, I don't know anything about hostPrefixEnabled, but it looks like AWS deprecated it in v3:

https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/preview/migrating/notable-changes/

care2DavidDPD commented 1 year ago

> Hmmm, I don't know anything about hostPrefixEnabled, but it looks like AWS deprecated it in v3:
>
> https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/preview/migrating/notable-changes/

Ugh. That sucks. Yeah, this is beyond Cronicle ... it's a MinIO & aws-sdk issue ... I've worked around it, but for those who land here: for MinIO this means you must set `MINIO_DOMAIN=...` in the server environment to the base domain of your cluster, and then for each bucket you create, you'll need to add a CNAME to your DNS. Since aws-sdk is most likely the way everyone accesses MinIO, if you're deploying your own object store you'll need some sort of additional "orchestration" layer to provision users, buckets, and DNS for your MinIO object store (since DNS is not a system or a config within MinIO, it's usually another discrete system).

EDIT: Though I haven't looked at it in detail, and it's unlikely to be a complete drop-in, a Cronicle-side fix would be to update pixl-server-storage with a MinIO driver. I suspect this would allow a hostname without the prefixed bucket name. This isn't urgent for me, as it's unlikely we'll be creating MinIO buckets dynamically or even frequently, so the extra CNAME step just annoys me. (This will likely come up in other aws-sdk languages too, and then in other tools like s3fs-fuse.)

https://min.io/docs/minio/linux/developers/javascript/minio-javascript.html
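To summarize the workaround as config, a sketch with illustrative values (your base domain and bucket names will differ):

```
# MinIO server environment (e.g. /etc/default/minio):
# enable virtual-host-style bucket addressing for this base domain
MINIO_DOMAIN=nas1.iad1.care2.com

# DNS: one record per bucket, pointing at the MinIO host, e.g.
# cronicle.nas1.iad1.care2.com.  CNAME  nas1.iad1.care2.com.
```

Remember that the TLS certificate must also cover each `bucket.basedomain` name (or use a wildcard for the base domain).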