Deleted 91+day-old shards. Stats now steadily tanking to the "Abyssssss"...."sss"

Hypocritus commented 6 years ago

Ubuntu 16.04 storjshare-daemon 5.3.1 1.20 Node 8.xx.xx (latest LTS)

Having discovered shard data 4+ months old, and knowing that according to several sources the reaper should be deleting shards more than 90 days old each day, I thought something must be wrong with the nodes showing a lot of shards older than 90 days. Last night I set a routine to delete shards older than "91" days.

Now my logs are complaining of corruption and missing shards, and my stats are steadily plummeting to the Nether Region of Storjdom even though my nodes are 1) up and running, 2) have gone through the reaper since having deleted the 91+day shards, and 3) have had kfs ... compact run on them. I have verified that "newer than 91-day" shards were not removed.

My nodes haven't been missing a beat during the last few days. Why am I being penalized for having deleted 91+day shards?

What can be done about this, aside from resurrecting the now-cremated 91+day shards?

{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/045.s/000347.ldb","timestamp":"2018-08-17T11:40:20.521Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/147.s/000331.ldb","timestamp":"2018-08-17T12:40:18.381Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 86 missing files; e.g.: /STORJ/node01/sharddata.kfs/238.s/000083.ldb","timestamp":"2018-08-17T12:46:36.040Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 104 missing files; e.g.: /STORJ/node01/sharddata.kfs/131.s/000231.ldb","timestamp":"2018-08-17T13:34:24.719Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/199.s/000053.ldb","timestamp":"2018-08-17T13:57:24.568Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/177.s/000335.ldb","timestamp":"2018-08-17T14:27:39.271Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/105.s/000085.ldb","timestamp":"2018-08-17T14:53:43.017Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 133 missing files; e.g.: /STORJ/node01/sharddata.kfs/213.s/000213.ldb","timestamp":"2018-08-17T16:43:49.808Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 6 missing files; e.g.: /STORJ/node01/sharddata.kfs/043.s/000095.ldb","timestamp":"2018-08-17T17:20:05.194Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/021.s/000055.ldb","timestamp":"2018-08-17T18:23:12.926Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 5 missing files; e.g.: /STORJ/node01/sharddata.kfs/015.s/000035.ldb","timestamp":"2018-08-17T19:26:43.653Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 3 missing files; e.g.: /STORJ/node01/sharddata.kfs/227.s/000087.ldb","timestamp":"2018-08-17T20:09:28.041Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 7 missing files; e.g.: /STORJ/node01/sharddata.kfs/096.s/000077.ldb","timestamp":"2018-08-17T20:26:18.414Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/136.s/000093.ldb","timestamp":"2018-08-17T20:56:09.431Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 85 missing files; e.g.: /STORJ/node01/sharddata.kfs/158.s/000107.ldb","timestamp":"2018-08-17T21:01:58.082Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 3 missing files; e.g.: /STORJ/node01/sharddata.kfs/252.s/000297.ldb","timestamp":"2018-08-17T21:18:42.457Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 85 missing files; e.g.: /STORJ/node01/sharddata.kfs/133.s/000021.ldb","timestamp":"2018-08-17T22:35:49.621Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 3 missing files; e.g.: /STORJ/node01/sharddata.kfs/171.s/000263.ldb","timestamp":"2018-08-17T22:44:38.165Z"}
{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/223.s/000085.ldb","timestamp":"2018-08-17T22:58:02.523Z"}
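The log lines above can be tallied per database directory with a short script. This is only a minimal sketch: the function name `summarize_missing` and the regular expression are assumptions based on the message format shown in this excerpt, not part of storjshare or kfs.

```python
import json
import re
from collections import Counter

# Matches the message text seen in the log excerpt above (assumed format).
MISSING_RE = re.compile(
    r"Corruption: (\d+) missing files; e\.g\.: \S+?/sharddata\.kfs/(\d{3})\.s/\S+"
)

def summarize_missing(log_lines):
    """Return {directory number: largest missing-file count reported for it}."""
    worst = Counter()
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        match = MISSING_RE.search(str(entry.get("message", "")))
        if match:
            count, bucket = int(match.group(1)), match.group(2)
            worst[bucket] = max(worst[bucket], count)
    return dict(worst)

# Two lines copied from the excerpt above.
sample = [
    '{"level":"error","message":"Could not get usedSpace: Corruption: 1 missing files; e.g.: /STORJ/node01/sharddata.kfs/045.s/000347.ldb","timestamp":"2018-08-17T11:40:20.521Z"}',
    '{"level":"error","message":"Could not get usedSpace: Corruption: 86 missing files; e.g.: /STORJ/node01/sharddata.kfs/238.s/000083.ldb","timestamp":"2018-08-17T12:46:36.040Z"}',
]
print(summarize_missing(sample))
```

Feeding it the whole log file (one entry per line) gives a quick picture of which `NNN.s` directories report the most missing files.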

Hypocritus commented 6 years ago

from another node:

Error: IO error: /ztank002/media/.../8TB005/STORJ3/sharddata.kfs/103.s/006683.ldb: No such file or directory

AlexeyALeonov commented 6 years ago

Hi! Please check your drive with fsck. The node should be stopped while you do this. Also, make sure that your external HDD has enough power. They usually require an external power supply, at least via a second USB connector.

If this drive is internal, please check its S.M.A.R.T. status. It may be dying.

After checking the file system you can try to recover your node:

1) npm install -g kfs@v3.0.0
2) Stop the node
3) kfs -d /STORJ/node01/sharddata.kfs compact

When it has finished, you can try to start your node. If it is still corrupted, it is better to remove the node, its config, and its data, and create a new one.

And please, NEVER delete the database files yourself; let the system do it in the RIGHT manner. This is a database, not a bunch of simple file chunks.

littleskunk commented 6 years ago

Duplicate: https://github.com/storj/kfs/issues/55

Hypocritus commented 6 years ago

@AlexeyALeonov,

Ok, performing that now...

Hypocritus commented 6 years ago

@AlexeyALeonov,

Your instructions are indeed working. YAAAYY!! My problem was that I was running kfs ... compact without stopping the node because the process wouldn't complain and it appeared too inconvenient to stop the node for such a long time.

And please, NEVER delete the database files yourself; let the system do it in the RIGHT manner.

I deleted the 4+ month old shards because 'the system' was not removing the shards. In the future, how should I handle this situation?

(I would like to leave some notes for anyone else dealing with this issue)

1) Do NOT run kfs -d /path/to/sharddata.kfs compact while your node is still running. Follow Alexey's instructions above. Even though kfs ... compact will run without complaining on a running node, you will have wasted all that time, tied up your resources, and lowered your stats even more. This has happened to me many times because I was too impatient to stop the node and run kfs ... compact the proper way.
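As a sanity check before compacting, one could look for processes that still hold files open under the sharddata.kfs folder. The sketch below is Linux-only and rests on an assumption: that a running node keeps at least one file open there while it is working. The helper name `pids_with_open_files` is made up for this example. An empty result is not proof that the node is stopped, so this does not replace stopping the node first.

```python
import os

def pids_with_open_files(directory):
    """Return the PIDs that hold an open file under `directory` (Linux only).

    Processes whose /proc/<pid>/fd cannot be read (for example other users'
    processes when not running as root) are skipped silently.
    """
    root = os.path.realpath(directory)
    prefix = root.rstrip(os.sep) + os.sep
    found = set()
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", name, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target == root or target.startswith(prefix):
                found.add(int(name))
                break
    return sorted(found)
```

For example, `pids_with_open_files("/path/to/sharddata.kfs")` returning a non-empty list would be a reason to hold off on running compact.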

AlexeyALeonov commented 6 years ago

Again, these are not shards. These are 256 LevelDB databases! Shards are stored inside the databases. Expired shards are removed by the reaper automatically every 24h. Just let it do its work. Do not remove database files yourself.

Hypocritus commented 6 years ago

I had terabytes of 2MB and 4MB .ldb files from the month of April within the /sharddata.kfs folders.

Are you telling me that that was valid shard data that Storj was still trying to access on the network?

Wasn't the reaper supposed to have removed that data from the database sometime last month?

AlexeyALeonov commented 6 years ago

It could be data from another NodeID, if you recreated the node in the same data folder. The data of each node is encrypted with its networkPrivateKey. If you recreated the node, you will have a new networkPrivateKey, and the old data will be unavailable to the new node and will lie idle.

However, the reaper will remove this data too, but without payment.

The creation date of a database file is not related to the access dates within the database. So that was valid data, and you have probably lost your future money for it.
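To illustrate why a file's creation date says nothing about the records inside it, here is a toy model. It is not the real kfs or LevelDB file format: the `TableFile` class, the day numbers, and the shard names are all invented for the example, and it assumes an old file can hold a record whose expiry still lies in the future.

```python
from dataclasses import dataclass, field

@dataclass
class TableFile:
    """Toy stand-in for one .ldb file: a creation day plus the records inside."""
    created_day: int
    records: dict = field(default_factory=dict)  # shard key -> expiry day

def delete_by_file_age(files, today, max_age_days):
    """Manual cleanup: drop whole files based only on their creation date."""
    return [f for f in files if today - f.created_day <= max_age_days]

def reap_expired(files, today):
    """Record-level cleanup: drop only the entries whose expiry has passed."""
    return [
        TableFile(f.created_day, {k: exp for k, exp in f.records.items() if exp > today})
        for f in files
    ]

def live_keys(files, today):
    """Keys that are still stored and not yet expired."""
    return {k for f in files for k, exp in f.records.items() if exp > today}

files = [
    TableFile(created_day=0, records={"shard-a": 80, "shard-b": 200}),
    TableFile(created_day=100, records={"shard-c": 190}),
]
today = 120

print(sorted(live_keys(delete_by_file_age(files, today, 91), today)))  # ['shard-c']
print(sorted(live_keys(reap_expired(files, today), today)))  # ['shard-b', 'shard-c']
```

In the toy model, deleting by file age loses shard-b even though it has not expired, while the record-level pass removes only shard-a. The real situation is worse than the model shows, because removing files also breaks the database's own bookkeeping, which is what the "missing files" corruption errors report.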

stefanbenten commented 6 years ago

@Hypocritus Those were part of a database; you can't tell what data is inside those files. If you removed those files, you likely killed your complete database structure and also deleted newer shards, which got removed by running the compact job.

Furthermore, your comment here is totally unnecessary, as Alexey mentioned that you should stop your node before running the task. Please read the instructions carefully next time.

Please ask in the forum next time before tinkering around with it.

Hypocritus commented 6 years ago

The creation date of a database file is not related to the access dates within the database. So that was valid data, and you have probably lost your future money for it.

@Hypocritus Those were part of a database; you can't tell what data is inside those files. If you removed those files, you likely killed your complete database structure and also deleted newer shards, which got removed by running the compact job.

I see... Stupid mistake...

Furthermore, your comment here is totally unnecessary, as Alexey mentioned that you should stop your node before running the task. Please read the instructions carefully next time.

This comment exists because I have often read instructions from the Storj community or staff that fail to make it clear that the node needs to be stopped when running kfs ... compact. It exists to reinforce the correct way when, in their search, people have come upon multiple sets of instructions which differ in the method of execution. I made the warned-of mistake for this reason. The user sometimes chooses the "easier" way, to their own damage.

storj-archived / storjshare-daemon

Deleted 91+day-old shards. Stats now steadily tanking to the "Abyssssss"...."sss" #344