Closed jfontan closed 5 years ago
DISTINCT should not be used to get the init commits, as it is much slower than getting all of them:
testing=# select count(init) from repository_references;
count
-----------
233888927
(1 row)
Time: 53338.143 ms
testing=# select count(distinct(init)) from repository_references;
count
----------
11849984
(1 row)
Time: 1209028.016 ms
To make the tool more flexible and easier to extend for unforeseen cases, some commands can create lists and others can act on those lists, in the style of Unix pipes. Most of the list and filter commands can be done with standard Unix commands, so they may not need to be implemented.
Some examples of commands could be:
tool siva database [<db connection string>]
tool siva disk <directory>
tool siva filter by-bucket <root dir> <list>
tool siva filter repeated <list1> <list2>
tool siva filter only-first <list1> <list2>
tool siva rebucket [--path <rooted-repos>] 0 2 <list>
tool siva delete [--path <rooted-repos>] [--bucket 2] <list>
tool siva repos <list>
tool repos siva <list>
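The proposal does not spell out what the two list filters do. A minimal sketch, assuming `filter repeated` keeps the entries of the first list that also appear in the second, and `filter only-first` keeps the entries that appear only in the first (the function names and sample file names here are hypothetical):

```python
from typing import Iterable, List


def repeated(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Entries of `first` that also appear in `second`, in original order."""
    seen = set(second)
    return [item for item in first if item in seen]


def only_first(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Entries of `first` that do not appear in `second`, in original order."""
    seen = set(second)
    return [item for item in first if item not in seen]


list1 = ["aa.siva", "bb.siva", "cc.siva"]
list2 = ["bb.siva", "dd.siva"]
print(repeated(list1, list2))    # ['bb.siva']
print(only_first(list1, list2))  # ['aa.siva', 'cc.siva']
```

Under the same assumed semantics, sorted list files could be filtered with `comm -12` and `comm -23` instead, which is why these two commands may not need to be implemented.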
Issues related to implementing the tool:
go-billy-gluster
#368
Problems
Possible solutions
List of proposed commands and which ones are needed now
Lost gluster brick
Accessing the gluster filesystem to check for the presence of a file can be very expensive. Instead of going through gluster, we can take the list of files directly from the bricks (native FS) and load that list in memory to find missing siva files:
Run find in each host, then merge the resulting lists and remove duplicates.
Index in References is already indexed, so we can get the list of distinct Index values and find which ones are not in the map.
For the missing ones, get Init and queue its repositories.
Notes
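The lookup described above could be sketched as follows. This is only an illustration under stated assumptions: each siva file is assumed to be named `<hash>.siva`, where the hash is the value read from the database, and the function names and sample paths are hypothetical. Duplicates from the database are removed in memory with a set rather than in the query.

```python
import posixpath
from typing import Iterable, List, Set


def siva_names(find_outputs: Iterable[Iterable[str]]) -> Set[str]:
    """Merge the per-host `find` outputs into one set of siva file stems.

    Replicated bricks list the same file more than once; the set removes
    those duplicates.
    """
    names: Set[str] = set()
    for lines in find_outputs:
        for line in lines:
            base = posixpath.basename(line.strip())
            if base.endswith(".siva"):
                names.add(base[: -len(".siva")])
    return names


def missing(db_hashes: Iterable[str], on_disk: Set[str]) -> List[str]:
    """Hashes known to the database that have no siva file on any brick."""
    return sorted(h for h in set(db_hashes) if h not in on_disk)


host_a = ["/brick/aa/aa11.siva\n", "/brick/bb/bb22.siva\n"]
host_b = ["/brick/bb/bb22.siva\n"]
db_rows = ["aa11", "bb22", "cc33", "cc33"]  # not de-duplicated by the query

on_disk = siva_names([host_a, host_b])
print(missing(db_rows, on_disk))  # ['cc33']
```

The repositories behind each missing hash would then be requeued.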
siva files not in buckets
One list of the bucketed siva files can be built the same way as for the lost gluster brick problem, and another one with the incorrectly placed files. The siva files that are in both lists must be deleted and their repositories requeued. The siva files that are only in the incorrect path must be moved to their proper bucket directory.
Delete the files (.siva) that appear in the two lists.
Notes
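A minimal sketch of how the two lists could be turned into actions. It assumes that a bucket directory is named after the first `bucket_size` characters of the file name; the action names, function names and paths are hypothetical.

```python
import posixpath
from typing import Iterable, List, Tuple


def bucket_path(root: str, name: str, bucket_size: int = 2) -> str:
    """Assumed layout: <root>/<first bucket_size chars of name>/<name>."""
    return posixpath.join(root, name[:bucket_size], name)


def plan(bucketed: Iterable[str], misplaced: Iterable[str],
         root: str, bucket_size: int = 2) -> List[Tuple[str, ...]]:
    """Decide what to do with each incorrectly placed siva file."""
    bucketed_names = {posixpath.basename(p) for p in bucketed}
    actions: List[Tuple[str, ...]] = []
    for path in misplaced:
        name = posixpath.basename(path)
        if name in bucketed_names:
            # Present in both lists: delete it and requeue its repositories.
            actions.append(("delete_and_requeue", name))
        else:
            # Only in the incorrect path: move it into its bucket directory.
            actions.append(("move", path, bucket_path(root, name, bucket_size)))
    return actions


bucketed = ["/repos/aa/aa11.siva", "/repos/bb/bb22.siva"]
misplaced = ["/repos/bb22.siva", "/repos/cc33.siva"]
for action in plan(bucketed, misplaced, "/repos"):
    print(action)
# ('delete_and_requeue', 'bb22.siva')
# ('move', '/repos/cc33.siva', '/repos/cc/cc33.siva')
```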
General notes
A --dry option that makes no changes but reports the actions that would be taken would be nice, as errors here can cause data loss. It could also save those actions in a replayable file that can be executed after reviewing them, which avoids doing some of the computation twice.
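One way the replayable file could work is sketched below. The format (one JSON object per line with hypothetical `op`, `src` and `dst` fields) and the function names are assumptions for illustration, not part of the proposal.

```python
import json
import os
import shutil
import tempfile
from typing import Dict, Iterable, List


def record(actions: Iterable[Dict[str, str]], plan_file: str) -> None:
    """Save the planned actions, one JSON object per line."""
    with open(plan_file, "w") as fh:
        for action in actions:
            fh.write(json.dumps(action) + "\n")


def replay(plan_file: str, dry: bool = True) -> List[str]:
    """Report every saved action; apply it only when dry is False."""
    log: List[str] = []
    with open(plan_file) as fh:
        for line in fh:
            action = json.loads(line)
            if action["op"] == "move":
                log.append(f"move {action['src']} -> {action['dst']}")
                if not dry:
                    os.makedirs(os.path.dirname(action["dst"]), exist_ok=True)
                    shutil.move(action["src"], action["dst"])
            elif action["op"] == "delete":
                log.append(f"delete {action['src']}")
                if not dry:
                    os.remove(action["src"])
            else:
                raise ValueError(f"unknown action: {action['op']}")
    return log


# Demo in a throwaway directory: plan a move, review it, then execute it.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "cc33.siva")
dst = os.path.join(workdir, "cc", "cc33.siva")
open(src, "w").close()
plan_file = os.path.join(workdir, "plan.jsonl")

record([{"op": "move", "src": src, "dst": dst}], plan_file)
dry_log = replay(plan_file, dry=True)   # nothing is touched
still_there = os.path.exists(src)
replay(plan_file, dry=False)            # the move is applied
moved = os.path.exists(dst) and not os.path.exists(src)
```

Keeping the plan in a plain text file also means it can be inspected or trimmed with ordinary Unix tools before it is replayed.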