sdgathman / lbatofile

Map LBA of disk sector to volume and file pathname.
GNU General Public License v2.0

RAID != 1 formats #2

Open tomato42 opened 7 years ago

tomato42 commented 7 years ago

I've noticed that the script doesn't support RAID formats other than 1 (lbatofile.py:210).

I'm guessing it's because it's fairly complex (different levels, different layouts). But have you found any ideas/starting points for it and just didn't have time to implement them, or have you not researched it yet?

sdgathman commented 7 years ago

Started to research, realized it would involve reading and interpreting raid metadata, and left it for future development. The only raid we deploy other than 1 is raid10 (mirror + stripe with any number of drives) - so that would be the next one I would tackle. Raid 5 is usually done in hardware, and raid0 generally means you don't care about recovering files anyway.

tomato42 commented 7 years ago

By "interpreting raid metadata" you mean getting to know which RAID device in a MD array the disk is, what's the chunk size, the layout and how many devices are there in the array, or something else/more low-level?

sdgathman commented 7 years ago

The former. Ideally, there would be a CLI utility that we could invoke, like we do with lvm and ext. I haven't really hunted through the mdadm options or checked for other md raid utils.

sdgathman commented 7 years ago

Ah, we do have:

# mdadm -Q --examine /dev/sdb5
/dev/sdb5:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 6e0ba2c0:33016560:bd00735e:08972090
  Creation Time : Fri Sep 28 08:24:28 2007
     Raid Level : raid1
  Used Dev Size : 121852928 (116.21 GiB 124.78 GB)
     Array Size : 121852928 (116.21 GiB 124.78 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1

    Update Time : Sun Jun 11 15:33:20 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : b74e1de2 - correct
         Events : 10317

      Number   Major   Minor   RaidDevice State
this     0       8       21        0      active sync   /dev/sdb5

   0     0       8       21        0      active sync   /dev/sdb5
   1     1       8        5        1      active sync   /dev/sda5

sdgathman commented 7 years ago

Here is output from a raid10 array:

/dev/sdb2:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x1
     Array UUID : 3e0ce7f0:9ec7691c:34106d4d:bd3e45ae
           Name : node0.example.com:1  (local to host node0.example.com)
  Creation Time : Mon Jun 24 11:02:17 2013
     Raid Level : raid10
   Raid Devices : 4

 Avail Dev Size : 1952471040 (931.01 GiB 999.67 GB)
     Array Size : 1952470016 (1862.02 GiB 1999.33 GB)
  Used Dev Size : 1952470016 (931.01 GiB 999.66 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
   Unused Space : before=1976 sectors, after=1024 sectors
          State : clean
    Device UUID : 0489bfb4:86370771:66ed949d:6a141ee1

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Jun 11 15:37:17 2017
       Checksum : fd2f809c - correct
         Events : 351943

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

tomato42 commented 7 years ago

From what I can tell, all the necessary information is in /proc/mdstat.

To map a disk sector to a sector in the MD device, we need the raid level, the chunk size, the layout and the disk's number in the array. The only thing that's not directly available there is the "data offset" from mdadm --examine.
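
A rough, untested sketch of scraping those fields out of mdadm (field names taken from the --examine dumps above; note the old 0.90 superblock in the first dump doesn't print Data Offset or Device Role at all):

import re
import subprocess

# Untested sketch: pull the values we need out of `mdadm --examine <member>`.
# Field names are the ones visible in the dumps above; values are kept as
# raw strings (e.g. "512K", "raid10", "near=2").
FIELDS = {
    'level': re.compile(r'^\s*Raid Level : (\S+)'),
    'raid_devices': re.compile(r'^\s*Raid Devices : (\d+)'),
    'chunk': re.compile(r'^\s*Chunk Size : (\S+)'),
    'layout': re.compile(r'^\s*Layout : (\S+)'),
    'data_offset': re.compile(r'^\s*Data Offset : (\d+) sectors'),
    'role': re.compile(r'^\s*Device Role : Active device (\d+)'),
}

def examine(member):
    """Return whichever of the above fields mdadm reports for one array
    member, e.g. examine('/dev/sdb2')."""
    out = subprocess.check_output(['mdadm', '--examine', member],
                                  universal_newlines=True)
    info = {}
    for line in out.splitlines():
        for key, pat in FIELDS.items():
            m = pat.match(line)
            if m:
                info[key] = m.group(1)
    return info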

BTW, I figured out the algorithm to map disk LBA to MD LBA for RAID 6 left-symmetric:

def raid6(disk_n, lba, chunk_in_lba, stripe_width, layout):
    """Map an LBA on one member disk to the corresponding LBA of the MD array.

    disk_n       -- index of the disk in the array (its "Device Role")
    lba          -- sector number on that disk, counted from the data offset
    chunk_in_lba -- size of the chunk as a count of LBAs (512B sectors)
    stripe_width -- number of devices in the array
    """
    if layout != "left-symmetric":
        raise ValueError("Only left-symmetric is supported")

    # which stripe is the lba in? and how many lba's into the chunk the block is
    stripe_n, block_n = divmod(lba, chunk_in_lba)
    print("stripe number: %s, block in chunk: %s" % (stripe_n, block_n))

    # which disk does the P block fall on in that stripe?
    p_pos = (stripe_width - 1 - stripe_n) % stripe_width
    print("Disk of the P block: %s" % p_pos)

    # what's the data chunk number in that stripe for the given disk
    chunk = (disk_n + stripe_width - p_pos - 2) % stripe_width
    print("Chunk number of the disk: %s" % chunk)
    if chunk >= (stripe_width - 2):
        print("Warning: parity chunk! Returning first data chunk for the stripe")
        chunk = 0

    overall_chunk_n = stripe_n * (stripe_width - 2) + chunk
    print("Logical chunk number: %s" % overall_chunk_n)

    print("Array LBA: %s" % (overall_chunk_n * chunk_in_lba + block_n))
    return overall_chunk_n * chunk_in_lba + block_n
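
For example, on a made-up 4-drive array with 512K chunks (1024 sectors per chunk), a bad sector at LBA 5000 (already relative to the data offset) on disk 2 maps like this:

# Made-up numbers, just to show the call:
array_lba = raid6(disk_n=2, lba=5000, chunk_in_lba=1024,
                  stripe_width=4, layout="left-symmetric")
# stripe 4, data chunk 1 of that stripe -> logical chunk 9, array LBA 10120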

I wonder how parity blocks should be handled though...

sdgathman commented 7 years ago

Parity blocks are the same as data blocks. For simplicity, consider a raid 5 with 3 drives. A given block is composed of 3 lower level blocks - any two of which can be used to compute the block seen by the upper level. But that is irrelevant for our purposes - all 3 blocks map to the same upper level block (which is twice as big as the lower level blocks).
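
A toy illustration of that 3-drive case (RAID 5 parity is a plain XOR of the data blocks; RAID 6 adds a second, more involved syndrome, but the mapping question is the same):

# One upper-level block split across 3 drives: two data blocks plus parity.
d0 = 0b10110010          # data block on drive 0 (a single toy byte)
d1 = 0b01101100          # data block on drive 1
p  = d0 ^ d1             # parity block on drive 2

# Any two of the three lower-level blocks reconstruct the third:
assert d0 == p ^ d1      # drive 0 lost
assert d1 == p ^ d0      # drive 1 lost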

tomato42 commented 7 years ago

Linux MD RAID doesn't checksum data on read; it does that only on an actual error returned from the disk or when the array is degraded. In other words, it won't detect silent data corruption.

So for data drives the mapping is one to one, but for parity it is one to many.

That means that if we hit a file, rewriting that file is all that's necessary to rewrite the parity block (and no data was damaged). The problem is when the first data chunk in the stripe is free space, but not the second or third...

sdgathman commented 7 years ago

One wrinkle is that the upper level blocks are generally bigger than lower level blocks - which are generally bigger than physical sectors. So that bad sector is going to map to multiple sectors in an upper level block. And likely, multiple filesystem blocks (and maybe even multiple files).

sdgathman commented 7 years ago

Actually, Linux MD RAID does checksum data when you do the check operation - which on RHEL and Fedora runs once a week by default (configurable). This works admirably for auto-repairing bad sectors on production systems - even with one drive failing.

Also, on normal read, it will pick any sufficient subset of the drives in a RAID - depending on how busy each drive is. So it often randomly encounters read errors - and will read the data from another drive at that point. I believe it doesn't rewrite the bad sector by default due to concurrency issues (another write may already be on the way), but it does log it.

tomato42 commented 7 years ago

One wrinkle is that the upper level blocks are generally bigger than lower level blocks - which are generally bigger than physical sectors.

How is that a problem? I'd say that's actually what makes it easy for RAID devices, as the unit of atomicity is still the LBA. You can't have data from two files in a single LBA. So if we map the physical LBA of the HDD to the logical LBA of the MD, we'll find just one file there (and that's exactly the mapping the above code does).

LVM extents work exactly the same way; they're just larger than MD RAID chunks.

Or to put it another way, if you take a RAID 6 array and put the data blocks in order (1, 2, 3, etc.), you will see exactly what is presented as the MD device.
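
That ordering can be spelled out with the same left-symmetric assumptions as the raid6() function above; a small sketch (data offset ignored, names made up):

def raid6_data_chunk(array_chunk, stripe_width):
    """Inverse direction of the mapping above: given a logical data chunk
    number of the MD device, return (disk, stripe) of the chunk that holds
    it.  Left-symmetric assumed, data offset ignored."""
    stripe_n, d = divmod(array_chunk, stripe_width - 2)
    p_pos = (stripe_width - 1 - stripe_n) % stripe_width
    disk = (p_pos + 2 + d) % stripe_width     # data follows P and Q
    return disk, stripe_n

# Walking the logical chunks 0, 1, 2, ... visits the data chunks in exactly
# the order the MD device presents them, e.g. for 4 drives:
for c in range(6):
    print("logical chunk %d -> disk %d, stripe %d" % ((c,) + raid6_data_chunk(c, 4)))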

Actually, Linux MD RAID does checksum data when you do the check operation - which on RHEL and Fedora runs once a week by default (configurable).

Yes, on check, not with every read. So if the error is isolated and falls on the parity block, you can be sure the data read from the array is trustworthy, as the error will simply be ignored and won't impact operation.

Also, on normal read, it will pick any sufficient subset of the drives in a RAID - depending on how busy each drive is. So it often randomly encounters read errors - and will read the data from another drive at that point.

So you're saying that it may sometimes read the parity block instead of a data block and calculate the missing data block from all the other data blocks and the parity block? OK. Now, if that parity block read fails, it will fall back on an actual data block and continue with that? How does that change what I said - errors in parity blocks are irrelevant for read operations?

sdgathman commented 7 years ago

BTW, before we start arguing too heatedly - Thanks for taking an interest!

I think I'm trying to say that all the low level blocks are equally important, there is nothing special about the parity block (other than how the logical block is computed when used). But that is a side issue to the purpose of lbatofile.

You are correct - I hadn't realized that when it comes to LBAs, multiple low level LBAs map to the same upper level LBA, and the block size is not really relevant (except for computing said mapping).

If we get many more new mappings into a release, lbatofile might need a plugin architecture, so we can put mappings in their own file and have them registered with the main loop.
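
Something like this would cover the registration part (hypothetical names, just a sketch, not how lbatofile is laid out today):

# Hypothetical sketch: each mapping lives in its own module and registers
# itself with the main loop, keyed by md raid level.
LBA_MAPPERS = {}

def lba_mapper(level):
    """Decorator registering a disk-LBA -> array-LBA mapper for one raid
    level ('raid1', 'raid10', 'raid6', ...)."""
    def register(func):
        LBA_MAPPERS[level] = func
        return func
    return register

# A mapping module would then just decorate its function, e.g.
#
#   @lba_mapper('raid6')
#   def raid6(disk_n, lba, chunk_in_lba, stripe_width, layout):
#       ...
#
# and the main loop does LBA_MAPPERS[info['level']](...).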

I only use raid1 and raid10 - but I think I can create a test raid6 array on an LV to test with.

tomato42 commented 7 years ago

BTW, before we start arguing too heatedly

Well, I could have been wrong, and I wouldn't have known that without arguing, but I'd say we reached common ground and we both learned something, and that's nice :)

lbatofile might need a plugin architecture, so we can put mappings in their own file and have them registered with the main loop.

The differences between the various RAID 6 layouts are rather small, something that will probably require adding 2, maybe 3, if statements to the algorithm above, and similarly for raid10. Using a plugin architecture for essentially 12 lines of code is a bit much, IMHO. But let's implement the minimum first, see if that estimate is correct, and make a decision then.

I only use raid1 and raid10 - but I can create a test raid6 array on an LV to test I think.

For testing I'm actually using sparse files and loop devices :)
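
In case it helps, roughly this kind of setup (made-up names and sizes; needs root and a machine you don't mind scribbling on):

import subprocess

def sh(*cmd):
    return subprocess.check_output(cmd, universal_newlines=True).strip()

# Four sparse backing files turned into loop devices...
loops = []
for i in range(4):
    img = 'raid6-test-%d.img' % i
    sh('truncate', '-s', '512M', img)
    loops.append(sh('losetup', '--find', '--show', img))

# ...and a raid6 array on top of them, then examine one member.
sh('mdadm', '--create', '/dev/md/lbatest', '--level=6',
   '--raid-devices=4', *loops)
print(sh('mdadm', '--examine', loops[0]))
# (clean up afterwards with mdadm --stop and losetup -d)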

sdgathman commented 7 years ago

For testing I'm actually using sparse files and loop devices :)

Good idea!