corruption scenario: large data segments > 16344 bytes

msuhanov commented 8 years ago

Hello.

At present, size of a single full data segment is calculated in the following way: size of a cell - 2*4. But Windows is using the hard-coded value: 16344 (and only the last data segment may have a smaller size: value size - bytes read while parsing previous segments).

joachimmetz commented 8 years ago

"may" does not mean it cannot be something else also see: https://github.com/libyal/libregf/blob/master/documentation/Windows%20NT%20Registry%20File%20(REGF)%20format.asciidoc#66-value-key-data-size---data-block-segments-size-mismatch

msuhanov commented 8 years ago

When you read a big data segment in Windows, you read 16344 bytes for each segment, except the last one. If, for some reason, a single full data segment is residing in a cell that is not 16352 bytes in length (but is larger), your code will produce a different data value compared to Windows.

This is a bug report, not a question.

joachimmetz commented 8 years ago

This is a bug report, not a question.

The goal of libregf is not to match the Windows implementation. If you have a sample where the size is interpreted wrong please attach it to the github issue so I can assess it this is a bug. I would argue that if the a segment contains > 16344 then it is something libregf could mark the value seeing that it would be worth investigating what the additional data contains.

msuhanov commented 8 years ago

I would argue that if the a segment contains > 16344 then it is something libregf could mark the value seeing that it would be worth investigating what the additional data contains.

There is no point in corrupting a data value in the output by including slack space. What if there is an executable stored in the data segments? You will misalign its code. At present, you subtract 4 from a cell size and then subtract 4 again (here: _segment_size = hive_bincell->size - 4;), this throws away 4 bytes of slack space (each data segment usually has 4 extra bytes at the end of a cell), but, in fact, you don't need to ignore the last 4 bytes of a corresponding cell, you need to ignore the last size of a cell - 4 - 16344 bytes when reading a full big data segment.

joachimmetz commented 8 years ago

I have a hard time following you, what you actually mean. Seeing that you did not respond to my request to provide a test file I'll need to have a look if I can craft some test files then. From the perspective of both the format and the Windows implement. To see what the expected behaviour is and how to adequately deal with the scenario of "> 16344" segment sizes. This will highly depend on the amount of available time.

msuhanov commented 8 years ago

Look at the following piece of WRK code to understand what I'm talking about: cmvalue.c. The same code can be found in the NTOS kernel of any recent Windows version using a disassembler.

joachimmetz commented 8 years ago

At first glance, you're talking about a memory backed Registry implementation. As I see no bounds check for the hive cell data what so ever in the routine you refer to. If this routine reads directly from file there are some nice unexpected edge cases e.g. PartialData at the end of the hive bin overflowing the hive bin (page) boundary or if PartialData < CM_KEY_VALUE_BIG which is a scenario that could be easily crafted within a file.

I suspect the bounds/consistency checks are done somewhere else, but don't have the time at the moment to fully dive into this code base. Also the indirection of HvGetCell https://github.com/hacksysteam/WRK-1.2/blob/08ce546e1eff1f14bac093d2609428a93a48b645/base/ntos/config/cmvalue.c#L358 and https://github.com/hacksysteam/WRK-1.2/blob/08ce546e1eff1f14bac093d2609428a93a48b645/base/ntos/config/hive.h#L309 make it tricky to say what the implementation will be actually be doing.

msuhanov commented 8 years ago

I as see no bounds check for the hive cell data what so ever in the routine you refer to.

Windows checks a hive when loads it from a disk. These checks, however, are very relaxed, there are many edge cases when Windows will show the BSOD.

But the thing is that a Windows kernel reads _(Length>CM_KEY_VALUE_BIG)?CM_KEY_VALUEBIG:Length bytes from a cell with a data segment (and the Length variable contains the remaining length of a data value to be read), and your code reads Cell size - 2*4 bytes (when the remaining length of a data value is large enough), and this is not always equal to CM_KEY_VALUE_BIG (but in default allocation scenarios it is).

msuhanov commented 8 years ago

Also, Timothy D. Morgan wrote (see figure 7) that data segments, except the last one, are 16344 bytes in length.

joachimmetz commented 8 years ago

As I said I'll do some tests and see what the results are and based on that information determine what the best approach is.

joachimmetz commented 4 years ago

Created a very large value

libfdata_list_get_element_value: cache: 0x007c9670 hit (5876 out of 65536)
libregf_value_key_read_data: value key data:
00000000: 76 6b 0e 00 b1 fe 0f 00  38 cb b1 00 03 00 00 00   vk...... 8.......
00000010: 01 00 b3 01 56 65 72 79  4c 61 72 67 65 56 61 6c   ....Very LargeVal
00000020: 75 65 65 56                                        ueeV

libregf_value_key_read_data: signature                                  : vk
libregf_value_key_read_data: value name size                            : 14
libregf_value_key_read_data: value data size                            : 0x000ffeb1 (1048241)
libregf_value_key_read_data: value data offset                          : 0x00b1cb38
libregf_value_key_read_data: value type                                 : 0x00000003 (REG_BINARY) Binary data
libregf_value_key_read_data: flags                                      : 0x0001
        Value name is an ASCII string (VALUE_COMP_NAME)

libregf_value_key_read_data: unknown1                                   : 0x01b3 (435)
libregf_value_key_read_data: value name                                 : VeryLargeValue
libregf_value_key_read_data: value name hash                            : 0xd410f87a
libregf_value_key_read_data: padding:
00000000: 65 56                                              eV

Which has segments > 16344

...
libfdata_list_get_element_index_at_offset: element: 5916        mapped range: 0x01b7a000 - 0x01b7c000 (size: 8192)
libfdata_list_get_element_index_at_offset: element: 5915        mapped range: 0x01b79000 - 0x01b7a000 (size: 4096)
libfdata_list_get_element_index_at_offset: element: 5914        mapped range: 0x01b75000 - 0x01b79000 (size: 16384)
libfdata_list_get_element_index_at_offset: element: 5913        mapped range: 0x01b71000 - 0x01b75000 (size: 16384)
libfdata_list_get_element_index_at_offset: element: 5912        mapped range: 0x01b6e000 - 0x01b71000 (size: 12288)
libfdata_list_get_element_index_at_offset: element: 5911        mapped range: 0x01b6d000 - 0x01b6e000 (size: 4096)
...

input data matches output data

This is a bug report, not a question.

Not a bug, WAI. closing issue and marking this issue as invalid.

msuhanov commented 4 years ago

The output you posted has no relation to the bug reported. The library works as intended in the usual scenario, but this is not the one.

joachimmetz commented 4 years ago

@msuhanov then provide me with a test file as requested before. I cannot fix hypothetical issues.

msuhanov commented 4 years ago

$ yarp-print BigDataUnusualLayout | grep -a SSSS | grep -oa S | wc -l
8173

$ grep -oa S BigDataUnusualLayout.reg | wc -l
8173

$ reglookup BigDataUnusualLayout | grep -a SSSS | grep -oa S | wc -l
WARN: Left over chunks detected while constructing big data at offset 0x00001258 (chunk offset 0x00004020).
8173

(Actually, the output of the reglookup tool doesn't match the way how Windows works, the output contains 8172 "S" characters for the data itself, plus one "S" character in the "REG_SZ" type listed. The "real" data from Windows is shown in the exported REG file.)

$ regfexport BigDataUnusualLayout | grep -a SSSS | grep -oa S | wc -l
8172

The files are:

BigDataUnusualLayout: a primary file with the first big data segment adjusted to a larger size using a HEX editor;
BigDataUnusualLayout.reg: a key with an adjusted value exported using Regedit running under Windows 7 (with the hive mounted to HKLM\1);
BigDataUnusualLayout.txt: a hive exported to text using the regfexport tool.

Note that your tool returns the "IN" characters at the end of a value (instead of "S\x00"). BigDataUnusualLayout.zip

joachimmetz commented 4 years ago

BigDataUnusualLayout: a primary file with the first big data segment adjusted to a larger size using a HEX editor;

Thx. So this is a manually manipulated test file, not a file that would be generated by some format edge case.

I'll compare it against the import behavior of Windows to see what the system behavior is and if it allows to for such file to be imported. If Windows does not support such a file I have to conclude the behavior is undefined.

msuhanov commented 4 years ago

Windows works well with that file. The BigDataUnusualLayout.reg file was generated after mounting that file.

joachimmetz commented 4 years ago

Thx, I see what you mean. Tested by importing the hive Windows will skip the additional data. I'll add it as a corruption scenario. Windows might ignore the additional but might be still good to flag or expose this somehow.

joachimmetz commented 4 years ago

Addressed in https://github.com/libyal/libregf/commit/3eedc54fd401551c6410b19dbb519941da701a57 closing issue

libyal / libregf

corruption scenario: large data segments > 16344 bytes #5