markmoe19 opened 1 year ago
Good question. At the moment dwalk and the other tools can only read the entire file at once. We'd have to hack together a tool for the streaming bit.
The code that reads the .mfu file (file format version 4) is here:

The loop that unpacks an encoded file entry read from the .mfu file is here:
The biggest change is that we couldn't unpack the entries into an mfu_flist the way this function does, since the flist structure expects the full list to be loaded in memory at once. However, one could modify the unpack logic to just print the file name instead of inserting the element into the list.
The list_insert_ptr() function unpacks each element and adds it to the list:
Most of the heavy lifting in parsing the data for each file is in list_elem_unpack, which shows how the fields in the element are set:
You could perhaps just cut-paste-edit the list_insert_ptr function into a print_ptr version that allocates, unpacks, prints the file name, and frees the element, something like:
```c
static size_t print_ptr(char* ptr, int detail, uint64_t chars)
{
    /* unpack one entry, print its name, then discard it */
    elem_t* elem = (elem_t*) MFU_MALLOC(sizeof(elem_t));
    size_t bytes = list_elem_unpack(ptr, detail, chars, elem);
    printf("%s\n", elem->file);
    mfu_free(&elem->file);
    mfu_free(&elem);
    return bytes; /* caller advances ptr by this many bytes */
}
```
It would be cleaner still to avoid allocating and freeing the element each time.
Actually, after reviewing the code, you may be able to read this file back on the same node using dwalk.
The current v4 of the .mfu format stores file names as fixed-width fields, where every file name is padded out to the length of the longest file name in the set. That decision makes it easy to seek to a specific entry in the .mfu file, but it also significantly inflates the .mfu file size if there are many files and one really long filename:

disk space ~ numfiles * max(filename length)
When reading the file entries back from the .mfu file, we copy each file name using strdup. The strdup drops all of that extra padding, so the same list when read back into memory takes less space than it does on disk.
If you generated this list from a single node, you might be able to read it back on that single node.
With the eventual v5 .mfu file format, whenever that comes, I hope we can store file names using variable-length structures to avoid this problem. That will likely require adding an index to support efficient seeks.
If you find that you can't use dwalk, let me know, and I can hack up a branch with a tool to get you started.
I have a large 1.4TB .mfu file generated by dwalk for 502M items. I want to generate an unsorted text output file from this mfu file.
Does dwalk read the entire .mfu file into RAM before outputting the text file? For sorted output, I can see why it would read everything into RAM. But for unsorted output, could dwalk “stream” the output as it reads the .mfu input, and thereby not use much RAM?
I’m asking because I have a service node that can generate the .mfu file but doesn’t have enough RAM to produce (unsorted) text output from that same file.
Thanks!