Adding PatternIndex - Githubissues

Pattern Index

This document is separated into 3 parts. The first part is the overview of what Jun's library contains and where you can find it as well as the state of Jun's PLFS code. The second part is potential problems that you may run into. The third part is some questions and answers from a discussion on the PLFS Team email list. The fourth part is code-like version of the interface with comments detailing what occurs in each part.

Part 1: Jun has a library that does the pattern matching algorithm described in his paper. You can find it here: https://github.com/junhe/libpatterndetector. This library consists two files that is used to find patterns. It takes in a list of offsets and a list of lengths and returns the pattern given by those two lists. This covers the basic algorithm that Jun created. It does so by pushing onto a stack the offsets and lengths that are to be used and then making a pattern at some point. You can also look for the "nth" item in the pattern.

His PLFS trunk is here https://github.com/junhe/plfs. For the most part, it uses the same code as the code in the library above. (In fact most of that code is exactly the same). For your reference, the code that is used is in plfs/src/idxanalyzer.h. Jun assumed no overlaping code and therefore, never really used the timestamps or the other information of the index. This is something that will need to be changed.

Part 2: There are two main problems that still need to be figured out when the code comes together. The first is how to do truncate for the PLFS file and the pattern. When it's greater than the size, this is trival. Just add "zero bytes" writes to the end of the buffer. There is now a hole. How, do we represent a hole (besides the boolean value that a hole exists)? In the pattern what do we do? (will it matter?) The next entry will be at a different offset? Or maybe not?

The second problem is overwriting and overlaps of the patterns. When Jun created his code, he assumed that there were no overlaps or overwrites. Because of this, he never needed to keep the timestamps around. Timestamps and other information are needed during overwrites. The main idea here is to detect when there are overwrites occurring and then have some sort of boolean value that indicates that the formula is out of date and will need to be redone.

Part 3: "Q: Is it possible to detect overwrites and not use patterns when overwrites are detected? A: If my understanding is correct (and Chuck can correct me if I'm wrong), there should be a way to detect overwrites. It will require the timestamps though. In those cases, I was thinking about having a boolean that says the pattern is "out of date" and not use it.

Q: Are we detecting and using patterns at write time or at read time? A: Pattern detection does not take a lot of time, so my plan was to detect and create the pattern upon write when the index is closed. The library allows you to just "push" onto its stack offsets and lengths that will be used and then at some point just make the pattern from that list. However, I haven't yet figured out the best case for the reading while writing.

Q: Are we doing this in PLFS lib? Are we thinking about looking for global patterns and doing optimizations at the ad_plfs layer? I don't know yet where we will be doing this. We can look for global patterns. When Jun made his original version, he made a pattern for every pid index on write and then on read made global patterns if he could. So we may want to follow that model. However, again I'm not entirely sure how that will work when an index is open for both read and write.

For end-to-end data integrity, we may in the future add checksums to the index entries. In this case, the user may wish to either use the checksums or the pattern as they cannot use both (i.e. the checksums which presumably are random will prevent collapsing multiple patterned index entries)."

Part 4: /* PLFS Pattern Interface */

/**

indexrecord: structure that contains the result of an index query / struct indexrecord { bool hole; / set to true if we hit a hole / string datapath; / bpath to data dropping _/ struct plfsbackend *databack; / backend datapath resides on_/ off_t chunkoffset; / offset of data in datapath file _/ sizet length; / number of bytes we can read here / bool lastrecord; / true if we hit EOF */ };

/**

ContainerIndex: pure virtual class for container index / class ContainerIndex { virtual plfs_error_t index_open(Container_OpenFile cof, int open_flags) = 0; virtual plfs_error_t index_close(Container_OpenFile cof, int open_flags) = 0; virtual plfs_error_t index_add(Container_OpenFile cof, size_t nbytes, off_t offset, pid_t pid) = 0; virtual plfs_error_t index_sync(Container_OpenFile cof) = 0; virtual plfs_error_t index_query(Container_OpenFile cof, off_t input_offset, size_t input_length, vector &result) = 0; virtual plfs_error_t index_truncate(struct plfs_physpathinfo ppip, off_t offset) = 0; virtual plfs_error_t index_getattr_size(struct plfs_physpathinfo ppip, off_t st_size_p, blkcnt_t st_blocks_p, blksize_t *st_blksize_p) = 0; };

/**

PatternIndex: Pattern instance of a PLFS container index _/ class PatternIndex: ContainerIndex { public: PatternIndex(); /constructor/ ~PatternIndex(); /_destructor*/

plfs_error_t index_open(Container_OpenFile cof, int open_flags); plfs_error_t index_close(Container_OpenFile cof, int open_flags); plfs_error_t index_add(Container_OpenFile cof, size_t nbytes, off_t offset, pid_t pid); plfs_error_t index_sync(Container_OpenFile cof); plfs_error_t index_query(Container_OpenFile cof, off_t input_offset, size_t input_length, vector &result); plfs_error_t index_truncate(struct plfs_physpathinfo ppip, off_t offset); plfs_error_t index_getattr_size(struct plfs_physpathinfo ppip, off_t st_size_p, blkcnt_t st_blocks_p, blksize_t st_blksize_p);

private: RequestPatternList formula; vector bufferedIndices;

Additional states? *
Need a place to keep the buffered results before they are written *
XXX: somewhere in memory/on disk to write all the timestamps and
data that isn't being used in the pattern recognition. *
XXX: a function to write this information to disk to be used when
overwrites happen and the formula needs to be updated - possibly
a boolean value on whether or not the formula is out of date and
needs to be updated. then we can push the update to the read side
rather than the write side and be more in tune with the goals
of keeping checkpointing time down */

/**

PatternIndex::PatternIndex constructor / PatternIndex::Pattern() { /
- This should set up the pattern formula. Not much is done
- here as not much set up is needed */ }

/**

PatternIndex::PatternIndex destructor / PatternIndex::~Pattern() { /
- Sanity check that we aren't closing something that
- shouldn't be closed */ }

/**

PatternIndex::indexopen / plfs_error_t index_open(Container_OpenFile *cof, int openflags) { /
- XXX: where to write the other data that isn't in the formula
- Do we wish to expand the formula for other information (currently
- it does the offsets and the lengths, but not the pid, timestamps, etc)
- or do we wish to just keep that information seperately?
- 1. The index has already been created:
- If the index has already been created and we are writing,
- we will buffer the index records in memory. Aftewards, we
- will run the pattern recognition on them and write out the formula
- 1. what is the open flag? O_RDWR, O_WRONLY, O_RDONLY
- O_RDONLY
- w/o previous index: Nothing to be read. So zero-length file
- w/previous index: get the formula to read from
- XXX: look up the other information and return the index
- O_WRONLY
- w/o previous index: We just want to set up the index write
- buffer that we will catch new entires in memory (up to some
- threshold- preferably most of the indices as the formula will
- be better with more information). When we meet the threshold we
- will continue to push new requests onto the pattern
- on close we will run the formula
- w/previous index: We just continue to add new entries to the
- buffer. When the threshold is met, push these requests onto the
- pattern request list for running the pattern recognition on and
- getting the formula
- O_RDWR
- Not too clear what to do here. Possibly want to just always push writes
- onto the request, but it is a little unclear when to do the formula
- w/o previousy index: ???
- w/previous index: ??? */ }

/**

PatternIndex::indexclose / plfs_error_t index_close(Container_OpenFile *cof, int openflags) { /
- Called when the reference count of the index drops to zero.
- Flush out all buffered unwritten index records as above */

}

/**

PatternIndex::indexadd / plfs_error_t index_add(Container_OpenFile *cof, size_t nbytes, off_t offset, pidt pid) { /
- Adds the index to the buffered indices. (Pushes the requests
- onto the library's "stack" for use when the formula is written) */

}

/**

PatternIndex::indexsync / plfs_error_t index_sync(ContainerOpenFile *cof) { /
- Flush out any leftover buffered indices and write
- the formula - maybe write multiple formulas and then
- combine them as needed throughtout this? (ie. write one
- but if we need to write more get another pattern and then
- run it once on the two patterns and get the final overall pattern) */ }

/**

PatternIndex::indexquery / plfs_error_t index_query(Container_OpenFile *cof, off_t input_offset, size_t input_length, vector
&result) { /_
- Use the formula to get the start of the offset (in position) and
- then use the formula to get the rest of the offsets. It's unclear
- if there is a better way, as we don't know the position of input_offset
- without going through the formula (?)
- XXX: Unclear how to get the other information. Where are we storing it? */

}

/**

PatternIndex::indextruncate / plfs_error_t index_truncate(struct plfs_physpathinfo *ppip, offt offset) { /
- The file may or may not be open.
- This is sort of unclear how it will happen with the pattern. We could
- maybe add zero bytes, which would result in a part of the forumla where
- the deltas are zero and we could get the "hole" back from the formula.
- As for if the size is less than the size of the index, either we can redo
- the formula, or somehow keep the information as to where the final "new"
- offset should now be and check that when we look up the formula */ }

/**

PatternIndex::index_getattrsize / plfs_error_t index_getattr_size(struct plfs_physpathinfo ppip, off_t st_size_p, blkcnt_t st_blocks_p, blksize_t st_blksizep) /
- Container metadata. If the file is open, we should have the lengths
- somewhere saved for the formula, so we can maybe check it that way? */ };

If you have any questions about any of this document, feel free to email me at apeterso@andrew.cmu.edu. I will do my best to get back to you.

plfs / plfs-core

Adding PatternIndex #332