plfs / plfs-core

LANL no longer develops PLFS. Feel free to fork and develop as you wish.
41 stars 36 forks source link

Adding PatternIndex #332

Open johnbent opened 10 years ago

johnbent commented 10 years ago

I'm attaching Annika's writeup of the steps needed to get pattern index into the code.

johnbent commented 10 years ago

Pattern Index

This document is separated into 3 parts. The first part is the overview of what Jun's library contains and where you can find it as well as the state of Jun's PLFS code. The second part is potential problems that you may run into. The third part is some questions and answers from a discussion on the PLFS Team email list. The fourth part is code-like version of the interface with comments detailing what occurs in each part.

Part 1: Jun has a library that does the pattern matching algorithm described in his paper. You can find it here: https://github.com/junhe/libpatterndetector. This library consists two files that is used to find patterns. It takes in a list of offsets and a list of lengths and returns the pattern given by those two lists. This covers the basic algorithm that Jun created. It does so by pushing onto a stack the offsets and lengths that are to be used and then making a pattern at some point. You can also look for the "nth" item in the pattern.

His PLFS trunk is here https://github.com/junhe/plfs. For the most part, it uses the same code as the code in the library above. (In fact most of that code is exactly the same). For your reference, the code that is used is in plfs/src/idxanalyzer.h. Jun assumed no overlaping code and therefore, never really used the timestamps or the other information of the index. This is something that will need to be changed.

Part 2: There are two main problems that still need to be figured out when the code comes together. The first is how to do truncate for the PLFS file and the pattern. When it's greater than the size, this is trival. Just add "zero bytes" writes to the end of the buffer. There is now a hole. How, do we represent a hole (besides the boolean value that a hole exists)? In the pattern what do we do? (will it matter?) The next entry will be at a different offset? Or maybe not?

The second problem is overwriting and overlaps of the patterns. When Jun created his code, he assumed that there were no overlaps or overwrites. Because of this, he never needed to keep the timestamps around. Timestamps and other information are needed during overwrites. The main idea here is to detect when there are overwrites occurring and then have some sort of boolean value that indicates that the formula is out of date and will need to be redone.

Part 3: "Q: Is it possible to detect overwrites and not use patterns when overwrites are detected? A: If my understanding is correct (and Chuck can correct me if I'm wrong), there should be a way to detect overwrites. It will require the timestamps though. In those cases, I was thinking about having a boolean that says the pattern is "out of date" and not use it.

Q: Are we detecting and using patterns at write time or at read time? A: Pattern detection does not take a lot of time, so my plan was to detect and create the pattern upon write when the index is closed. The library allows you to just "push" onto its stack offsets and lengths that will be used and then at some point just make the pattern from that list. However, I haven't yet figured out the best case for the reading while writing.

Q: Are we doing this in PLFS lib? Are we thinking about looking for global patterns and doing optimizations at the ad_plfs layer? I don't know yet where we will be doing this. We can look for global patterns. When Jun made his original version, he made a pattern for every pid index on write and then on read made global patterns if he could. So we may want to follow that model. However, again I'm not entirely sure how that will work when an index is open for both read and write.

For end-to-end data integrity, we may in the future add checksums to the index entries. In this case, the user may wish to either use the checksums or the pattern as they cannot use both (i.e. the checksums which presumably are random will prevent collapsing multiple patterned index entries)."

Part 4: /* PLFS Pattern Interface */

/**

/**

/**

private: RequestPatternList formula; vector bufferedIndices;

/*

/**

/**

/**

/**

}

/**

}

/**

/**

}

/**

/**

If you have any questions about any of this document, feel free to email me at apeterso@andrew.cmu.edu. I will do my best to get back to you.