jbrukh / ggit


Random access for packs without the whole pack in-memory #31

Closed MikeBosw closed 11 years ago

MikeBosw commented 11 years ago

We need random access for entries that haven't been encountered yet (pack entries are not ordered), but we don't need to load the whole file into memory for this.

jbrukh commented 11 years ago

Question: if we don't load the packfile, what is the advantage of constantly hitting one file to get objects vs. hitting multiple files to get objects?

Or is it that we can open the file for random access and just keep the file descriptor open for the duration of the program, thereby bypassing overhead for opening?

But still, if we don't have the pack in memory, what is the advantage?


MikeBosw commented 11 years ago

We don't run out of memory.

MikeBosw commented 11 years ago

Sorry, had to write that response in 5 seconds. The current implementation completely parses every pack file, constructing every object, etc. Packs can be gigabytes in size. With the current approach, if your object happens to be in a pack and you run e.g. ggit cat-file -p on it, then cat-file is as heavyweight as launching the Highbridge intranet portal, albeit a much more useful and rewarding endeavor. The approach I'm suggesting would take something like one millionth the RAM, and ditto for CPU cycles.

I'm also interested in the exercise of having ggit - more importantly, ggit's API - handle the case where not all of the repository fits in memory.

jbrukh commented 11 years ago

Ok, understood. That is a great exercise.

At the moment, the implementation isn't quite there (though I only glanced at it casually), since you seem to be reading everything into memory anyway, but we can look at it in more detail once you get in the other changes you wanted to make.

Overall, this is awesome. It completes object reading and basically means we have a minimum viable library for reading. *champagne pop*

Jake


MikeBosw commented 11 years ago

WORD YO

jbrukh commented 11 years ago

Let me know when you're going to be ready for CR. I am now going to work on thorough unit tests, benchmarks, and general cleanup. After that we can call Milestone 1 complete.

Here are some ideas on where to go next:

Jake
