jegesh / python-random-access-file-reader

Low memory usage random access reader for csv and general files

[feature request] Saving a file index to avoid recomputing each time #4

Open zjijz opened 6 years ago

zjijz commented 6 years ago

@jegesh Could we add the ability to save a file index to disk, so the file doesn't have to be re-indexed each time it is opened? I've been using this package to load machine learning batches, and re-indexing the file on every training run has added a noticeable delay.

A similar package called linereader saves the index automatically.

I can help implement this too.

carvetighter commented 6 years ago

Would you recommend pickling the index, possibly in the same directory as the file? A possible file extension could be "*.idx".
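A minimal sketch of that idea, assuming the index is a plain list of byte offsets (one per line start) and that the pickled copy lives next to the data file with an ".idx" suffix. The function names `build_line_index`, `load_or_build_index`, and `read_line` are hypothetical and are not part of this package's API. The file's size and modification time are stored alongside the offsets so a stale index is rebuilt rather than trusted.

```python
import os
import pickle


def build_line_index(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def load_or_build_index(path, index_ext=".idx"):
    """Load a pickled index stored next to `path`; rebuild it if missing or stale."""
    index_path = path + index_ext
    stat = os.stat(path)
    # Size and mtime act as a cheap check that the data file has not changed.
    signature = (stat.st_size, stat.st_mtime_ns)

    if os.path.exists(index_path):
        try:
            with open(index_path, "rb") as f:
                saved = pickle.load(f)
            if saved.get("signature") == signature:
                return saved["offsets"]
        except (pickle.UnpicklingError, EOFError, AttributeError, KeyError):
            pass  # Unreadable or unexpected index file: fall through and rebuild.

    offsets = build_line_index(path)
    with open(index_path, "wb") as f:
        pickle.dump({"signature": signature, "offsets": offsets}, f,
                    protocol=pickle.HIGHEST_PROTOCOL)
    return offsets


def read_line(path, offsets, line_number):
    """Seek directly to a line using the index and return its raw bytes."""
    with open(path, "rb") as f:
        f.seek(offsets[line_number])
        return f.readline()
```

Since unpickling can execute arbitrary code, an index file should only be loaded if it comes from a trusted location.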

zjijz commented 6 years ago

@carvetighter Pickling could work. Is there some property of the index structure that would let it be compressed further?
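One property that could help, assuming the index is a list of line-start byte offsets: the offsets only ever increase, so the gaps between them (the line lengths) are small numbers that compress well. The sketch below delta-encodes the offsets and runs them through `zlib`; `compress_offsets` and `decompress_offsets` are hypothetical names, and this is only one of several possible encodings.

```python
import zlib
from array import array


def compress_offsets(offsets):
    """Delta-encode increasing byte offsets and compress the result."""
    deltas = array("Q")  # unsigned 64-bit integers, native byte order
    previous = 0
    for offset in offsets:
        deltas.append(offset - previous)  # gap since the previous line start
        previous = offset
    return zlib.compress(deltas.tobytes())


def decompress_offsets(blob):
    """Reverse compress_offsets: decompress, then take a running sum."""
    deltas = array("Q")
    deltas.frombytes(zlib.decompress(blob))
    offsets = []
    total = 0
    for delta in deltas:
        total += delta
        offsets.append(total)
    return offsets
```

Because `array` uses the machine's native byte order, a blob written this way is only safe to read back on a machine with the same endianness.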

carvetighter commented 6 years ago

@zjijz I don't know about compressing the index. It was just an idea. Do you want to access the index information quickly?

I'm looking at the linereader code, and it's interesting how he counts the lines and makes every line in the index file the same length by padding it with spaces at the end. It's always hard reading someone else's code, and I don't understand why he is doing some things. For example, each line of the index file is an integer followed by a lot of spaces (e.g. '32 ...a bunch of spaces... \n'). It just seems odd to me. If you pickle the index, then you can just load it and use it easily.
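One plausible reason for the padding, not verified against linereader's source, is that fixed-width records make the index file itself randomly accessible: the entry for line n sits at byte n times the record width, so a single offset can be looked up without loading the whole index into memory. A sketch under that assumption, with a hypothetical 21-byte record layout and hypothetical function names:

```python
RECORD_WIDTH = 21  # hypothetical layout: 20 characters for the offset plus "\n"


def write_fixed_width_index(data_path, index_path):
    """Write one space-padded offset per line so every record has the same size."""
    position = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            record = str(position).ljust(RECORD_WIDTH - 1) + "\n"
            index.write(record.encode("ascii"))
            position += len(line)


def lookup_offset(index_path, line_number):
    """Read a single record from the index by seeking to its fixed position."""
    with open(index_path, "rb") as index:
        index.seek(line_number * RECORD_WIDTH)
        record = index.read(RECORD_WIDTH)
    if len(record) < RECORD_WIDTH:
        raise IndexError("line number out of range")
    return int(record.decode("ascii").strip())
```

The trade-off compared with pickling is a larger index file in exchange for lookups that never need the full index in memory.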

jegesh commented 6 years ago

A pull request would be well received. If neither of you has the time for it, maybe I can put something basic together.

zjijz commented 6 years ago

Hey, sorry about the delay. I was working on a school project that would have used this feature, but the class ended and some other workloads piled up. Is there a date you would want a version of this done by?

jegesh commented 6 years ago

You requested it, so you tell me!