ulikunitz / xz

Pure golang package for reading and writing xz-compressed files

Add ReaderAt #31

Closed · frrad closed this 3 years ago

frrad commented 4 years ago

This is the change I proposed a while ago in https://github.com/ulikunitz/xz/issues/28 to add a random access xz reader.

It's functional but not super polished yet. Would you be open to an addition like this?

ulikunitz commented 4 years ago

Thanks for the update. It may take me until the weekend to review your code.

ulikunitz commented 4 years ago

Hi Frederick,

Before I can review the code, I need to understand what you want to achieve. Apparently you want an xz decompressor that supports the ReaderAt interface. It would be helpful to understand the actual use case you want to support here. The questions are: how large are the file segments you want to read with ReadAt? Do you want to read the same file from multiple goroutines? The answers to those questions are important because they will guide the design of the solution.

The problem with Lempel-Ziv files is that they are sequential: Lempel-Ziv decompression requires the complete decompression window to be available in order to read the whole file. The only way around this is to split the compressed file stream into independent blocks, so we would need to change the Writer so that it produces those independent blocks. We would also need to decide whether the blocking should happen at the LZMA or the xz level.

I'm also thinking it would be easier to make the normal Reader an io.Seeker. Using Seek and Read, the ReadAt function would be easy to implement.
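For illustration, a rough sketch of that adapter, assuming the Reader gained a Seek method (nothing like this exists in the package today):

package sketch // illustrative only, not code from the xz package

import (
	"io"
	"sync"
)

// seekReader is the hypothetical seekable Reader.
type seekReader interface {
	io.Reader
	io.Seeker
}

// readerAt adapts Seek+Read to ReadAt. The mutex is needed because
// io.ReaderAt must be safe for concurrent use, while Seek+Read is stateful.
type readerAt struct {
	mu sync.Mutex
	r  seekReader
}

func (a *readerAt) ReadAt(p []byte, off int64) (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if _, err := a.r.Seek(off, io.SeekStart); err != nil {
		return 0, err
	}
	n, err := io.ReadFull(a.r, p)
	if err == io.ErrUnexpectedEOF {
		err = io.EOF // ReadAt reports io.EOF on short reads
	}
	return n, err
}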

Once we have agreed on the solution design, we can then agree on the changes that need to be made to the code.

Kind regards,

Ulrich

frrad commented 4 years ago

Hi Ulrich, thanks for taking the time to review. Responses inline.

Before I can review the code, I need to understand what you want to achieve. Apparently you want an xz decompressor that supports the ReaderAt interface. It would be helpful to understand the actual use case you want to support here.

My application needs access to a very large lookup table. Fortunately, the table compresses very well. I would like to be able to access the contents of this table without ever having to store the entire thing either on disk or in memory.

The questions are: how large are the file segments you want to read with ReadAt?

Since I will be creating the original compressed archive, I will be able to control this. It's a tradeoff: larger segments give a better overall compression ratio but cause slower lookups. Right now I am using 500 KiB blocks, which seems to give a still-reasonable compression ratio on my data. I haven't really dialed this in yet, though.

Do you want to read the same file from multiple goroutines? The answers to those questions are important because they will guide the design of the solution.

Right now I am not reading the same file from different goroutines; however, it would be nice to have that option in the future if it doesn't add too much complexity to the design here.

The problem with Lempel-Ziv files is that they are sequential: Lempel-Ziv decompression requires the complete decompression window to be available in order to read the whole file. The only way around this is to split the compressed file stream into independent blocks, so we would need to change the Writer so that it produces those independent blocks. We would also need to decide whether the blocking should happen at the LZMA or the xz level.

For my use case it is not important that the writer be able to create these files. I create these read-only lookup tables separately and provide them to my application. I have been creating them like this: pv index.csv | xz --block-size=500KiB -T 0 > index.csv.xz, which creates a stream of 500 KiB blocks.

I'm also thinking it would be easier to make the normal Reader an io.Seeker. Using Seek and Read, the ReadAt function would be easy to implement.

I know I suggested adding Seek to the Reader initially, but when I actually started implementing it I discovered a rather obvious problem: in order to give consumers of the decompressor the ability to do random reads, the decompressor needs the ability to do random reads on the underlying data. This means the func NewReader(xz io.Reader) (r *Reader, err error) signature for creating this object is no longer really appropriate, and it felt better to just create a new, separate type for this. The choice of ReaderAt instead of ReadSeeker for that type is debatable, of course. I chose it because it felt more natural to me, but I don't really have a strong opinion there.
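To make the shape concrete, here is roughly what I mean (a stand-alone sketch; the names and stubbed bodies are illustrative, not necessarily what this PR ends up with):

package sketch // not code from this PR, just the shape of the idea

import "io"

// ReaderAt would hold the underlying source plus a per-block index parsed
// from the xz stream index, so any block can be located and decoded on demand.
type ReaderAt struct {
	src io.ReaderAt
}

// NewReaderAt takes an io.ReaderAt rather than an io.Reader: to serve random
// reads of the uncompressed data, the decompressor needs random access to the
// compressed data as well.
func NewReaderAt(src io.ReaderAt) (*ReaderAt, error) {
	return &ReaderAt{src: src}, nil
}

// ReadAt would locate the block containing off, decompress that block, and
// copy the requested bytes into p. Stubbed out here.
func (r *ReaderAt) ReadAt(p []byte, off int64) (int, error) {
	return 0, io.EOF
}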

Once we have agreed on the solution design, we can then agree on the changes that need to be made to the code.

Hope this helps you understand my use case and that you think it's an okay fit for your library. I think it could be a useful feature for other users as well.

edit: One more thought: even if the random access feature itself is not widely useful, I think it creates the opportunity to increase decompression speed on files with multiple blocks. It should be easy to tweak this PR so that each block is decompressed in its own goroutine.
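Something along these lines (a sketch; decompressBlock is a stand-in for whatever the per-block decoding ends up being):

package sketch // illustrative only

import "sync"

// decompressBlock stands in for decoding one independent xz block; the real
// version would run the block's compressed bytes through the LZMA2 decoder.
func decompressBlock(compressed []byte) ([]byte, error) {
	return compressed, nil // placeholder
}

// decompressAll fans out one goroutine per independent block and collects
// the decompressed results in their original order.
func decompressAll(blocks [][]byte) ([][]byte, error) {
	out := make([][]byte, len(blocks))
	errs := make([]error, len(blocks))
	var wg sync.WaitGroup
	for i, b := range blocks {
		wg.Add(1)
		go func(i int, b []byte) {
			defer wg.Done()
			out[i], errs[i] = decompressBlock(b)
		}(i, b)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}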

Fred

ulikunitz commented 4 years ago

Now I have a lot of questions.

The file stores comma-separated values? Do you have an example of the content?

How do you know the offset to read?

How large is the file? How many records?

It is true that parallel decompression requires independent blocks. That is something I could look into.

frrad commented 4 years ago

The file stores comma-separated values? Do you have an example of the content?

It is just a huge CSV, ordered by one of the columns. Here's a sample:

head index.csv
DOI,ID
10.0000//kronk.spb.ru/library/bolshakov-og-1980.htm,62292786
10.0000//www.ncbi.nlm.nih.gov/pubmed/2963548,77431839
10.0000/0,66471328
10.0000/0-hera.ugr.es.adrastea.ugr.es/generic-B58BB1205BB7,34313043
10.0000/0-hera.ugr.es.adrastea.ugr.es/generic-B7222606114D,32797621
10.0000/00000000,66696625
10.0000/002,66471330
10.0000/003,66471331
10.0000/004,66471332

How do you know the offset to read?

I am basically doing binary search on offset.
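Roughly like this (a stand-alone sketch; ra is the random-access view of the decompressed CSV, and the helper names are made up):

package sketch // illustrative only

import (
	"bufio"
	"io"
	"strings"
)

// keyOf extracts the DOI column (everything before the first comma).
func keyOf(line string) string {
	if i := strings.IndexByte(line, ','); i >= 0 {
		return line[:i]
	}
	return line
}

// lineAfter returns the first complete line that starts strictly after off.
func lineAfter(ra io.ReaderAt, off, size int64) (string, bool) {
	br := bufio.NewReader(io.NewSectionReader(ra, off, size-off))
	if _, err := br.ReadString('\n'); err != nil {
		return "", false // no newline after off, so nothing to probe
	}
	l, err := br.ReadString('\n')
	if err != nil && err != io.EOF {
		return "", false
	}
	l = strings.TrimSuffix(l, "\n")
	return l, l != ""
}

// findRecord binary-searches the byte range [0, size) of the sorted CSV for
// the line whose DOI equals doi, probing the first full line after each
// midpoint and then scanning forward from the final lower bound.
func findRecord(ra io.ReaderAt, size int64, doi string) (string, bool) {
	lo, hi := int64(0), size
	for lo < hi {
		mid := lo + (hi-lo)/2
		if line, ok := lineAfter(ra, mid, size); !ok || keyOf(line) >= doi {
			hi = mid
		} else {
			lo = mid + 1
		}
	}
	br := bufio.NewReader(io.NewSectionReader(ra, lo, size-lo))
	if lo > 0 {
		br.ReadString('\n') // discard the line we already know sorts before doi
	}
	for {
		l, err := br.ReadString('\n')
		line := strings.TrimSuffix(l, "\n")
		if line != "" && keyOf(line) == doi {
			return line, true
		}
		if err != nil || (line != "" && keyOf(line) > doi) {
			return "", false
		}
	}
}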

How large is the file? How many records?

It's only ~3GB uncompressed right now, but could grow in the future.

wc index.csv
  81381898   81382677 2952600694 index.csv

ulikunitz commented 4 years ago

Thanks for the answers, I understand better now. For a local program you might be better off seeking in the uncompressed file directly, but if you want to run it in a web service or distribute the program, using the compressed file would of course be better.

frrad commented 4 years ago

Hi Ulrich,

Just wanted to check in and see how you're feeling about this change. No rush if you're still thinking about it.

ulikunitz commented 4 years ago

Frederick, the software will get a ReaderAt, and I would like to implement it together with parallel processing. I'm currently working on better LZ sequencing, including a slow but optimal sequence generator, so there are things happening that are not visible right now.

frrad commented 4 years ago

Thanks for the update! Let me know if I can help with anything.