lmmx / impscan

Command line tool to identify minimal imports list and repository sources by parsing package dependency trees
MIT License

Seek on archive streams, avoid requesting all bytes #15

Closed lmmx closed 3 years ago

lmmx commented 3 years ago

To speed up the async procedure over many archives, it would be better to seek on zip/tar.bz2 streams, requesting only the byte ranges needed, rather than downloading all the bytes and then processing them synchronously

TODO: make a plain tar (no compression) and see what this looks like (apparently it's ASCII)
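
A minimal sketch of that experiment, using only the standard library `tarfile` module. The helper names `make_plain_tar` and `read_first_header` are made up for illustration and are not part of impscan. It builds an uncompressed tar in memory and reads the first 512-byte header block, whose fields (member name, octal size, the `ustar` magic) are stored as ASCII text:

```python
import io
import tarfile


def make_plain_tar(files):
    """Build an uncompressed tar archive in memory from a name -> bytes mapping."""
    buf = io.BytesIO()
    # mode="w" means no compression; USTAR keeps the header layout simple
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_first_header(tar_bytes):
    """Parse the member name and size out of the first 512-byte header block."""
    header = tar_bytes[:512]
    # Bytes 0-99 hold the NUL-padded member name
    name = header[0:100].rstrip(b"\0").decode("ascii")
    # Bytes 124-135 hold the member size as an octal ASCII string
    size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)
    return name, size


archive = make_plain_tar({"hello.txt": b"hello world\n"})
name, size = read_first_header(archive)
print(name, size, archive[257:262])
```

Only the headers are ASCII; member contents are stored verbatim after each header, padded to a multiple of 512 bytes. Because each header states the size of the member that follows, a reader can in principle skip from header to header without fetching the contents in between.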

lmmx commented 3 years ago

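Each entry in the central directory at the end of a zip is marked by the signature (or "magic number") b"PK\x01\x02", and the directory is followed by the end of central directory record, marked by b"PK\x05\x06". For a small archive both can be retrieved by requesting the final 500 byte range (or 300 bytes to take a chance); an archive with many files has a larger central directory and would need a larger range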
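
A rough sketch of the tail-reading step, with some assumptions labelled: the tail is simulated by slicing an in-memory zip (standing in for the bytes a `Range: bytes=-500` request would return), and the helpers `make_zip` and `locate_central_directory` are hypothetical names. Rather than guessing a tail size, it reads the end of central directory record, which per the zip format states the size and offset of the central directory:

```python
import io
import struct
import zipfile

CD_SIG = b"PK\x01\x02"    # central directory file header signature
EOCD_SIG = b"PK\x05\x06"  # end of central directory record signature


def make_zip(files):
    """Build a small zip in memory (a stand-in for a remote archive)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def locate_central_directory(tail):
    """Return (cd_size, cd_offset, entry_count) from the EOCD record in tail."""
    pos = tail.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("EOCD record not found in tail; request more bytes")
    # Fixed 22-byte EOCD layout: signature, two disk numbers, two entry
    # counts, central directory size, central directory offset, comment length
    fields = struct.unpack("<4s4H2LH", tail[pos:pos + 22])
    n_total, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return cd_size, cd_offset, n_total


archive = make_zip({"pkg/__init__.py": b"", "pkg/core.py": b"x = 1\n"})
tail = archive[-500:]  # simulated range response
cd_size, cd_offset, n_entries = locate_central_directory(tail)
print(cd_size, cd_offset, n_entries)
```

If the central directory starts before the tail that was fetched, `cd_offset` and `cd_size` give the exact range for a second request, so the 500 versus 300 byte guess only decides whether one request is enough.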

Splitting on this signature and parsing what comes after it gives the various fields of each central directory entry, of which only the filename length is of interest...?
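
To make the entry layout concrete, here is a hedged sketch that walks the entries instead of splitting on the signature (splitting could misfire if the same four bytes happen to occur inside a field). `list_filenames` is a hypothetical helper, and the names are decoded as UTF-8 on the assumption that they are plain ASCII. It suggests that the extra field length and comment length are needed alongside the filename length in order to step to the next entry:

```python
import io
import struct
import zipfile

CD_SIG = b"PK\x01\x02"
# Fixed 46-byte part of a central directory file header
CD_STRUCT = struct.Struct("<4s6H3L5H2L")


def list_filenames(cd_bytes):
    """Walk central directory entries and return each entry's filename."""
    names = []
    pos = 0
    while cd_bytes[pos:pos + 4] == CD_SIG:
        fields = CD_STRUCT.unpack_from(cd_bytes, pos)
        # Indices 10-12: filename, extra field, and comment lengths
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        start = pos + CD_STRUCT.size
        # Assumption: names are ASCII/UTF-8 (older zips may use cp437)
        names.append(cd_bytes[start:start + name_len].decode("utf-8"))
        # The variable-length fields follow the fixed part back to back
        pos = start + name_len + extra_len + comment_len
    return names


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/__init__.py", b"")
    zf.writestr("pkg/core.py", b"x = 1\n")
archive = buf.getvalue()

# Good enough for this tiny demo; real code should use the offset
# recorded in the end of central directory record instead
cd_start = archive.find(CD_SIG)
names = list_filenames(archive[cd_start:])
print(names)
```

The last field of the fixed part (index 16) is the offset of the entry's local file header, which is what a follow-up range request for a single file's bytes would need.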

This is better delegated to the RangeStreams library

lmmx commented 3 years ago

Refer to notes on the structure of zip files

lmmx commented 3 years ago