jjjake / internetarchive

A Python and Command-Line Interface to Archive.org
GNU Affero General Public License v3.0

Skip early when downloading existing file #650

Closed ChlodAlejandro closed 3 months ago

ChlodAlejandro commented 3 months ago

#614 moved some skip checks until after the response headers have been received, which drastically slows down the download process if the file already exists or has a matching checksum.

Whether existing files (--ignore-existing) or matching checksums (--checksum) should be skipped, the file name, and the checksum are all known before any information from the Last-Modified header is needed. These checks should therefore remain at the start, which avoids making a request whose response would be discarded anyway. This speeds up ia download by skipping the (relatively long-running) blocking HTTP request, and it also stops the script from making numerous wasted requests to the Internet Archive, especially for runs that cover hundreds of files.
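To illustrate the ordering, here is a minimal sketch of an early skip check that runs before any HTTP request is made. It is not the library's actual code: `should_skip_before_request` and `local_md5` are hypothetical helper names, and the sketch assumes the expected MD5 is already available locally from the item's file metadata.

```python
import hashlib
import os
from typing import Optional


def local_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_skip_before_request(
    path: str,
    expected_md5: Optional[str],
    ignore_existing: bool = False,
    checksum: bool = False,
) -> bool:
    """Hypothetical helper: decide whether a download can be skipped
    using only information that is known before contacting the server."""
    if not os.path.exists(path):
        # Nothing on disk yet, so the download has to happen.
        return False
    if ignore_existing:
        # Mirrors the intent of --ignore-existing: any existing file is skipped.
        return True
    if checksum and expected_md5 is not None:
        # Mirrors the intent of --checksum: skip only when the digests match.
        return local_md5(path) == expected_md5
    # Remaining cases (for example a Last-Modified comparison) need the response.
    return False
```

A caller would run this check first and only open the HTTP request when it returns `False`, so the header-dependent checks are reached only for files that actually need them.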

jjjake commented 3 months ago

Looks good, thank you @ChlodAlejandro!