meta-llama / llama-models

Utilities intended for use with Llama models.

Infinite file growth when downloading checkpoints in chunks #129

Open Mooon opened 2 months ago

Mooon commented 2 months ago

I’m encountering an issue with the download script where it enters an infinite loop during the chunking process, resulting in files that grow indefinitely and the download never completes.

This happens when downloading large models such as the 405B-MP16 version, where each checkpoint (consolidated.XX.pth) is downloaded in chunks. Expected behavior: the script downloads each chunk, concatenates the chunks, and finishes. Actual behavior: the script keeps downloading chunks without ever finishing, so the files grow without bound.
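For reference, here is a minimal sketch of a chunk loop that is guaranteed to terminate. It is not the logic in the current download script (I have not pinned down where that loop goes wrong); `chunk_ranges` and its two arguments are hypothetical names used only for illustration. It prints one inclusive byte range per chunk and stops once the ranges cover the whole file.

```shell
# Hypothetical helper: print one "start-end" byte range per line for a file
# of total_size bytes, split into chunks of at most chunk_size bytes.
chunk_ranges() {
    local total_size=$1
    local chunk_size=$2
    local start=0
    local end

    # A non-positive chunk size would never advance, so reject it up front.
    if [ "$chunk_size" -le 0 ]; then
        echo "chunk_size must be positive" >&2
        return 1
    fi

    # start strictly increases on every pass and the loop stops at
    # total_size, so the number of iterations is bounded.
    while [ "$start" -lt "$total_size" ]; do
        end=$(( start + chunk_size - 1 ))
        if [ "$end" -ge "$total_size" ]; then
            end=$(( total_size - 1 ))   # clamp the last chunk to the file end
        fi
        printf '%d-%d\n' "$start" "$end"
        start=$(( end + 1 ))
    done
}
```

For example, `chunk_ranges 10 4` prints `0-3`, `4-7`, and `8-9`. Each range could then be sent as an HTTP range request (for instance via a `Range: bytes=...` header), assuming the server honors range requests on the presigned URL.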

Potential fix: I was able to work around the issue by simplifying the process. Instead of downloading each consolidated.XX.pth file in chunks, I modified the script to download each file directly, without splitting it into chunks. Given that each checkpoint file is up to 48GB in size, this approach is manageable on systems with sufficient resources.

To implement this fix, set the variable PTH_FILE_CHUNK_COUNT=0. Additionally, I parallelized the downloads of the checkpoint files, which reduces the overall download time and simplifies the script.

Modified Script:

if [[ $PTH_FILE_COUNT -ge 0 ]]; then
    for s in $(seq -f "%02g" 0 "${PTH_FILE_COUNT}"); do
        (
            printf "Downloading consolidated.%s.pth\n" "${s}"
            # Build the URL and output path once, then pass them quoted to wget
            url=${PRESIGNED_URL/'*'/"${MODEL_PATH}/consolidated.${s}.pth"}
            out="${TARGET_FOLDER}/${MODEL_PATH}/consolidated.${s}.pth"
            wget --continue "${url}" -O "${out}"
        ) &   # one background subshell per checkpoint file
    done

    # Wait for all file downloads to complete
    wait
fi

I recognize that this solution may not be suitable for all users, particularly those on systems with limited resources. For this reason, I am opening this issue to discuss alternative solutions or additional options for users with different system capabilities.
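One possible option for constrained systems is to cap how many checkpoint files download at once instead of starting all of them together. The sketch below is only an illustration: `download_shards`, `fetch_cmd`, and `max_jobs` are hypothetical names, and it runs downloads in simple batches, which is less efficient than a true job pool but works in any POSIX-style shell.

```shell
# Hypothetical helper: call "$fetch_cmd <shard>" for shards 00..last_shard,
# with at most max_jobs running at the same time (batch by batch).
download_shards() {
    local fetch_cmd=$1
    local last_shard=$2
    local max_jobs=$3
    local running=0
    local i=0
    local s

    while [ "$i" -le "$last_shard" ]; do
        s=$(printf '%02d' "$i")          # zero-padded shard id, e.g. 07
        "$fetch_cmd" "$s" &              # start one download in the background
        running=$(( running + 1 ))
        if [ "$running" -ge "$max_jobs" ]; then
            wait                         # let the current batch finish
            running=0
        fi
        i=$(( i + 1 ))
    done
    wait                                 # wait for the final partial batch
}

# Example (not run here), reusing the variables from the modified script:
#   fetch_one() {
#       url=${PRESIGNED_URL/'*'/"${MODEL_PATH}/consolidated.$1.pth"}
#       wget --continue "$url" -O "${TARGET_FOLDER}/${MODEL_PATH}/consolidated.$1.pth"
#   }
#   download_shards fetch_one "$PTH_FILE_COUNT" 4
```

Note that a bare `wait` waits for every background job of the shell and discards their exit codes, so a real version would probably want to track failures per file.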