Closed philippta closed 11 months ago
Being able to have logic to put different things in different paths based on file type, or generate unique names will be very nice.
As noted, parallel is also important. This is the Python pattern I use (extra janky because it is running in Jupyter) -
import multiprocess  # third-party, dill-based fork of multiprocessing (pip install multiprocess)

def initializer():
    import pathlib as plib
    import requests as reqs
    global pathlib
    pathlib = plib
    global requests
    requests = reqs

def fetch_url(target):
    url, path = target
    if not path.exists():
        resp = requests.get(url)
        if resp.status_code == 200:
            path.write_bytes(resp.content)
        else:
            return ('FAILED', path)
    else:
        return ('SKIPPED', path)
    return ('OK', path)

# targets = ...
with multiprocess.Pool(16, initializer) as p:
    out = p.map(fetch_url, targets)
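For reference, here is one way the targets list for this pattern could be assembled from a plain url-list.txt. This is a sketch; the file name, output directory, and the rule for inferring file names from the URL are assumptions, not part of the original comment.

```python
import pathlib

def build_targets(list_file="url-list.txt", out_dir="downloads"):
    # Hypothetical layout: url-list.txt holds one URL per line,
    # and downloaded files land in ./downloads/.
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    targets = []
    for line in pathlib.Path(list_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # Infer a file name from the last path segment of the URL.
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
        targets.append((url, out / name))
    return targets
```

Each entry is a (url, pathlib.Path) tuple, matching what fetch_url above unpacks.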
It would be so convenient to do something like flyscrape --download url-list.txt --workers 16.
import { download } from "flyscrape";

export const config = {
  url: "https://news.ycombinator.com/",
  "download-workers": 8,
  "scrape-workers": 8,
};

export default function ({ doc }) {
  const url = doc.find(".download-link").attr("href");
  download(url, "./downloads/file.bin");
  // or
  download(url, "./downloads/"); // File name is inferred from URL or Content-Disposition header.
}
It would be so convenient to do something like flyscrape --download url-list.txt --workers 16.
I'm not quite sure I understand what you are trying to accomplish. Does the url-list.txt in your hypothetical example contain URLs to files? If so, I'm sure this could be a job for wget -i url-list.txt. Otherwise, do you mind elaborating on this?
I don't believe wget has parallelization built in. I download (and scrape) using 8 or 16 threads in parallel. LAION's image-embeddings dataset alone is 409 files of 1 GB each, and downloading those serially is a pain.
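Since downloads like these are I/O-bound rather than CPU-bound, a standard-library thread pool gets the same parallelism without the multiprocess initializer dance. A minimal sketch, assuming the same (url, path) target tuples as the pattern above, with urllib.request standing in for requests:

```python
import concurrent.futures
import urllib.request

def fetch(target):
    # target is a (url, pathlib.Path) tuple, as in the pool example above.
    url, path = target
    if path.exists():
        return ("SKIPPED", path)
    try:
        with urllib.request.urlopen(url) as resp:
            path.write_bytes(resp.read())
    except OSError:
        return ("FAILED", path)
    return ("OK", path)

def fetch_all(targets, workers=16):
    # Threads share the interpreter, so no per-worker initializer is needed.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, targets))
```

Threads avoid the pickling issues that make multiprocessing awkward in Jupyter in the first place.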
File downloads should be supported as a built-in JavaScript function.
Proposed example:
TBD: Should download be part of the http object from "flyscrape/http" instead? Ref: 5