Closed philippta closed 11 months ago
Being able to have logic to put different things in different paths based on file type, or generate unique names will be very nice.
As noted, parallel is also important. This is the Python pattern I use (extra janky because it is running in Jupyter) -
import multiprocess  # third-party, dill-based fork of multiprocessing (pip install multiprocess)

def initializer():
    import pathlib as plib
    import requests as reqs
    global pathlib
    pathlib = plib
    global requests
    requests = reqs

def fetch_url(target):
    url, path = target
    if not path.exists():
        resp = requests.get(url)
        if resp.status_code == 200:
            path.write_bytes(resp.content)
        else:
            return ('FAILED', path)
    else:
        return ('SKIPPED', path)
    return ('OK', path)

# targets = ...
with multiprocess.Pool(16, initializer) as p:
    out = p.map(fetch_url, targets)
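For reference, here is one way the targets list for this pattern could be assembled from a plain url-list.txt. This is a sketch; the file name, output directory, and the rule for inferring file names from the URL are assumptions, not part of the original comment.

```python
import pathlib

def build_targets(list_file="url-list.txt", out_dir="downloads"):
    # Hypothetical layout: url-list.txt holds one URL per line,
    # and downloaded files land in ./downloads/.
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    targets = []
    for line in pathlib.Path(list_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # Infer a file name from the last path segment of the URL.
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
        targets.append((url, out / name))
    return targets
```

Each entry is a (url, pathlib.Path) tuple, matching what fetch_url above unpacks.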
It would be so convenient to do something like flyscrape --download url-list.txt --workers 16.
import { download } from "flyscrape";

export const config = {
  url: "https://news.ycombinator.com/",
  "download-workers": 8,
  "scrape-workers": 8,
};

export default function ({ doc }) {
  const url = doc.find(".download-link").attr("href");
  download(url, "./downloads/file.bin");
  // or
  download(url, "./downloads/"); // File name is inferred from URL or Content-Disposition header.
}
It would be so convenient to do something like flyscrape --download url-list.txt --workers 16.
I'm not quite sure I understand what you are trying to accomplish. Does the url-list.txt in your hypothetical example contain URLs to files? If so, I'm sure this could be a job for wget -i url-list.txt. Otherwise, do you mind elaborating on this?
I don't believe wget has parallelization built in. I download (and scrape) using 8 or 16 threads in parallel. LAION's image-embeddings dataset alone is 409 files of 1 GB each, and downloading those serially is a pain.
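Since downloads like these are I/O-bound rather than CPU-bound, a standard-library thread pool gets the same parallelism without the multiprocess initializer dance. A minimal sketch, assuming the same (url, path) target tuples as the pattern above, with urllib.request standing in for requests:

```python
import concurrent.futures
import urllib.request

def fetch(target):
    # target is a (url, pathlib.Path) tuple, as in the pool example above.
    url, path = target
    if path.exists():
        return ("SKIPPED", path)
    try:
        with urllib.request.urlopen(url) as resp:
            path.write_bytes(resp.read())
    except OSError:
        return ("FAILED", path)
    return ("OK", path)

def fetch_all(targets, workers=16):
    # Threads share the interpreter, so no per-worker initializer is needed.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, targets))
```

Threads avoid the pickling issues that make multiprocessing awkward in Jupyter in the first place.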
File downloads should be supported as a built-in JavaScript function.
Proposed example:
TBD: Should download be part of the http object from "flyscrape/http" instead? Ref: 5