Closed sissbruecker closed 3 months ago
Can also reproduce it with the Deno build, so maybe not an issue with the Deno polyfill.
Thank you, I fixed the issue by checking the number of opened tabs just before closing the one used to capture a page.
Thanks for looking into it! I just pulled the latest changes, however I still see the problem. I'm using:
```
concurrently "node single-file-node.js https://example.com 1.html" "node single-file-node.js https://www.zemismart.com/products/mt25b 2.html"
```
Meanwhile I was looking into simple-cdp
and was wondering if it would be a problem that two Chromium instances are started with the same debugging port.
Indeed, I confirm the app has not been designed for this. You're supposed to pass a list of URLs via a file.
OK thanks, I'll then adapt the task queue to only run one single-file task at a time.
FYI, I was referring to --urls-file
Does that process URLs in parallel? If so, that would be interesting. I took a brief look at the docs, but I couldn't figure out whether I can configure --filename-template
so that I can correlate the generated files with the URLs that were written into the URLs file.
Edit: Looking at runTasks it seems to schedule multiple tasks at once.
It should be able to process multiple tabs in parallel, cf. --max-parallel-workers.
```
single-file --urls-file=urls.txt --filename-template={url-href-flat}.html
```
should work for getting the URLs back, plus handling some edge cases where the filename gets truncated.
Yes, this is your best option if you want the URL in the filename. Note that you can also find it in the comment inserted by SingleFile in the saved page (after the `<html>` tag).
This issue is now really fixed in version v2.0.30. It was related to the debugging port of Chrome.
Thanks! I'll try to do some testing for this soonish and see if I can make it work with the task queue.
@sissbruecker if it helps you, this is the code I'm using to achieve the same (I took some inspiration from linkding):
```python
import os
import re
import subprocess

from fastapi import APIRouter
from pydantic import BaseModel

# SINGLEFILE_PATH, SINGLEFILE_OPTIONS, SINGLEFILE_TIMEOUT_SEC and upload()
# come from the surrounding app's configuration.
router = APIRouter()


class Item(BaseModel):
    url: str
    id: str


class Payload(BaseModel):
    items: list[Item]


def match_url(urls, text):
    pattern = r"\b(?:{})\b".format("|".join(re.escape(url) for url in urls))
    match = re.search(pattern, text)
    return match.group(0) if match else None


@router.post("/download")
async def download(payload: Payload):
    url_file_path = "/tmp/url_file"
    output_path = "/tmp/outputs"
    url_param = [
        f"--urls-file={url_file_path}",
        f"--output-directory={output_path}",
        "--filename-template={url-href-flat}.html",
    ]
    args = [SINGLEFILE_PATH] + SINGLEFILE_OPTIONS + url_param
    urls = {item.url: item.id for item in payload.items}
    # write url_file
    with open(url_file_path, "w") as file:
        file.write("\n".join(urls.keys()))
    # ensure output dir exists and is empty
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    for file_name in os.listdir(output_path):
        os.unlink(os.path.join(output_path, file_name))
    try:
        process = subprocess.Popen(args, start_new_session=True)
        process.wait(timeout=SINGLEFILE_TIMEOUT_SEC)
    except subprocess.TimeoutExpired:
        # kill a stalled capture so workers don't hang forever
        process.kill()
        raise
    processed = []
    for file_name in os.listdir(output_path):
        file_path = os.path.join(output_path, file_name)
        with open(file_path, "rb") as file:
            first_five_lines = b"".join(file.readlines()[:5]).decode("utf-8")
            url = match_url(urls.keys(), first_five_lines)
            if not url:
                continue
            file.seek(0)
            upload(file.read(), f"{urls[url]}.html")  # do whatever with the file now
        processed.append(url)  # you can use this after the loop to validate if all urls have been processed
        os.remove(file_path)
```
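To act on the `processed` list mentioned in the comment above, a set difference is enough to report which requested URLs never produced a saved page. A tiny sketch (the helper name is mine, not from the snippet):

```python
def missing_urls(requested, processed):
    """Return the requested URLs that never showed up in the output dir."""
    return sorted(set(requested) - set(processed))
```

After the loop, `missing_urls(urls.keys(), processed)` gives the URLs to retry or flag in the response.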
I'm using single-file-cli (with Node) in a task queue with multiple workers, where multiple instances of single-file can be spawned in parallel. For example:
(Using concurrently because I'm not sure a reproduction written in plain bash would be correct otherwise.)
The command above sometimes succeeds in saving both pages, but in most cases only the example.com URL is saved, and the process for the other URL just stalls indefinitely.
I did some debugging, and it seems that the call to `closeBrowser` in the first process can cause the other process to stall here: https://github.com/gildas-lormeau/single-file-cli/blob/95404bc5c3cabc99f1cccc2a9ad3346da53bc02f/lib/cdp-client.js#L105-L110

The issue seems to go away if I introduce a timeout into `closeBrowser`, so the second URL can finish capturing.

Edit: Updated the first URL to a more minimal page that may make the issue easier to reproduce.