gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

Running multiple instances of single-file in parallel can cause one process to stall #78

Closed sissbruecker closed 3 months ago

sissbruecker commented 3 months ago

I'm using single-file-cli (with Node) in a task queue with multiple workers, where multiple instances of single-file can be spawned in parallel. For example:

npm i -g single-file-cli concurrently

concurrently "single-file https://example.com" "single-file https://www.zemismart.com/products/mt25b 2.html"

(Using concurrently because I'm not sure a plain bash reproduction would be reliable otherwise.)

The command above sometimes succeeds in saving both pages, but in most cases only the example.com URL is saved, and the process for the other URL stalls indefinitely.

I did some debugging, and it seems that the call to closeBrowser of the first process can cause the other process to stall here: https://github.com/gildas-lormeau/single-file-cli/blob/95404bc5c3cabc99f1cccc2a9ad3346da53bc02f/lib/cdp-client.js#L105-L110

The issue seems to go away if I introduce a timeout into closeBrowser, so the second URL can finish capturing:

async function closeBrowser() {
+   await new Promise(resolve => setTimeout(resolve, 5000));
    if (child !== undefined) {
        child.kill();
        await child.status;
        child = undefined;
    }
    ...
}

Edit: Updated the first URL to a more minimal page that may make the issue easier to reproduce.

sissbruecker commented 3 months ago

I can also reproduce it with the Deno build, so maybe it's not an issue with the Deno polyfill.

gildas-lormeau commented 3 months ago

Thank you, I fixed the issue by checking the number of opened tabs just before closing the one used to capture a page.

sissbruecker commented 3 months ago

Thanks for looking into it! I just pulled the latest changes, but I still see the problem. I'm using:

concurrently "node single-file-node.js https://example.com 1.html" "node single-file-node.js https://www.zemismart.com/products/mt25b 2.html"

Meanwhile, I was looking into simple-cdp and was wondering if it would be a problem that two Chromium instances are started with the same debugging port.
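
If that's the cause, a workaround in the meantime might be to give each instance its own port. Assuming the build exposes a flag along the lines of --browser-remote-debugging-port (a guess on my part, worth verifying with --help):

concurrently "node single-file-node.js --browser-remote-debugging-port=9222 https://example.com 1.html" "node single-file-node.js --browser-remote-debugging-port=9223 https://www.zemismart.com/products/mt25b 2.html"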

gildas-lormeau commented 3 months ago

Indeed, I confirm the app has not been designed for this. You're supposed to pass a list of URLs via a file.

sissbruecker commented 3 months ago

OK thanks, I'll then adapt the task queue to only run one single-file task at a time.

gildas-lormeau commented 3 months ago

FYI, I was referring to --urls-file
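
For example, with one URL per line in a plain text file (a minimal sketch based on the commands above):

printf 'https://example.com\nhttps://www.zemismart.com/products/mt25b\n' > urls.txt
single-file --urls-file=urls.txt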

sissbruecker commented 3 months ago

Does that process URLs in parallel? If so, that would be interesting. I did take a brief look at the docs, but I couldn't figure out whether I can configure --filename-template so that I can correlate the generated files with the URLs written into the URLs file.

Edit: Looking at runTasks, it seems to schedule multiple tasks at once.

gildas-lormeau commented 3 months ago

It should be able to process multiple tabs in parallel, cf. --max-parallel-workers.
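
For example, combining it with the URLs file above (a sketch; the worker count here is an arbitrary choice):

single-file --urls-file=urls.txt --max-parallel-workers=4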

sissbruecker commented 3 months ago

single-file --urls-file=urls.txt --filename-template={url-href-flat}.html should work for getting the URLs back, plus handling some edge-cases where the filename gets truncated.

gildas-lormeau commented 3 months ago

Yes, this is your best option if you want the URL in the filename. Note that you can also find it in the comment inserted by SingleFile in the saved page (after the <html> tag).
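
A small sketch of that approach, assuming the saved page carries the usual SingleFile metadata comment with a "url: <page url>" line (verify against your own saved files):

import re

def extract_saved_url(path):
    # the SingleFile metadata comment sits right after the <html> tag,
    # so reading the first few KB of the file is enough
    with open(path, encoding="utf-8", errors="replace") as file:
        head = file.read(4096)
    match = re.search(r"^\s*url:\s*(\S+)", head, re.MULTILINE)
    return match.group(1) if match else None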

gildas-lormeau commented 2 months ago

This issue is now properly fixed in version v2.0.30. It was related to the debugging port of Chrome.

sissbruecker commented 2 months ago

Thanks! I'll try to do some testing for this soonish and see if I can make it work with the task queue.

masylum commented 1 month ago

@sissbruecker if it helps you, this is the code I'm using to achieve the same thing (I took some inspiration from linkding):

import os
import re
import signal
import subprocess

from fastapi import APIRouter
from pydantic import BaseModel

# SINGLEFILE_PATH, SINGLEFILE_OPTIONS, SINGLEFILE_TIMEOUT_SEC and the
# upload() helper are defined elsewhere in my app
router = APIRouter()

class Item(BaseModel):
    url: str
    id: str

class Payload(BaseModel):
    items: list[Item]

def match_url(urls, text):
    # return the first of our URLs that appears verbatim in the text
    pattern = r"\b(?:{})\b".format("|".join(re.escape(url) for url in urls))
    match = re.search(pattern, text)
    return match.group(0) if match else None

@router.post("/download")
async def download(payload: Payload):
    url_file_path = "/tmp/url_file"
    output_path = "/tmp/outputs"
    url_param = [
        f"--urls-file={url_file_path}",
        f"--output-directory={output_path}",
        "--filename-template={url-href-flat}.html",
    ]
    args = [SINGLEFILE_PATH] + SINGLEFILE_OPTIONS + url_param
    urls = {item.url: item.id for item in payload.items}

    # write url_file
    with open(url_file_path, "w") as file:
        file.write("\n".join(urls.keys()))

    # ensure the output dir exists and is empty
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    for file_name in os.listdir(output_path):
        file_path = os.path.join(output_path, file_name)
        os.unlink(file_path)

    try:
        process = subprocess.Popen(args, start_new_session=True)
        process.wait(timeout=SINGLEFILE_TIMEOUT_SEC)

        processed = []

        for file_name in os.listdir(output_path):
            file_path = os.path.join(output_path, file_name)
            with open(file_path, "rb") as file:
                # the SingleFile metadata comment containing the URL sits
                # at the top of the saved page
                lines = file.readlines()[:5]
                first_five_lines = b"".join(lines).decode("utf-8")
                url = match_url(urls.keys(), first_five_lines)

                if not url:
                    # unknown file; it will be removed by the cleanup at
                    # the start of the next request
                    continue

                file.seek(0)
                upload(file.read(), f"{urls[url]}.html") # do whatever with the file now
                processed.append(url) # you can use this after the loop to validate if all urls have been processed

            os.remove(file_path)

        return processed
    except subprocess.TimeoutExpired:
        # assumed handling for the part truncated above: start_new_session=True
        # put single-file in its own process group, so kill the whole group to
        # take the spawned Chrome processes down with it
        os.killpg(os.getpgid(process.pid), signal.SIGKILL)
        raise
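
start_new_session=True puts single-file in its own process group, which is what lets the timeout handler kill the spawned Chrome processes along with it rather than leaving them behind.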