platonai / PulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed AI-RPA.
GNU Affero General Public License v3.0
754 stars 116 forks

Browser resources are not released #77

Open suntsao opened 1 month ago

suntsao commented 1 month ago

Environment

Key business code

PulsarSession session = PulsarContexts.createSession();
LoadOptions options = session.options("-parse -refresh");
// Run the site checks once each document is actually ready in the browser.
options.getEvent().getBrowseEventHandlers().getOnDocumentActuallyReady().addLast((page, driver, other) -> {
    FeaturedDocument loadDocument = session.parse(page);
    CheckStatus checkStatus = CheckStatus.SUCCESS;
    // Stop at the first handler that does not report SUCCESS.
    for (SiteStatusCheckHandle handler : handlers) {
        checkStatus = handler.process(page, loadDocument);
        if (CheckStatus.SUCCESS != checkStatus) {
            break;
        }
    }

    // println rather than printf: a '%' in the URL would otherwise be read as a format specifier.
    System.out.println(page.getUrl() + " -> " + checkStatus);
    siteCheckService.handleCheckResultForUrl(page.getUrl(), checkStatus);
    return loadDocument;
});

// Load every site URL in one bulk asynchronous call, then wait and close the session.
List<String> urls = this.siteMapper.getAllSite().stream().map(siteModel -> siteModel.getUrl()).toList();
session.loadAllAsync(urls, options);
session.getContext().await();
session.close();

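Problem description

The server has 12 GB of memory, and there are only 1,600 URLs in total to crawl and check. Around the 1,000-URL mark the job can no longer run because memory is exhausted. From what I can tell, the browser resources are not being released.

Is there something wrong with the way I wrote this? Am I failing to release resources correctly?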

platonai commented 3 weeks ago

Use session.submit() instead of session.loadAllAsync().
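
A minimal sketch of what that change might look like, written against a hypothetical stand-in interface so that it runs without PulsarRPA on the classpath. UrlSubmitter, SubmitSketch and submitEach are illustrative names, not PulsarRPA types, and the submit(url, args) shape is an assumption here; check the real PulsarSession.submit() overloads before copying this. The idea is to hand URLs over one at a time rather than starting every load through a single loadAllAsync() call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the session calls used in this thread; NOT the real PulsarRPA API. */
interface UrlSubmitter {
    void submit(String url, String args);
    void await();
}

class SubmitSketch {
    /**
     * Submits each non-blank URL individually, then waits for completion.
     * Returns the number of URLs actually submitted.
     */
    static int submitEach(UrlSubmitter session, List<String> urls, String args) {
        int submitted = 0;
        for (String url : urls) {
            if (url == null || url.isBlank()) {
                continue; // skip empty entries coming from the site table
            }
            session.submit(url, args);
            submitted++;
        }
        session.await();
        return submitted;
    }

    public static void main(String[] args) {
        // A recording implementation used only to demonstrate the call pattern.
        List<String> seen = new ArrayList<>();
        UrlSubmitter recorder = new UrlSubmitter() {
            public void submit(String url, String a) { seen.add(url); }
            public void await() { /* nothing to wait for in this sketch */ }
        };
        int n = submitEach(recorder,
                List.of("https://example.com/a", "", "https://example.com/b"),
                "-parse -refresh");
        System.out.println(n + " URLs submitted: " + seen);
    }
}
```

In the original snippet, the equivalent change would be to replace the session.loadAllAsync(urls, options) line with a loop that calls session.submit() for each URL, keeping the session.getContext().await() and session.close() calls after it.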