monatis / lmm.cpp

Inference of Large Multimodal Models in C/C++. LLaVA and others
MIT License

Add autoScroll for crawling multiple pages? #4

Closed SOSONAGI closed 10 months ago

SOSONAGI commented 10 months ago

I ran the original code against our platform's pages, and it did not capture the full content of each page.

So I added an autoScroll function to main.ts, and it worked perfectly. (I think this is better than increasing the value of waitForSelectorTimeout.)

// Scroll down in 100 px steps every 100 ms until the bottom of the page is reached.
async function autoScroll(page: Page) {
  await page.evaluate(async () => {
    await new Promise<void>((resolve) => {
      let totalHeight = 0;
      const distance = 100; // pixels per step
      const timer = setInterval(() => {
        // Re-read the height on every tick because it can grow as content loads.
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}

if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl, 
          };
          await page.context().addCookies([cookie]);
        }

        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);

        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });

        // Scroll through the page so lazily loaded content is in the DOM before capturing the HTML.
        await autoScroll(page);

        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });

        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }

        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });

  await crawler.run([config.url]);
}
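
For anyone who wants to try the stop condition without a browser, here is a minimal sketch of the same loop written synchronously against a small interface. The `Scrollable` interface, `scrollToBottom`, `makeFakePage`, and the `maxSteps` cap are hypothetical names introduced only for illustration and are not part of main.ts. The cap is an assumption on my side: a page that keeps growing while you scroll would otherwise never satisfy `totalHeight >= scrollHeight`.

```typescript
// Hypothetical abstraction over the page, so the loop can run outside a browser.
interface Scrollable {
  scrollHeight(): number;
  scrollBy(distance: number): void;
}

// Same stop condition as autoScroll: read the height, scroll one step,
// and stop once the distance scrolled reaches the height that was read.
// maxSteps is an added safety cap (an assumption, not in the original code).
function scrollToBottom(
  target: Scrollable,
  distance = 100,
  maxSteps = 1000,
): number {
  let totalHeight = 0;
  let steps = 0;
  while (steps < maxSteps) {
    const height = target.scrollHeight();
    target.scrollBy(distance);
    totalHeight += distance;
    steps += 1;
    if (totalHeight >= height) {
      break;
    }
  }
  return steps;
}

// Fake page that grows once, imitating content that loads lazily on scroll.
function makeFakePage(
  initial: number,
  grownTo: number,
  growAt: number,
): Scrollable {
  let height = initial;
  let position = 0;
  return {
    scrollHeight: () => height,
    scrollBy: (distance: number): void => {
      position += distance;
      if (position >= growAt) {
        height = grownTo;
      }
    },
  };
}
```

Calling `scrollToBottom(makeFakePage(300, 500, 200))` keeps scrolling past the initial 300 px because the height is re-read on every step, which is the behaviour that makes autoScroll pick up lazily loaded content.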

If you think this is good enough for crawling, I hope it will be helpful to other users.

Thank you for your work, by the way!

I really appreciate it!

Thank you.

SOSONAGI commented 10 months ago

Sorry, I posted this in the wrong place. Thank you for your understanding, and I really appreciate the C++ code, which works great on my Mac! Thank you!