unclecode / crawl4ai

🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
https://crawl4ai.com
Apache License 2.0
17.01k stars 1.26k forks

Extracting data from an iframe? #99

Closed ehubb20 closed 1 month ago

ehubb20 commented 2 months ago

What is the best method for extracting data from an iframe using Crawl4ai?

Here is an example of the iframe I am trying to capture:

<div class="list-items new_properties_scroll"><ul><li><div class="list-item-des"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&amp;companyID=13160&amp;source=iframe" target="_blank"></a><div class="container-fluid" style="max-width:1399px;"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&amp;companyID=13160&amp;source=iframe" target="_blank"></a><div class="row item"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&amp;companyID=13160&amp;source=iframe" target="_blank"><div class="col-md-2 col-sm-2"><div style="background-image: url(https://s3.amazonaws.com/Rently_dev/images/51453851/medium);"></div></div><div class="col-md-4 col-sm-4 col-xs-4 basic-info"><div class="price priceWithTooltip"><h2><span class="amount">$1757</span><span class="unit"> / month</span></h2></div><div class="available-date"><h2>Available: Now</h2></div><span class="mini-address">231 Crestview Way, Dallas, GA, 30132, Un...</span><div class="info"><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/bed.svg"><span><strong style="font-size: 1.5em;">3</strong> Bed(s)</span></div><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/shower-head.svg"><span><strong style="font-size: 1.5em;">2.5</strong> Bath(s)</span></div><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/cat_dog.svg"><span style="line-height: 2.2;"> Cat + Dog</span></div><div class="col-md-6 col-sm-3 col-xs-3" style="line-height: 30px;"><img class="center" src="/assets/sq_ft.svg"><span>1530 Sq ft</span></div>
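Independent of crawl4ai, the fields in a snippet like the one above (price, address) can be pulled out with Python's stdlib HTML parser once you have the iframe's markup. This is a minimal sketch, not crawl4ai's API; the `SAMPLE` string is a trimmed copy of the pasted markup and the class names (`amount`, `mini-address`) come from it.

```python
from html.parser import HTMLParser

# Trimmed copy of the listing markup pasted above (illustrative only).
SAMPLE = (
    '<div class="price"><h2><span class="amount">$1757</span>'
    '<span class="unit"> / month</span></h2></div>'
    '<span class="mini-address">231 Crestview Way, Dallas, GA, 30132, Un...</span>'
)

class ListingParser(HTMLParser):
    """Collects the text of spans whose class matches a field we care about."""
    FIELDS = {"amount", "mini-address"}

    def __init__(self):
        super().__init__()
        self.current = None  # field name while inside a matching <span>
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class", "")
            if cls in self.FIELDS:
                self.current = cls

    def handle_data(self, data):
        if self.current:
            self.data[self.current] = data
            self.current = None

parser = ListingParser()
parser.feed(SAMPLE)
print(parser.data)  # prints the two extracted fields
```

The same idea scales to the bed/bath/sqft fields by matching on the surrounding class names instead.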

unclecode commented 2 months ago

@ehubb20 Let me check and update you.

b-sai commented 1 month ago

Hey @unclecode any update on this? I too am trying to figure out how to parse iframe content

shhivam commented 1 month ago

Any update on this?

unclecode commented 1 month ago

Hello everyone @ehubb20 @b-sai @shhivam, sorry for the late reply. We've been busy adding a lot of new features, and one of them is extracting exactly this kind of content from iframes. It's still early days, so it will ship with the new version 0.3.6, which we plan to release by tomorrow. I definitely expect some bugs, so please try it and report any issues you come across, and we can fix them right away.

It currently extracts the content of each iframe's "body" and replaces the iframe with a div element in the main page, making that content part of the main page. You can think of it as flattening: what we extract is the body content of the iframe. We plan to add more options and parameters for controlling this extraction.
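The flattening described above can be sketched in plain Python, independent of crawl4ai's actual implementation: the iframe element in the parent page is swapped for a div holding the iframe body's content. The `page` and `iframe_body` strings here are hypothetical stand-ins.

```python
import re

# Hypothetical parent page containing an iframe (stand-in for a real crawl).
page = '<p>Listings:</p><iframe src="https://homes.rently.com/embed"></iframe>'
# Hypothetical body content fetched from inside that iframe.
iframe_body = '<ul><li>231 Crestview Way - $1757 / month</li></ul>'

# Replace the <iframe> element with a <div> wrapping the iframe's body,
# so the content becomes part of the main page.
flattened = re.sub(
    r"<iframe\b[^>]*>.*?</iframe>",
    f'<div class="iframe-content">{iframe_body}</div>',
    page,
    flags=re.DOTALL,
)
print(flattened)
```

After this step the listing text is part of the main document, so any downstream extraction (markdown conversion, CSS selectors) sees it directly.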

Btw, even without this feature, when you crawl a page you get all its internal/external links, and you can then crawl those links (including iframe sources) in a second pass. That already covers a lot of cases.

Anyway, I've shared a sample of the code with you here. Once we release the update, you'll be able to use it. I'd appreciate it if you could let us know about any bugs or issues you encounter.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        url = "https://zcgwq2-5000.csb.app"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,      # fetch fresh content, skip the cache
            process_iframes=True    # inline each iframe's body into the page
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

I'll keep the issue open in case you run into any errors.

shhivam commented 1 month ago

Thanks @unclecode for the clarification!

unclecode commented 1 month ago

@shhivam The iframe extraction is already available, please check:

import asyncio
from crawl4ai import AsyncWebCrawler

async def test_iframe():
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        url = "URL-HERE"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            process_iframes=True
        )
        print(result.markdown)

asyncio.run(test_iframe())