unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Add Google Vertex AI (i.e. Gemini) in PROVIDER_MODELS of config.py #112

Closed (huibrian closed this issue 1 month ago)

huibrian commented 1 month ago

Thank you for the great work; it is outstanding! I previously used Google Vertex AI (i.e. Gemini) to do something similar, but this repository is far better than mine.

It would be great if you could add Google Vertex AI (i.e. Gemini) to PROVIDER_MODELS in config.py (and add some code to handle Vertex AI), because it is better at parsing and handling different languages than other models, which work well only with English content.

xenstar commented 1 month ago

Also openrouter.ai as well.

unclecode commented 1 month ago

Thank you everyone, and thank you, @huibrian, for your kind words. I hope this is really helpful. For your information, we use the LiteLLM library for LLM extraction, which lets us support almost any of its 100-plus LLM providers, including Google Vertex AI. In our code, the provider name you pass can be any name that LiteLLM supports, including Google Vertex AI. You then pass your instructions and get the result. You're not bound to one or two providers. You can even pass Ollama to run a local LLM on your machine, or Hugging Face, for example. OpenRouter is also supported through LiteLLM. The LiteLLM documentation lists all the providers (https://docs.litellm.ai/docs/providers), and our library covers the same ones. Here is an example of using a different provider:

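import asyncio
import os
from typing import Dict, Optional

from pydantic import BaseModel, Field

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy


# Assumed definition: the original snippet uses OpenAIModelFee without showing it.
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")


async def extract_structured_data_using_llm(provider: str, api_token: Optional[str] = None, extra_headers: Optional[Dict[str, str]] = None):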
    print(f"\n--- Extracting Structured Data with {provider} ---")

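    # Provider strings look like "ollama/<model>", so match on the prefix; a local Ollama model needs no token.
    if api_token is None and not provider.startswith("ollama"):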
        print(f"API token is required for {provider}. Skipping this example.")
        return

    extra_args = {}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
                extra_args=extra_args
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

async def main():
    await extract_structured_data_using_llm("gemini/gemini-1.5-pro", os.getenv("GEMINI_API_KEY"))

if __name__ == "__main__":
    asyncio.run(main())
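
Since the provider argument is just a LiteLLM-style "<provider>/<model>" string, switching providers mostly comes down to picking the string and finding the matching API key. Here is a minimal sketch of that lookup. The table PROVIDER_ENV_VARS and the helpers provider_prefix, resolve_api_token and needs_token are hypothetical names made up for illustration, not part of Crawl4AI or LiteLLM, and the model names in the comments are only examples. The environment variable names follow the ones LiteLLM documents for these providers. Vertex AI is left out on purpose, because as far as I know LiteLLM authenticates the vertex_ai/ prefix through Google Cloud credentials rather than a single API key.

```python
import os
from typing import Mapping, Optional

# Hypothetical lookup table: LiteLLM provider prefix -> environment variable
# holding its API key. None means no key is needed (local Ollama).
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "ollama": None,
}


def provider_prefix(provider: str) -> str:
    """Return the part before the first '/', e.g. 'gemini' for 'gemini/gemini-1.5-pro'."""
    return provider.split("/", 1)[0]


def resolve_api_token(provider: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Look up the API key for a provider string; None if not needed or not set."""
    env = os.environ if env is None else env
    var = PROVIDER_ENV_VARS.get(provider_prefix(provider))
    return env.get(var) if var else None


def needs_token(provider: str) -> bool:
    """True unless the prefix is known to work without a key; unknown prefixes count as needing one."""
    return PROVIDER_ENV_VARS.get(provider_prefix(provider), "UNKNOWN") is not None


if __name__ == "__main__":
    # Example provider strings; the model names are illustrative only.
    for name in ("gemini/gemini-1.5-pro", "openrouter/openai/gpt-3.5-turbo", "ollama/llama3"):
        token_set = resolve_api_token(name) is not None
        print(f"{name}: needs token={needs_token(name)}, token set={token_set}")
```

With something like this, the call in main() could become extract_structured_data_using_llm(provider, resolve_api_token(provider)) for whichever provider string you choose.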