unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper

Add some textual context to images, videos and audio in `result.media`, so that it can be useful in RAG #26

Closed: aravindkarnam closed this issue 2 months ago

aravindkarnam commented 3 months ago

Issue

Currently, all images, videos and audio in the crawled page are returned in result.media as follows:

{
  audio: [],
  images: [
    {
      src: "https://cdn.com/path/to/image",
      alt: "mobile_integrations_560px",
      type: "image"
    },
    ...
  ],
  videos: []
}

Each entry in this list contains a link/URL from which the asset can be downloaded, the text passed in the alt attribute of the corresponding <img> HTML tag, and a type field (since the entries are already grouped into audio, videos and images, I'm not sure why we need the type field at all).

Other than this, there's no useful information about what the content of the media file might be (it could be an architecture diagram, a testimonial, a partners or integrations graphic, etc.).

So we have to rely entirely on the alt attribute: vectorise it and store the link to the media file against it, so the file can be retrieved during RAG. And there are many times where I've found utterly garbage, if not misleading, values in the alt attributes of image tags, even on well-known companies' sites.
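To make the current limitation concrete, this is roughly all we can do today (an illustration only; I'm using sentence-transformers here as a stand-in for whatever embedding model one prefers):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
images = [{"src": "https://cdn.com/path/to/image", "alt": "mobile_integrations_560px"}]

# Embed the only text we have per image (the alt attribute) and keep the src as the payload
embeddings = model.encode([img["alt"] for img in images])
vector_index = list(zip(embeddings, (img["src"] for img in images)))  # stand-in for a real vector store

If the alt text is garbage, retrieval over these vectors is garbage too.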

I'm talking about a use case like Perplexity (but for a company-internal knowledge base, etc.).


Suggestion

I suggest that we add some textual context based on position, i.e. where the image is found within the page: perhaps the closest paragraph or header (right above the image). @unclecode, please let me know your thoughts on this, or whether there's a better way to do RAG on images for user queries.
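To illustrate what I mean, here's a rough sketch (using BeautifulSoup; the helper name and threshold are made up):

from bs4 import BeautifulSoup

def closest_text_above(img_tag, min_words: int = 5) -> str:
    """Return the text of the closest preceding paragraph or header for an <img> tag."""
    candidate = img_tag.find_previous(["p", "h1", "h2", "h3", "h4", "h5", "h6"])
    while candidate is not None:
        text = candidate.get_text(" ", strip=True)
        if len(text.split()) >= min_words:
            return text
        candidate = candidate.find_previous(["p", "h1", "h2", "h3", "h4", "h5", "h6"])
    return ""

html = '<h2>Our integrations</h2><p>Connect with 50+ tools, no code required.</p><img src="https://cdn.com/path/to/image" alt="mobile_integrations_560px">'
soup = BeautifulSoup(html, "html.parser")
print(closest_text_above(soup.find("img")))  # -> "Connect with 50+ tools, no code required."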

unclecode commented 3 months ago

Very interesting point that you mentioned, and it's actually been on my mind. Initially, I started by just extracting the media tags to have them ready. I have two strategies in mind.

First, I'm looking for a very small image-to-text model to annotate images and create descriptions. I've found a few, and am also working on one myself. This model needs to be efficient, not large, as we only need small captions and want to run it in parallel without heavy GPU dependence. There are many media sources on a page, so it needs to be fast; otherwise it will be time-consuming.
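To give a feel for this first strategy (the exact model is still an open question; this sketch uses the Hugging Face image-to-text pipeline with BLIP-base purely as a stand-in):

from transformers import pipeline

# Small captioning model as a placeholder; whatever we pick must be fast and CPU-friendly
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("https://cdn.com/path/to/image")  # also accepts local paths or PIL images
print(result[0]["generated_text"])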

Second, there's a statistical NLP approach that uses TF-IDF to gather information around the image, pulling in relevant text from the website, maybe even from other parts of the page that refer to the image. This doesn't need a model, just a statistical method.

Since you brought this up, I'll start with the statistical approach while continuing my research and development on the LLM. Thank you for bringing this to my attention.

aravindkarnam commented 3 months ago

@unclecode Sounds good! Please let me know if I can be of any help with research or coding. Since this library is slowly turning into a critical piece of one of my projects, I'm more than happy to help.

unclecode commented 3 months ago

@aravindkarnam I really appreciate your support and enthusiasm! You are most welcome to help with this enhancement. I'd like to focus on statistical and lexical-level NLP approaches initially. If these work well, we'll integrate them into the current codebase and update the documentation accordingly. While I'm keeping an eye on lightweight image-to-text models for generating captions, I'd like to put that on hold for now to see how our initial approach pans out.

For our immediate next steps, I propose the following experiment:

  1. Develop a function that, for each image element:
     a. Identifies the closest parent element.
     b. Extracts the text surrounding the image.
     c. If the extracted text doesn't meet a certain word-count threshold, moves to the next parent element.
     d. Repeats this process until we have sufficient contextual text.
     (See the sketch after this list.)

  2. Apply this function to extract text and corresponding images from 50 different websites, creating a diverse dataset.

  3. Write a script that uses a multimodal model (such as Claude 3.5 Sonnet, GPT-4o, LLaVA, or Moondream) to evaluate our extracted text. The script should:
     a. Pass both the image and the extracted text to the model.
     b. Ask the model to evaluate whether the text is a good representative description of the image (return 0 or 1).
     c. Collect the model's assessment for each image-text pair and finally calculate the accuracy.

  4. Analyze the results to determine the accuracy of our approach.

  5. If the results are promising, we'll use this method as a basis for caption generation and apply TF-IDF to extract keywords, making the information more interconnected with the rest of the page.

  6. If the results aren't satisfactory, we'll analyze the problematic samples, gather insights, and refine our strategy accordingly.
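To make step 1 concrete, here is a minimal sketch of the parent-walking idea (assuming BeautifulSoup tags; the function name and threshold are placeholders to experiment with):

from bs4 import BeautifulSoup

def find_closest_parent_with_useful_text(img_tag, word_count_threshold: int = 30) -> str:
    """Walk up the ancestors of an <img> until the surrounding text is long enough."""
    parent = img_tag.parent
    while parent is not None:
        # All visible text under this ancestor, including siblings of the image
        text = parent.get_text(" ", strip=True)
        if len(text.split()) >= word_count_threshold:
            return text
        parent = parent.parent
    return ""

# Example
html = '<section><h2>Integrations</h2><p>Connect your stack with 50+ tools out of the box.</p><img src="/img/integrations.png"></section>'
soup = BeautifulSoup(html, "html.parser")
print(find_closest_parent_with_useful_text(soup.find("img"), word_count_threshold=5))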

So, would you be interested in taking on this experiment? If so, which parts would you like to work on? If you're open to it, you can handle the entire research to determine if this approach works well or not. Let me know. I'm happy to provide any necessary guidance or collaborate on specific aspects.

Let me know your thoughts on this approach and if you have any suggestions or modifications. I'm excited to see where this leads us!

aravindkarnam commented 3 months ago

@unclecode Sure, all the steps make sense; I'll take on this experiment. Steps 1 & 2 sound straightforward. For 3 & 4, I haven't used a multimodal model through an API before, so let me give that a try and figure it out.

For 5: I conceptually understand TF-IDF but have never actually used it, so I may need some help with that. Anyway, let me start work on this; I'll post progress or blockers here.

unclecode commented 3 months ago

@aravindkarnam Fantastic, glad to have you onboard! No worries about 3 and 5. I'll share a few code snippets from our library to speed things up. To evaluate your extracted text, use the following code:

import anthropic
import json
import base64
from typing import Dict, Any

def get_evaluation_prompt(extracted_description: str) -> str:
    return f"""
    Analyze the given image and the provided description. The description was generated by an automated system.

    Description: {extracted_description}

    Please evaluate the quality and relevance of this description in relation to the image. 
    Return your evaluation as a JSON object with the following structure:
    {{
        "evaluation": string,  // One of: "relevant", "not_clear", "not_relevant"
        "score": number,  // A score from 0 to 100
        "explanation": string  // A short and concise explanation of your evaluation
    }}

    Ensure your response is wrapped in an XML tag called <result>. Make sure your JSON is valid and parseable.
    """

def evaluate_image_description(image_path: str, extracted_description: str) -> Dict[str, Any]:
    try:
        client = anthropic.Anthropic()

        with open(image_path, "rb") as image_file:
            image_data = base64.b64encode(image_file.read()).decode("utf-8")

        media_type = "image/jpeg" if image_path.lower().endswith(('.jpg', '.jpeg')) else "image/png"

        prompt = get_evaluation_prompt(extracted_description)

        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": media_type,
                                "data": image_data,
                            },
                        },
                        {
                            "type": "text",
                            "text": prompt
                        }
                    ],
                }
            ],
        )

        result_xml = " ".join([c.text for c in response.content if c.type == 'text'])

        # Extract content within <result> tags
        start_tag = "<result>"
        end_tag = "</result>"
        start_index = result_xml.find(start_tag)
        end_index = result_xml.find(end_tag)

        if start_index != -1 and end_index != -1:
            json_str = result_xml[start_index + len(start_tag):end_index].strip()
            result = json.loads(json_str)
            return result
        else:
            return {"error": "Could not find result tags in the response"}

    except FileNotFoundError:
        return {"error": "Image file not found"}
    except anthropic.APIError as e:
        return {"error": f"API error: {str(e)}"}
    except json.JSONDecodeError:
        return {"error": "Failed to parse JSON response"}
    except Exception as e:
        return {"error": f"Unexpected error: {str(e)}"}

# Example usage
result = evaluate_image_description("bird.jpeg", "This is a bird and tree.")
print(result)

You can keep experimenting and play around to get "relevant" results more often. You can also use the GPT-4 API. For TF-IDF, although we'll focus on it later, here is some sample code. You pass a query and a long text (in our case, the extracted markdown); it breaks the text into paragraphs, and a third parameter ensures we have rich paragraphs, especially in extracted markdown where some paragraphs might only have a few words. This gives better-quality results.

from typing import List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def chunk_paragraphs(paragraphs: List[str], min_word_count: int) -> List[str]:
    """
    Merge paragraphs sequentially to ensure each chunk has at least the minimum word count.

    Args:
    paragraphs (List[str]): List of original paragraphs.
    min_word_count (int): Minimum number of words for each chunk.

    Returns:
    List[str]: List of paragraph chunks.
    """
    chunks = []
    current_chunk = []
    current_word_count = 0

    for paragraph in paragraphs:
        words = paragraph.split()
        current_chunk.append(paragraph)
        current_word_count += len(words)
        # Flush the chunk once it has reached the minimum word count
        if current_word_count >= min_word_count:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_word_count = 0

    # Keep any trailing paragraphs, even if below the minimum
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

def sort_paragraphs_by_relevance(query: str, text: str, min_word_count: int = 50) -> List[Tuple[str, float]]:
    """
    Break text into paragraphs, chunk them based on minimum word count,
    calculate their relevance to the query using TF-IDF, and return chunks sorted by relevance.

    Args:
    query (str): The search query.
    text (str): The long text to be analyzed.
    min_word_count (int): Minimum number of words for each paragraph chunk.

    Returns:
    List[Tuple[str, float]]: A list of tuples containing paragraph chunks and their relevance scores,
                             sorted in descending order of relevance.
    """
    # Split text into paragraphs
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

    # Chunk paragraphs based on minimum word count
    paragraph_chunks = chunk_paragraphs(paragraphs, min_word_count)

    # Add query to the list of documents for vectorization
    documents = [query] + paragraph_chunks

    # Create TF-IDF vectorizer and transform documents
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Calculate cosine similarity between query and each paragraph chunk
    query_vector = tfidf_matrix[0]
    chunk_vectors = tfidf_matrix[1:]
    similarities = cosine_similarity(query_vector, chunk_vectors).flatten()

    # Create list of (chunk, similarity) tuples
    chunk_similarities = list(zip(paragraph_chunks, similarities))

    # Sort chunks by similarity in descending order
    sorted_chunks = sorted(chunk_similarities, key=lambda x: x[1], reverse=True)

    return sorted_chunks

# Example usage
if __name__ == "__main__":
    sample_text = """
    Python is a high-level, interpreted programming language.
    It was created by Guido van Rossum and first released in 1991.

    Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
    Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

    Python is dynamically typed and garbage-collected.
    It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming.

    Python is often described as a "batteries included" language due to its comprehensive standard library.
    It has a large and comprehensive standard library, which is one of Python's greatest strengths.

    Python's syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.
    The language provides constructs intended to enable clear programs on both a small and large scale.
    """

    query = "Python programming features"
    min_word_count = 30

    results = sort_paragraphs_by_relevance(query, sample_text, min_word_count)

    print(f"Paragraph chunks sorted by relevance to the query (min {min_word_count} words):")
    for chunk, score in results:
        print(f"\nRelevance score: {score:.4f}")
        print(f"Word count: {len(chunk.split())}")
        print(chunk)

Then you can start and keep me updated on the progress. I think this will be a great and useful feature.

aravindkarnam commented 3 months ago

@unclecode I've implemented step 1, and after looking at the results I had a couple of thoughts.

  1. Not all images scraped from a webpage are equal. There are hero images, section images, etc., such as this, which contain useful information for RAG, while others are merely icons on buttons and links. For example, a button called "Take me there" may contain a right-arrow icon, such as this. There's no value in either storing such an image or getting it transcribed by a model, and the sheer number of these images I'm getting from crawling even very simple pages is quite high.
  2. We should implement a very simple filtration strategy to weed out such useless images at a very early stage. This will also save some cost downstream when we pass images to models. Here are some proposals I have for filtration:
     a. Create a cut-off for images based on height and width, e.g. eliminate images that are less than 100px in either height or width. We can expose this as a param to the user with a default value (similar to the word_count_threshold param).
     b. Eliminate based on parent elements. From what I've seen, most of these small images are part of elements like buttons and links that open other pages/navigation elements.

If we use a combination of (a) and (b), we can clear a lot of clutter upfront; something along these lines is what I have in mind (rough sketch below). Your thoughts?
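A rough sketch (the name and defaults are placeholders):

def passes_basic_image_filter(img, min_dim: int = 100) -> bool:
    """Proposals (a) and (b): drop tiny images and images nested in navigation elements."""
    # (a) Dimension cut-off based on the width/height attributes, when present
    try:
        width = int(img.get("width", 0))
        height = int(img.get("height", 0))
    except ValueError:
        width = height = 0
    if 0 < width < min_dim or 0 < height < min_dim:
        return False
    # (b) Parent-element check: icons usually live inside buttons or nav elements
    # (links could be checked too, at the risk of dropping linked content images)
    if img.find_parent(["button", "nav"]) is not None:
        return False
    return True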

Text from the nearest parent looks quite useful so far. I've only inspected it manually on 3-4 sites; I'll evaluate it with the scripts you shared shortly.

unclecode commented 3 months ago

@aravindkarnam You're absolutely spot on about the image filtering! It's super common to have loads of duplicate images across pages. So here's what I'm thinking: Let's create a scoring system for each image and chuck out anything below a certain threshold. Here's a quick rundown of what we could score on (for each item we score 1):

  1. Size: Bigger usually means more important. (score 1 if width and height > 100)
  2. File size: Heftier files tend to be meatier content. (score 1 if size > 10,000 bytes)
  3. Filename/path: Avoid stuff with "icon" or "button" in the name.
  4. HTML context: Skip images in buttons or with "icon" classes.
  5. Alt text: Meaningful alt text usually means it's a content image.
  6. Image format: Prioritize JPEGs and PNGs over SVGs and GIFs.
  7. Position: Images higher up the page often matter more. (Compare the image's index against the total number of images; give the point when the image falls in the earlier half of the page.)
  8. Duplication: Keep only one copy of each unique image. (Create a simple hash by mixing width and height like f"{width}x{height}:{src}", then use that to avoid duplication—this is not the best way, but it's fast)

The key here is speed – we're dealing with potentially thousands of images, so every millisecond counts. For example, to get image sizes, use a HEAD request instead of GET to the image URL. It's way faster because you're not downloading the whole image, just the metadata ('Content-Length').

Wrap all of this up into a score function that takes the "soup" object, base_url, and min_score (I feel 4 is good) and returns a filtered list of image elements, then plug that into our main crawling function.
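Something along these lines is roughly what I'm imagining (a sketch only; thresholds, weights, and heuristics are placeholders to tune):

import requests
from urllib.parse import urljoin

def score_image_for_usefulness(img, base_url: str, index: int, total_images: int) -> int:
    """Give one point per heuristic above; the caller filters on min_score."""
    score = 0
    src = urljoin(base_url, img.get("src", ""))
    alt = img.get("alt", "")

    # 1. Size attributes
    try:
        if int(img.get("width", 0)) > 100 and int(img.get("height", 0)) > 100:
            score += 1
    except ValueError:
        pass

    # 2. File size via a HEAD request (headers only, no image download)
    try:
        head = requests.head(src, timeout=3, allow_redirects=True)
        if int(head.headers.get("Content-Length", 0)) > 10_000:
            score += 1
    except (requests.RequestException, ValueError):
        pass

    # 3. Filename/path: avoid "icon", "button", "logo" in the path
    if not any(word in src.lower() for word in ("icon", "button", "logo")):
        score += 1

    # 5. Meaningful alt text
    if len(alt.split()) >= 3:
        score += 1

    # 6. Format: prefer JPEG/PNG over SVG/GIF
    if src.lower().split("?")[0].endswith((".jpg", ".jpeg", ".png")):
        score += 1

    # 7. Position: images in the earlier part of the page score a point
    if total_images and index / total_images < 0.5:
        score += 1

    return score

def filter_useful_images(soup, base_url: str, min_score: int = 4):
    """Apply the scorer to every <img> in the soup and keep those at or above min_score."""
    # 8. Deduplication could be added here with a seen-set keyed on
    #    f"{img.get('width', '')}x{img.get('height', '')}:{img.get('src', '')}"
    imgs = soup.find_all("img")
    return [img for i, img in enumerate(imgs)
            if score_image_for_usefulness(img, base_url, i, len(imgs)) >= min_score]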

Let me know what you think or if you need any clarification. Excited to see what you come up with!

aravindkarnam commented 3 months ago

@unclecode Completed the scoring function, and now the results are much cleaner. Proceeding to step 3, i.e. getting the results evaluated with a model. Aligned on all factors for scoring except 6 and 8.

> Image format: Prioritize JPEGs and PNGs over SVGs and GIFs.

Developers prefer SVGs over JPG and PNG because they offer resolution independence and animation. In fact, of late Lottie files, which use SVG or canvas to render images, are very popular on landing pages; we should even plan to convert those SVG images to JPG or GIF and caption them in the future. For example, check out these pages where the hero image is an animated SVG (urbanpiper.com, memberstack.com).

> f"{width}x{height}:{src}"

Why don't we just use src? Why do we have to prepend width and height? The image at src is going to be the same file regardless, so we can just get it captioned and stored once.

Here are the next steps I'm planning:

Then we can plan to tweak it further to improve the scores.

I've been a little busy yesterday and today; I'll complete this in the next 2-3 days.

unclecode commented 3 months ago

@aravindkarnam, Congratulations on completing steps 1 and 2 and moving on to step 3. Let me address some of your questions.

Regarding image formats, it's somewhat subjective. For programmers, SVGs are great due to their vector-based nature, allowing us to use and render them in any size easily. However, our goal is to extract images that carry meaning relevant to the articles or content on the page. Most images used in articles, products, news, etc., are in PNG or JPEG formats. That's why these formats are prioritized. SVGs or animations are typically used for icons or buttons, which we don't prioritize for extraction.

To accommodate various preferences, we should make these settings customizable (treat them like hyperparameters). For now, use config.py in the project's root to keep all the constants, such as the list of prioritized image formats or the thresholds, and import from there. Eventually, we can move these settings to a YAML file for better flexibility.
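For instance, config.py could start as simply as this (the constant names and values here are placeholders, not a final list):

# config.py (sketch)
IMAGE_SCORE_THRESHOLD = 4                      # suggested min_score
IMAGE_MIN_DIMENSION = 100                      # px, width/height cut-off
IMAGE_MIN_FILE_SIZE = 10_000                   # bytes, from the HEAD request
PRIORITIZED_IMAGE_FORMATS = [".jpg", ".jpeg", ".png"]
DEPRIORITIZED_IMAGE_FORMATS = [".svg", ".gif"]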

Your second question was about using width and height. Some servers and applications can return different versions of the same image based on specified width and height, meaning two images with the same source value but different dimensions might actually be different. We can decide to keep the smallest one or both, but this was the reason for including width and height. If there is no such differentiation, then the source alone is sufficient.
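For illustration, deduplication with that key could be as simple as this (assuming images is the list of <img> tags):

seen = set()
unique_images = []
for img in images:
    key = f"{img.get('width', '')}x{img.get('height', '')}:{img.get('src', '')}"
    if key not in seen:
        seen.add(key)
        unique_images.append(img)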

Lastly, I recently posted a tweet reviewing Florence-2, a new vision model, which I've quantized to 4 bits. It's quite impressive. I recommend using this model instead of OpenAI's GPT. You can refer to my tweet and the accompanying Colab notebook, which explains how to use and experiment with it. If it performs well, I can further optimize it for our projects in the second phase. Here is the link to the post: https://x.com/unclecode/status/1805992326915108996?s=46&t=J1hebTqzIYxu8ZpV-7GoyQ

Well done so far, and I look forward to your next updates.

unclecode commented 3 months ago

@aravindkarnam Hope you are doing well. Any updates on your side?

aravindkarnam commented 3 months ago

@unclecode Ran into some health troubles over the weekend. Just recovered fully and got back to work today. I created a sample set of 25 URLs (sample.csv) with a good mix of sites. I ran into a lot of edge cases and exceptions with the e-commerce websites in the filtering and scoring functions, and fixed them all.

I've now run the code I wrote for filtering, scoring, and text extraction based on the nearest eligible parent, and I have the results ready in a pandas DataFrame (refer to data.csv; the "desc" column is the extracted text and "score" is calculated based on the approach we discussed above). Next I'll try to get the Florence-2 model to evaluate the text extracted with the nearest-parent approach, and I'll share my findings here.

I'm already committing to my forked repository: https://github.com/unclecode/crawl4ai/compare/main...aravindkarnam:crawl4ai:main. Please take a look at the code and let me know if any changes need to be made.

unclecode commented 3 months ago

@aravindkarnam I'm so sorry to hear that. I hope you're recovering and doing well. I was a bit busy today and couldn't go through your changes, but they sound very intriguing. I'm glad you created a good sample set and are now ready to move on to the Florence-2 model. I'll review everything tomorrow morning and update you. Thank you so much, and please take care of yourself.

unclecode commented 2 months ago

@aravindkarnam I have just returned from a business trip and checked the data.csv file. The results are quite good; well done! It seems like the idea pretty much worked. I tested a few instances, and everything looks good. Let's move forward with Florence-2. I am very confident that we have found a good solution, and most importantly, a fast one. I have one suggestion regarding the code: in the section where you filter out images that are hidden or wrapped in a button, the three nested if statements bother me a bit. I have refactored your code, applied a bit of separation of concerns, and propose the following solution.

def is_valid_image(img, parent, parent_classes):
    """Filter out hidden images and obvious icon/button/logo assets."""
    style = img.get('style', '')
    src = img.get('src', '')
    classes_to_check = ['button', 'icon', 'logo']
    tags_to_check = ['button', 'input']

    return all([
        'display:none' not in style,  # not hidden via inline style
        src,                          # has a source at all
        # no icon/button/logo hints in the src, alt, or parent classes
        not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
        parent.name not in tags_to_check  # not wrapped directly in a button/input
    ])

def process_image(img, url, index, total_images):
    """Score an image and return its media entry, or None if it should be skipped."""
    if not is_valid_image(img, img.parent, img.parent.get('class', [])):
        return None

    score = score_image_for_usefulness(img, url, index, total_images)
    if score <= 2:
        return None

    return {
        'src': img.get('src', ''),
        'alt': img.get('alt', ''),
        'desc': find_closest_parent_with_useful_text(img),
        'score': score,
        'type': 'image'
    }

# Main processing
imgs = body.find_all('img')
media['images'] = [
    result for result in 
    (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
    if result is not None
]

The rest all looks fantastic. Please continue and finish up the Florence-2 evaluation; then it seems we are good to go. Good job!

aravindkarnam commented 2 months ago

Thanks @unclecode. I've been trying to run Florence-2 on my MacBook M2 and haven't been able to. It keeps asking for some NVIDIA dependencies (CUDA/nvcc) that I can't find a Mac version of anywhere. For now, I'm resigned to trying the Florence model on Colab notebooks, so I'll have to export the results from my local environment into a CSV and then run the tests on Colab. Please let me know if you've figured out how to run this Florence notebook on a Mac.

> In the section where you filter out images that are hidden or wrapped in a button, the three nested if statements bother me a bit.

I'll update this bit. Thanks for the suggestion.

unclecode commented 2 months ago

@aravindkarnam A few weeks ago, I created a Colab to review the quantized version. I quantized and uploaded the model to my HuggingFace repository. Please feel free to use it and determine if it provides better results. Ultimately, you can conduct your evaluation on Colab.

aravindkarnam commented 2 months ago

@unclecode I have tried your quantized Florence version (unclecode/folrence-2-large-4bit). I attempted to load it directly with the following code.

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("unclecode/folrence-2-large-4bit", trust_remote_code=True)

This has a dependency on the flash_attn library, which in turn needs nvcc and CUDA in the environment, which is currently not possible on Apple silicon. Both my Macs have Apple M2 chips.

I have been limited to using Google Colab, since there's a dependency on a GPU and CUDA either way (quantised or non-quantised model). I could only get some time per day with the T4 GPU backend; after that, the allocated kernels don't have a GPU and the code errors out saying a GPU is required. So I couldn't proceed on this task as quickly as I wanted to. (I'm also not comfortable adding my card at this point 😟 to get dedicated GPUs, considering I'm new to this field and don't want to risk racking up huge bills unknowingly with a bad config.)

Please let me know if there are any workarounds for this. (Maybe you've figured out how to make these libraries work on Apple M2, or there are other free resources I could use.)

Overall, I completed preliminary testing with Florence-2. Here are my observations.

The results felt far better with text scraped from the immediate surroundings of the image than with model captioning.

For example, when the scraped text says

> Shaun Pollock. of. South Africa. bowls to. Michael Hussey. of. Australia. during the 2005. Boxing Day Test. match at the. Melbourne Cricket Ground.

the model captions it as

> A group of people standing on top of lush green field.

In fact there's nothing wrong with the model's caption; that's exactly what's in the picture. But there's a cricket match going on there, which you will only know if you read the surrounding text and understand cricket.

Similarly, in another example from a product landing page, the scraped text is

> Cut Your Reading Time in Half. Let Speechify Read to You.. Gwyneth Paltrow. English Female Voice. Snoop Dogg. English Male Voice. John. English Male Voice. Mr. Beast.

whereas the model's caption is

> phone with a message on the screen next to a phone with an image of a person on it

Another example, from Wikipedia: the scraped text is

> Endeavour. docked at ISS during the STS-134 mission

and the model caption is

> image of a satellite in the sky with a building in the background

An example from e-commerce: the scraped text is

> Dreo Humidifiers for Bedroom, Top Fill 4L Supersized Cool Mist Humidifier with Oil

and the model caption is

> a glass jar filled with water sitting on top of a table

In almost all cases the model's caption would be right if you look at the picture, but there's far more going on in the picture, and page context is required to truly understand it. I've played around with the prompt as well, but it doesn't seem possible for the model to clearly describe what's going on based on the picture alone.

Please let me know how you advise we proceed from here. We can either stop at this point or further refine the scraped text by passing it to the model along with the image.

PS. I also made changes in this commit as per your suggestion. If you can create a new branch for this feature, I can raise a PR for code I have so far.

unclecode commented 2 months ago

@aravindkarnam, I've been away for a few days. I checked the details and agree: we gather much better info than those models do, and much faster. Such a great result and experience, isn't it? We've had a complete professional/academic journey together. I've created the branch "main-img-captionify"; send your pull request over there and I'll test it. Excited to see it!

aravindkarnam commented 2 months ago

@unclecode Raised a PR ☝🏽. Learned a lot of stuff, while working on this enhancement. Looking forward to more such collabs.

unclecode commented 2 months ago

@aravindkarnam This is a great job; I'm currently bypassing image size checks, which slow things down, yet the scoring remains effective. I will merge this into the main branch and release a new version, 0.2.8, giving you full credit for your contribution. I'm also working on integrating PDF functionality without third-party libraries, and the results are promising. Another upcoming task is using multiprocessing to crawl multiple URLs, which will support a depth-first search approach for deep crawling. Initially, we'll fetch the sitemap and create one if it's unavailable, then crawl the entire page. I think this is a good area for collaboration if you're interested.
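To give a rough picture of the direction (a sketch only; the sitemap fallback and the crawl function are placeholders for the real implementation):

import multiprocessing as mp
from typing import List
from urllib.parse import urljoin
from xml.etree import ElementTree

import requests

def urls_from_sitemap(base_url: str) -> List[str]:
    """Read sitemap.xml if available; a real version would build one by crawling otherwise."""
    resp = requests.get(urljoin(base_url, "/sitemap.xml"), timeout=10)
    if resp.status_code != 200:
        return [base_url]  # placeholder for the "create a sitemap" fallback
    tree = ElementTree.fromstring(resp.content)
    return [el.text for el in tree.iter() if el.tag.endswith("loc") and el.text]

def crawl_one(url: str) -> dict:
    return {"url": url}  # placeholder for the existing single-URL crawl

if __name__ == "__main__":
    urls = urls_from_sitemap("https://example.com")
    with mp.Pool(processes=4) as pool:
        results = pool.map(crawl_one, urls)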

aravindkarnam commented 2 months ago

> Another upcoming task is using multiprocessing to crawl multiple URLs, which will support a depth-first search approach for deep crawling. Initially, we'll fetch the sitemap and create one if it's unavailable, then crawl the entire page. I think this is a good area for collaboration if you're interested.

I'm interested, unclecode. I was a developer who moved into a product management career for the past few years, and I'm now picking up coding again as I try to build my own product as a solo founder. I believe this is the best way to rebuild my coding chops, so count me in for any exciting new enhancements you're planning.

Also, I suggest we start a Discord server for crawl4ai, at least for collaborators and selected users, so that we can communicate more efficiently about bugs and enhancements.

unclecode commented 2 months ago

@aravindkarnam It's nice to know your backstory, and I understand the feeling. Being in product management must have given you an advantage in understanding different perspectives. Often, solo developers don't sense the broader picture, and project managers who aren't developers miss the developer's side. I'm curious to see how it goes for you with both experiences.

Let's get into Discord. I'll create a server and add you as a moderator. Also, I'm handling the pull request you sent.

There are certain things I want to bring to Crawl4AI. We can discuss deep crawling, creating snapshots of all website pages, and generating quick summaries. This will help create a semantic index for websites, useful for in-context learning for any large language model.

I'm also working on an interesting PDF library that I want to open source and bring to Crawl4AI. There's nothing more rewarding than building products that you also use. Good entrepreneurs often become users of their own products.

I'm an entrepreneur who has been running a tech business in Southeast Asia for the last 8 years, with different divisions, offices, and a big team. Now I have a bit more time, almost like early retirement. After all these years I want to focus more on my computer-scientist and researcher side, and working in the AI area is a lot of fun for me now. One side is relevant to my businesses, but the other is about contributing more to open source, democratizing it, and working with people from other generations like yourself. So, you are most welcome.

I already have multiple projects that I typically tweet about and share on my Twitter/X account. I'll bring them to this Discord as well, and if you want to contribute, that would be great. These projects are somehow connected with each other too. So that's it. Let's build together.