GuoFan1996 commented 5 months ago

Integrate Puppeteer for Dynamic Content Extraction

Description:

This PR integrates Puppeteer into our content extraction feature, enhancing our ability to handle dynamic content from platforms like YouTube, which was previously inaccessible through static HTML parsing. This upgrade significantly broadens our data retrieval capabilities to include dynamically loaded content.

Major Changes:

Puppeteer Integration: Adding Puppeteer lets the function interact with JavaScript-dependent web pages, so content that is rendered client-side can now be extracted.
Dynamic Transcript Extraction: A new function, extractYoutubeTranscript, uses Puppeteer to fetch and extract YouTube transcripts, which static HTML parsing could not reach.
Configuration and Security: Utilizes environment variables for configuration, including PUPPETEER_BROWSERLESS_IO_KEY, enhancing security and deployment flexibility.
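
As a rough illustration of how the pieces above could fit together, the sketch below picks an extraction path per URL and builds a Browserless WebSocket endpoint from the key. The helper names (isYoutubeUrl, chooseStrategy, browserlessEndpoint), the choice to throw when the key is missing for a YouTube link, and the endpoint format are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical helpers: names, error behavior, and endpoint format are
// illustrative assumptions, not taken from the PR's actual implementation.

/** Returns true when the URL's host is youtube.com, a subdomain of it, or youtu.be. */
function isYoutubeUrl(rawUrl: string): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  return host === "youtu.be" || host === "youtube.com" || host.endsWith(".youtube.com");
}

type Strategy = "puppeteer" | "static";

/**
 * YouTube links go through Puppeteer (and need the Browserless key);
 * every other link keeps using static HTML parsing.
 */
function chooseStrategy(rawUrl: string, browserlessKey: string | undefined): Strategy {
  if (!isYoutubeUrl(rawUrl)) {
    return "static";
  }
  if (!browserlessKey) {
    // Assumption: fail loudly rather than silently skipping the transcript.
    throw new Error("PUPPETEER_BROWSERLESS_IO_KEY is required for YouTube links");
  }
  return "puppeteer";
}

/** Builds the WebSocket endpoint that would be passed to puppeteer.connect. */
function browserlessEndpoint(key: string): string {
  // Assumed endpoint format; check the Browserless docs for the current one.
  return `wss://chrome.browserless.io?token=${encodeURIComponent(key)}`;
}
```

Inside the Edge Function, the key would typically come from the environment (for example Deno.env.get("PUPPETEER_BROWSERLESS_IO_KEY")), and the endpoint would be handed to puppeteer.connect({ browserWSEndpoint }).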

Before Serving and Testing:

Visit Browserless.io to generate an API key.
Store the API key in your .env file as follows: PUPPETEER_BROWSERLESS_IO_KEY=your-browserless-api-key. This step ensures secure access to Browserless services for Puppeteer operations.

Serve and Test the Function:

Serving the Edge Function Locally:

To serve the Edge Function locally for testing and development, use the Supabase CLI with the following command:

supabase functions serve extractContent

Testing with `curl`:

curl -i --location --request POST 'http://127.0.0.1:54321/functions/v1/extractContent' \
    --header 'Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZS1kZW1vIiwicm9sZSI6ImFub24iLCJleHAiOjE5ODM4MTI5OTZ9.CRXP1A7WOeoJeXxjNni43kdQwgnWNReilDMblYTn_I0' \
    --header 'Content-Type: application/json' \
    --data '{"name":"extractContent","url": "https://example.com"}'

Replace https://example.com with the URL you want to extract content from.
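
For callers that prefer code over the command line, the same request can be assembled in TypeScript. This is a minimal sketch: buildExtractRequest and the ExtractRequest shape are hypothetical names introduced here, and the default base URL is simply the local address used in the curl command.

```typescript
// Hypothetical helper mirroring the curl command; not part of the PR.
interface ExtractRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

/** Builds the POST request for the locally served extractContent function. */
function buildExtractRequest(
  targetUrl: string,
  anonKey: string,
  baseUrl: string = "http://127.0.0.1:54321",
): ExtractRequest {
  return {
    url: `${baseUrl}/functions/v1/extractContent`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${anonKey}`,
      "Content-Type": "application/json",
    },
    // Same JSON payload as the --data argument in the curl example.
    body: JSON.stringify({ name: "extractContent", url: targetUrl }),
  };
}
```

To send it, split off the URL and pass the rest to fetch, for example: const { url, ...init } = buildExtractRequest("https://example.com", anonKey); const res = await fetch(url, init);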

This PR resolves #24.

inhwaS commented 5 months ago

Regarding "Visit Browserless.io to generate an API key": did you sign up for a one-week free trial? If so, we should register our accounts more widely so we can keep using it for free.

GuoFan1996 commented 5 months ago

Confirmed that it works without a Browserless.io key, with no errors. The key is only needed for YouTube links; other links work even if the key is not set.

GuoFan1996 commented 5 months ago

> Visit Browserless.io to generate an API key, for this part, did you choose some free trial for a week? if that's the case, we should widely register our account to keep it as free

Yes, using the Browserless free trial is feasible. I just tried it, fixed a small bug, and committed the fix. See my latest commit.

mlim-usfca / PersonalKnowledge

feat: update extractContent to allow auth check and youtube transcrip… #25
