rmusser01 / tldw

Too Long, Didn't Watch (TL/DW): Your Personal Research Multi-Tool - Like an open-source NotebookLM
Apache License 2.0

Improvement: Improve URL Scraping/Ingestion #54

Status: Open. Opened by rmusser01 4 months ago.

rmusser01 commented 4 months ago

Issue to track improvements/ideas for URL Scraping & Ingestion

It seems I could skip most of this by using ArchiveBox (https://github.com/ArchiveBox/ArchiveBox/wiki, configured through its archive method toggles: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#archive-method-toggles). The idea would be to archive the page via the ArchiveBox CLI, extract the text with Trafilatura, and then make any modifications to the data from there.
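A rough sketch of that archive-then-extract flow, to make the idea concrete. It assumes the `archivebox` CLI is installed and `archivebox init` has already been run in the data directory, and that Trafilatura is installed. The function names and the `extractor` parameter are hypothetical, and the path to the saved HTML snapshot is left to the caller because it depends on which archive methods are enabled.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def archive_url(url: str, data_dir: str) -> None:
    """Archive one URL with the ArchiveBox CLI.

    Assumes `archivebox init` was already run inside data_dir.
    """
    subprocess.run(["archivebox", "add", url], cwd=data_dir, check=True)


def extract_text(
    html_path: str,
    extractor: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Read a saved HTML snapshot and return its main text.

    `extractor` defaults to trafilatura.extract. It is a parameter only so
    the function can be exercised without the dependency installed
    (hypothetical design choice, not part of either tool).
    """
    if extractor is None:
        import trafilatura  # third-party: pip install trafilatura

        extractor = trafilatura.extract
    html = Path(html_path).read_text(encoding="utf-8", errors="replace")
    # trafilatura.extract returns None when it finds no main content.
    return extractor(html) or ""
```

Further cleanup or chunking would then operate on the string returned by `extract_text`, which keeps the archiving step and the text-processing step independent of each other.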

Browser Plugin

Ingestion

Replicating structure using Markdown: https://github.com/AnswerDotAI/web2md/tree/main

Scraping:

Spoofing client
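One possible reading of "spoofing client" is sending browser-like request headers instead of the default library User-Agent. A minimal standard-library sketch of that, where the header values are placeholders I picked for illustration and the helper names are hypothetical:

```python
import urllib.request

# Hypothetical header set imitating a desktop browser. The exact values are
# placeholders and would need to track whichever browser is being imitated.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a request object carrying the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page with the spoofed headers and decode the body."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Headers alone will not get past sites that fingerprint the TLS handshake or require JavaScript, so this only covers the simplest case.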

Storage: https://github.com/iansinnott/full-text-tabs-forever

Async: https://jacobpadilla.com/articles/recreating-asyncio and https://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html
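For the async side, a small sketch of what a bounded-concurrency scraper could look like with `asyncio`. This is not taken from either linked article. The `crawl` and `fake_fetch` names are hypothetical, and the real fetch coroutine (for example one built on an async HTTP client) is passed in as a parameter rather than assumed.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple


async def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch every unique URL, with at most max_concurrency in flight.

    Returns (pages, errors): pages maps URL to body, errors maps URL to
    the error message of a failed fetch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    pages: Dict[str, str] = {}
    errors: Dict[str, str] = {}

    async def worker(url: str) -> None:
        async with semaphore:  # limits how many fetches run at once
            try:
                pages[url] = await fetch(url)
            except Exception as exc:  # one bad URL should not stop the rest
                errors[url] = str(exc)

    # dict.fromkeys de-duplicates while preserving order
    await asyncio.gather(*(worker(u) for u in dict.fromkeys(urls)))
    return pages, errors


async def fake_fetch(url: str) -> str:
    """Stand-in for a real network fetch, used only for illustration."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"
```

Calling `asyncio.run(crawl(["https://example.com/a", "https://example.com/b"], fake_fetch))` exercises the flow without touching the network. Swapping `fake_fetch` for a real coroutine is the only change needed to scrape live pages.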

Site-Specific stuff

rmusser01 commented 4 months ago

Add hashing of ingested article content to identify changes made between scrapes
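A minimal sketch of how that could work, assuming SHA-256 over whitespace-normalized article text. The function names are hypothetical, and the normalization step is my assumption: it keeps formatting-only differences between scrapes from registering as content changes.

```python
import hashlib
import re
from typing import Optional


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text with whitespace collapsed."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash: Optional[str], new_text: str) -> bool:
    """True if the new scrape differs from the stored hash.

    A previous_hash of None means the article has not been seen before.
    """
    return previous_hash != content_fingerprint(new_text)
```

The digest would be stored next to the article at ingestion time. On the next scrape, `has_changed(stored_hash, new_text)` decides whether to re-ingest.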

rmusser01 commented 4 months ago

Modify the headless browser setup to allow injecting cookies, user:pass credentials, and plugins
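A sketch of the cookie and user:pass part, assuming Playwright as the headless browser (this issue does not name one, so that choice is mine) and assuming "user:pass" means HTTP authentication. The helper names are hypothetical. Plugin loading is not covered here.

```python
from typing import Dict, List, Optional


def context_options(
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, object]:
    """Build keyword arguments for Playwright's browser.new_context()."""
    opts: Dict[str, object] = {}
    if username is not None and password is not None:
        # http_credentials is Playwright's option for HTTP authentication
        opts["http_credentials"] = {"username": username, "password": password}
    return opts


def fetch_rendered(
    url: str,
    cookies: Optional[List[dict]] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> str:
    """Load a page in headless Chromium and return the rendered HTML.

    Each cookie is a dict with "name", "value", and either "url" or
    "domain" plus "path", as Playwright's add_cookies() expects.
    """
    # third-party: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(**context_options(username, password))
            if cookies:
                context.add_cookies(cookies)
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```

A call might look like `fetch_rendered("https://example.com/", cookies=[{"name": "session", "value": "abc", "url": "https://example.com/"}])`, with the cookie values here being placeholders.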

rmusser01 commented 3 weeks ago

https://github.com/muchdogesec/history4feed

rmusser01 commented 3 weeks ago

Confluence scraping: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/confluence.py

rmusser01 commented 3 weeks ago

HTML: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/html.py

rmusser01 commented 3 weeks ago

Gitbook: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/gitbook.py

rmusser01 commented 3 weeks ago

Text files stored in a Git repo: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/git.py

rmusser01 commented 1 week ago

https://github.com/TheBlewish/Web-LLM-Assistant-Llama-cpp

rmusser01 commented 4 days ago

https://github.com/paul-gauthier/aider/blob/main/aider/scrape.py