rmusser01 / tldw

Too Long, Didn't Watch (TL/DW): Your Personal Research Multi-Tool - like an open-source NotebookLM

Apache License 2.0

166 stars, 5 forks

Improvement: Improve URL Scraping/Ingestion #54

Open rmusser01 opened 4 months ago

rmusser01 commented 4 months ago

Issue to track improvements/ideas for URL Scraping & Ingestion

[ ] Add custom cookie support
[ ] Instructions for adding custom browser add-ons to the scraping browser
[ ] Support for identifying article Title/Author name(s)
[ ] Support for identifying article publish date
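
A minimal sketch of how the title/author/publish-date items might be approached, assuming the page exposes them through common `<meta>` tags (`og:title`, `author`, `article:published_time`). The class and function names here are hypothetical and not part of tldw; many sites use other conventions (JSON-LD, bylines in the body), so this would only be a first pass.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect the <title> text and a few well-known <meta> values."""

    # Assumed mapping of meta keys to output fields; the first match wins.
    META_KEYS = {
        "og:title": "title",
        "author": "author",
        "article:author": "author",
        "article:published_time": "published",
        "date": "published",
    }

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = (attr.get("property") or attr.get("name") or "").lower()
            field = self.META_KEYS.get(key)
            content = attr.get("content")
            if field and content and field not in self.meta:
                self.meta[field] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_article_meta(html):
    """Return a dict with whichever of title/author/published were found."""
    parser = ArticleMetaParser()
    parser.feed(html)
    parser.close()
    meta = dict(parser.meta)
    # Fall back to the <title> element when no og:title is present.
    title_text = "".join(parser.title_parts).strip()
    if "title" not in meta and title_text:
        meta["title"] = title_text
    return meta
```

Fields that cannot be found are simply absent from the result, so the caller can decide whether to fall back to another extractor.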

It seems I could skip all of this by using ArchiveBox (https://github.com/ArchiveBox/ArchiveBox/wiki) together with its archive method toggles (https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#archive-method-toggles): archive the page via the CLI, extract the text with Trafilatura, and modify the data from there.
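
A rough sketch of that archive-then-extract flow. It assumes ArchiveBox is installed and `data_dir` is an already-initialised ArchiveBox data directory, and that Trafilatura is installed. The helper names are hypothetical. Locating the saved HTML inside the archive is left out because it depends on which archive methods are enabled.

```python
import subprocess


def archivebox_add_command(url):
    """Build the CLI call that asks ArchiveBox to snapshot one URL."""
    return ["archivebox", "add", url]


def archive_url(url, data_dir):
    """Run ArchiveBox from inside its data directory (assumed to exist)."""
    subprocess.run(archivebox_add_command(url), cwd=data_dir, check=True)


def extract_text(html):
    """Return the main article text from raw HTML, or None if none is found."""
    import trafilatura  # third-party dependency: pip install trafilatura

    return trafilatura.extract(html)
```

Keeping the archive step and the extraction step separate means the extraction can be rerun on stored HTML without hitting the site again.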

Browser Plugin

https://github.com/deathau/markdownload

Ingestion

https://trafilatura.readthedocs.io/en/latest/troubleshooting.html#beyond-raw-html

Replicating structure using markdown: https://github.com/AnswerDotAI/web2md/tree/main

Scraping:

Spoofing client
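
One simple reading of "spoofing client" is sending browser-like request headers instead of the default library ones. The sketch below assumes that interpretation; the header values are illustrative and `build_request` is a hypothetical helper, not existing tldw code. Sites that fingerprint TLS or run JavaScript checks would need more than this.

```python
import urllib.request

# Illustrative browser-like headers; the values are assumptions.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Return a urllib Request that carries browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

The request can then be passed to `urllib.request.urlopen` or to a custom opener.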

Storage: https://github.com/iansinnott/full-text-tabs-forever

Async: https://jacobpadilla.com/articles/recreating-asyncio and https://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html

Site-Specific stuff

Reddit: https://github.com/JosefAlbers/rd2md

rmusser01 commented 4 months ago

Add hashing of ingested article content to identify changes made between scrapes
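
A minimal sketch of the hashing idea, assuming the extracted article text is hashed after whitespace normalisation so that layout-only differences between scrapes do not register as changes. SHA-256 and the function names are choices made for illustration.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash article text after collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash, new_text):
    """Compare a stored fingerprint against freshly scraped text."""
    return previous_hash != content_fingerprint(new_text)
```

Storing the fingerprint next to each ingested article would let a later scrape decide cheaply whether to re-ingest.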

rmusser01 commented 4 months ago

Modify the headless browser to allow injecting cookies, user:pass credentials, and plugins
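
As a stand-in for the headless-browser case, here is a sketch of the same cookie and user:pass injection using a plain standard-library HTTP client rather than a browser. It assumes cookies are exported in the Netscape `cookies.txt` format. `make_session` is a hypothetical helper, and plugins are out of scope for this approach.

```python
import http.cookiejar
import urllib.request


def make_session(cookie_file=None, url=None, username=None, password=None):
    """Return (opener, jar) preloaded with cookies and optional basic auth."""
    jar = http.cookiejar.MozillaCookieJar()
    if cookie_file:
        # Keep session cookies and expired entries so the export is used as-is.
        jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if username is not None:
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, url, username, password)
        handlers.append(urllib.request.HTTPBasicAuthHandler(manager))
    return urllib.request.build_opener(*handlers), jar
```

Calling `opener.open(url)` would then send the loaded cookies and answer basic-auth challenges for that URL.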

rmusser01 commented 3 weeks ago

https://github.com/muchdogesec/history4feed

rmusser01 commented 3 weeks ago

Confluence scraping: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/confluence.py

rmusser01 commented 3 weeks ago

HTML: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/html.py

rmusser01 commented 3 weeks ago

Gitbook: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/gitbook.py

rmusser01 commented 3 weeks ago

Text files stored in a git repo: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/git.py

rmusser01 commented 1 week ago

https://github.com/TheBlewish/Web-LLM-Assistant-Llama-cpp

rmusser01 commented 4 days ago

https://github.com/paul-gauthier/aider/blob/main/aider/scrape.py