Open rmusser01 opened 4 months ago
Add hashing of ingested article content to identify changes made between scrapes
Modify the headless browser to support injecting cookies, user:pass credentials, and plugins
txt files stored in a git repo: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/git.py
Issue to track improvements/ideas for URL Scraping & Ingestion
It seems I could skip all of this by using ArchiveBox: https://github.com/ArchiveBox/ArchiveBox/wiki + https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#archive-method-toggles The idea would be to archive the page via the ArchiveBox CLI, extract the text with Trafilatura, and then make any modifications to the data from there.
Browser Plugin
Ingestion
Replicating structure using markdown https://github.com/AnswerDotAI/web2md/tree/main
Scraping:
Spoofing client
Storage: https://github.com/iansinnott/full-text-tabs-forever
async: https://jacobpadilla.com/articles/recreating-asyncio + https://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html
Site-Specific stuff
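
For the content-hashing item above, here is a minimal sketch of how change detection between scrapes might look. It is only an illustration, not the project's implementation: the function names `content_hash` and `has_changed`, the in-memory `seen` dict, and the choice to normalize whitespace before hashing are all assumptions. A real version would presumably persist the hashes in the project's database.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of whitespace-normalized text.

    Collapsing whitespace first (an assumption, not a requirement) keeps
    purely cosmetic reflows from registering as content changes.
    """
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(url: str, text: str, seen: dict) -> bool:
    """Return True if `text` is new or differs from the last scrape of `url`.

    `seen` is a hypothetical url -> digest store; it is updated in place.
    """
    digest = content_hash(text)
    changed = seen.get(url) != digest
    seen[url] = digest
    return changed
```

Usage would be something like calling `has_changed(url, extracted_text, seen)` after each scrape and only re-ingesting when it returns `True`. Hashing the extracted article text rather than the raw HTML is probably the safer choice, since ads and timestamps in the markup change on nearly every request.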