The open-source repository mev.fyi aggregates research on Maximal Extractable Value (MEV). Explore curated academic papers, community contributions, and educational content on MEV and related topics.
[feature] Crawl all non-medium websites to fetch all articles #22
Input: website URLs. Output: a dict mapping each website to all of that website's article URLs (handling pagination)
Approach: have a general script to which you pass config items for each website.
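As a rough illustration of that approach, here is a minimal sketch of a config-driven crawler; the config keys `url`, `article_selector`, and `next_page_selector` are placeholders for illustration, not the actual config skeleton:

```python
# Minimal sketch of a config-driven article crawler; the config keys used here
# ("url", "article_selector", "next_page_selector") are assumptions.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_site(config: dict) -> list[str]:
    """Collect all article URLs for one website, following pagination."""
    article_urls: list[str] = []
    page_url = config["url"]
    while page_url:
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Each site supplies its own CSS selector for links to articles.
        for link in soup.select(config["article_selector"]):
            article_urls.append(urljoin(page_url, link["href"]))
        # Follow the site's "next page" link, if any, to handle pagination.
        next_selector = config.get("next_page_selector")
        next_link = soup.select_one(next_selector) if next_selector else None
        page_url = urljoin(page_url, next_link["href"]) if next_link else None
    return article_urls


def crawl_all(configs: dict[str, dict]) -> dict[str, list[str]]:
    """Map each website to the list of all its article URLs."""
    return {site: crawl_site(cfg) for site, cfg in configs.items()}
```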
Work in progress:
Fix pagination
Make sure it works for all websites; the config skeleton might need to be updated along the way
If no articles are found on a website, first visit the site itself and check whether other index URLs are available (e.g. other indexes like /technology or /writing [...])
If there are new websites and the config file already exists, add empty config items for them to the existing config file
If new index pages become available on a site (e.g. a /technology index when we only added /writing), append that /technology URL to to_parse.csv (see the sketch after this list)
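For the last item, a small sketch of how newly discovered index pages could be appended to to_parse.csv, assuming a single `url` column (the real column layout and file location may differ):

```python
# Sketch: append newly discovered index pages (e.g. a /technology index)
# to to_parse.csv. The single "url" column is an assumption.
import csv
from pathlib import Path


def append_new_index_pages(csv_path: Path, discovered_urls: list[str]) -> None:
    """Append index URLs that are not yet listed in to_parse.csv."""
    existing: set[str] = set()
    write_header = not csv_path.exists()
    if csv_path.exists():
        with csv_path.open(newline="") as f:
            existing = {row["url"] for row in csv.DictReader(f)}
    new_urls = [u for u in discovered_urls if u not in existing]
    if not new_urls:
        return
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url"])
        if write_header:
            writer.writeheader()
        writer.writerows({"url": u} for u in new_urls)
```

For example, `append_new_index_pages(Path("to_parse.csv"), ["https://example-author-blog.xyz/technology"])` would add that index page only if it is not already listed.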
Challenges:
Make sure it works for pagination
Make code general and robust. Abstract all the complexity into the config items. We can expect several containers, each with its own selectors, for each site (see the example config below)
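One possible shape for such a config item, extending the flat sketch above to several containers per site (every key and selector here is invented for illustration):

```python
# Illustrative config item only; the real config skeleton may differ.
EXAMPLE_CONFIG = {
    "https://example-author-blog.xyz": {
        # A single site can expose articles under several indexes,
        # each needing its own container and selectors.
        "containers": [
            {
                "index_url": "https://example-author-blog.xyz/writing",
                "article_selector": "div.post-list a.post-title",
                "next_page_selector": "a.pagination-next",
            },
            {
                "index_url": "https://example-author-blog.xyz/technology",
                "article_selector": "article h2 a",
                "next_page_selector": None,  # this index has no pagination
            },
        ],
    },
}
```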
End goal:
Get all the unique author blog posts, then crawl the authors' websites. Once all unique article URLs are indexed, scrape every article and add it to the database.
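A compressed sketch of that end-to-end flow; `scrape_article` and `add_to_database` are hypothetical placeholders for stages handled outside this issue:

```python
# Sketch of the end goal; scrape_article and add_to_database are
# hypothetical placeholders for separate tasks.
def scrape_article(url: str) -> dict:
    ...  # fetch and parse one article (separate task)


def add_to_database(article: dict) -> None:
    ...  # persist the parsed article for mev.fyi (separate task)


def ingest_all(site_to_urls: dict[str, list[str]]) -> None:
    """Once every unique article URL is indexed, scrape and store each article."""
    for urls in site_to_urls.values():
        for url in urls:
            add_to_database(scrape_article(url))
```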
Expected cost: 2-3 hours to reach >50% of websites covered. Challenges: possibly numerous updates to the config format.
FAQ
Task: Obtain a list of all article URLs for each website
Classes are not important as long as the file works
Input: called from the CLI with no arguments
Output: a dict mapping each website to the list of all its article links
How does the code obtain the list of articles to crawl? -> Via the config file generated from websites.csv; all that is needed now is to update the selectors for each website (see the CLI sketch at the end of this FAQ)
Medium articles should NOT be crawled, because valmeylan is already working on them
How do I know which websites should NOT be crawled because they only have one article? -> Modify an existing file created recently by valmeylan
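A possible skeleton for that no-argument CLI flow, assuming websites.csv has a `website` column and that per-site config items start out empty until their selectors are filled in:

```python
# Sketch of the no-argument CLI flow described above. The "website" column name
# in websites.csv and the empty-config skeleton shape are assumptions.
import csv
from pathlib import Path

WEBSITES_CSV = Path("data/links/websites.csv")


def load_config_skeleton(csv_path: Path) -> dict[str, dict]:
    """One config item per website in websites.csv; Medium is skipped because
    those articles are crawled separately."""
    with csv_path.open(newline="") as f:
        return {
            row["website"]: {"containers": []}  # selectors filled in per site later
            for row in csv.DictReader(f)
            if "medium.com" not in row["website"]
        }


def main() -> None:
    configs = load_config_skeleton(WEBSITES_CSV)
    # Sites whose selectors are still empty are the remaining work items.
    missing = [site for site, cfg in configs.items() if not cfg["containers"]]
    print(f"{len(configs)} sites in config, {len(missing)} still need selectors")


if __name__ == "__main__":
    main()
```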
TODO
data/links/websites.csv. Visualize websites at data.mev.fyi on the Websites tab.