The open-source repository mev.fyi aggregates research on Maximal Extractable Value (MEV). Explore curated academic papers, community contributions, and educational content on MEV and related topics.
MIT License
[feature] extract all unique blog websites from articles #15
TODO
Extract the website from each link under the article header. Example:
https://ethresear.ch/t/burning-mev-through-block-proposer-auctions/14029 -> https://ethresear.ch/t/
https://taiko.mirror.xyz/7dfMydX1FqEx9_sOvhRt3V8hJksKSIWjzhCVu7FyMZ -> https://taiko.mirror.xyz/
https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff -> https://figmentcapital.medium.com/
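The mappings above could be sketched in Python roughly as follows. This is a minimal sketch, not the repository's code: `website_of` and `PATH_PREFIXES` are hypothetical names, and the rule of keeping the `/t/` prefix for ethresear.ch is inferred only from the first example.

```python
from urllib.parse import urlsplit

# Hypothetical per-host path prefixes to keep. Only ethresear.ch is
# grounded in the examples above (its articles live under /t/).
PATH_PREFIXES = {"ethresear.ch": "/t/"}


def website_of(article_url: str) -> str:
    """Return the website (blog root) that an article URL belongs to."""
    parts = urlsplit(article_url)
    host = parts.netloc.lower()
    prefix = PATH_PREFIXES.get(host, "/")
    # Keep the prefix only if the article path actually starts with it.
    if not parts.path.startswith(prefix):
        prefix = "/"
    return f"{parts.scheme}://{host}{prefix}"
```

Hosts without an entry in `PATH_PREFIXES` fall back to scheme plus host, which is enough for the mirror.xyz and `<author>.medium.com` examples.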
Helper:
The regexp hashmap url_patterns, which identifies whether a link refers directly to an article or to its website (e.g. the author's blog), is available in https://github.com/mev-fyi/data/blob/main/src/populate_csv_files/parse_new_data.py
Challenge:
There can be several matches, e.g. some Medium authors' blog posts are in the format <author>.medium.com/<article> while others are in the format www.medium.com/<author>
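One way to cope with the two Medium formats is to keep one pattern per format and return the blog root from whichever pattern matches. The sketch below is only an illustration under stated assumptions: `MEDIUM_PATTERNS` and `medium_blog` are hypothetical names, not the `url_patterns` map from parse_new_data.py, and the regexes cover only the two formats mentioned above.

```python
import re
from typing import Optional

# Hypothetical patterns, one per Medium URL format.
MEDIUM_PATTERNS = {
    # www.medium.com/<author> (optionally followed by /<article>)
    "medium_path": re.compile(
        r"^https?://(?:www\.)?medium\.com/(?P<author>[^/?#]+)"
    ),
    # <author>.medium.com/<article>; "www" is excluded so the two
    # patterns cannot both match the same URL.
    "medium_subdomain": re.compile(
        r"^https?://(?!www\.)(?P<author>[^./]+)\.medium\.com(?:/|$)"
    ),
}


def medium_blog(url: str) -> Optional[str]:
    """Return the author's blog root for a Medium URL, or None if no pattern matches."""
    for pattern in MEDIUM_PATTERNS.values():
        match = pattern.match(url)
        if match:
            # Normalise to exactly one trailing slash.
            return match.group(0).rstrip("/") + "/"
    return None
```

Excluding `www` from the subdomain pattern keeps the two patterns mutually exclusive, so the result does not depend on the order in which they are tried.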
End goal:
Get all the unique author blogs, then crawl each of those websites. Once all unique articles' URLs are indexed, we scrape all articles and add them to the database.
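The first step of that goal, collecting the unique websites before any crawling, might look like the sketch below. `unique_websites` and `host_root` are hypothetical names, and `host_root` is a deliberately simplistic stand-in (scheme plus host) for whatever extraction rule the project ends up using.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def unique_websites(
    article_urls: Iterable[str], to_website: Callable[[str], str]
) -> List[str]:
    """Map each article URL to its website and drop duplicates, keeping first-seen order."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(to_website(url) for url in article_urls))


def host_root(url: str) -> str:
    """Simplistic stand-in for the real extraction rule: scheme + host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"
```

Passing the extraction rule in as `to_website` keeps the de-duplication step independent of how individual hosts such as Medium are handled.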