mev-fyi / data

The open-source repository mev.fyi aggregates research on Maximal Extractable Value (MEV). Explore curated academic papers, community contributions, and educational content on MEV and related topics.
MIT License

[feature] extract all unique blog websites from articles #15

Open vmeylan opened 7 months ago

vmeylan commented 7 months ago

TODO

Example:

https://ethresear.ch/t/burning-mev-through-block-proposer-auctions/14029 -> https://ethresear.ch/t/
https://taiko.mirror.xyz/7dfMydX1FqEx9_sOvhRt3V8hJksKSIWjzhCVu7FyMZ -> https://taiko.mirror.xyz/
https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff -> https://figmentcapital.medium.com/

Helper:

The regexp hashmap url_patterns identifies whether a link refers directly to an article or to the website hosting it, e.g. the author's blog. It is available in https://github.com/mev-fyi/data/blob/main/src/populate_csv_files/parse_new_data.py

Challenge:

There can be several matches, e.g. some Medium authors' blogs use the format <author>.medium.com/<article>, while others use www.medium.com/<author>
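
One possible way to handle overlapping formats is an ordered list of patterns where the first match wins, with the more specific patterns listed first. The sketch below is an illustration only: BLOG_ROOT_PATTERNS and blog_root are hypothetical names, and the regular expressions are assumptions that do not reproduce the actual url_patterns in parse_new_data.py.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; not the repository's url_patterns.
# Order matters: the first pattern that matches wins, so the more specific
# formats are listed before the more general ones.
BLOG_ROOT_PATTERNS = [
    # <author>.medium.com/<article> -> https://<author>.medium.com/
    ("medium_subdomain", re.compile(r"^(https?://(?!www\.)[\w-]+\.medium\.com/)")),
    # (www.)medium.com/<author>/... -> https://(www.)medium.com/<author>/
    ("medium_path", re.compile(r"^(https?://(?:www\.)?medium\.com/@?[\w.-]+)/?")),
    # <author>.mirror.xyz/<article> -> https://<author>.mirror.xyz/
    ("mirror", re.compile(r"^(https?://[\w-]+\.mirror\.xyz/)")),
    # ethresear.ch/t/<slug>/<id> -> https://ethresear.ch/t/
    ("ethresearch", re.compile(r"^(https?://ethresear\.ch/t/)")),
]


def blog_root(url: str) -> Optional[str]:
    """Return the blog root for an article URL, or None if no pattern matches."""
    cleaned = url.strip()
    for _name, pattern in BLOG_ROOT_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            root = match.group(1)
            # Normalise so every root ends with exactly one trailing slash.
            return root if root.endswith("/") else root + "/"
    return None
```

Returning None for unmatched URLs keeps unknown platforms visible, so they can be reviewed and given their own pattern rather than silently mapped to a wrong root.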

End goal:

Get all the unique author blogs. Then crawl each of those websites. Once all unique article URLs are indexed, scrape all the articles and add them to the database.
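
The first step, collecting the unique blogs, could look roughly like the sketch below. The names host_root and unique_blog_roots are hypothetical. The default extractor is a simplifying assumption that keeps only the scheme and host, so it would not reproduce roots that include a path such as https://ethresear.ch/t/ or www.medium.com/<author>. A pattern-aware extractor, for instance one built on url_patterns, could be passed in as the extract argument instead.

```python
from typing import Callable, Iterable, List, Optional
from urllib.parse import urlsplit


def host_root(url: str) -> Optional[str]:
    """Fallback extractor (assumption): scheme + host, e.g. https://taiko.mirror.xyz/."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def unique_blog_roots(
    article_urls: Iterable[str],
    extract: Callable[[str], Optional[str]] = host_root,
) -> List[str]:
    """Collect each blog root once, preserving first-seen order."""
    seen = set()
    roots: List[str] = []
    for url in article_urls:
        root = extract(url)
        # Skip URLs the extractor cannot handle and roots already recorded.
        if root is not None and root not in seen:
            seen.add(root)
            roots.append(root)
    return roots
```

The resulting list is what the crawl step would iterate over. Crawling and scraping themselves are left out here, since they depend on each platform's page structure.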