Closed ZJULiHongxin closed 7 months ago
We first randomly selected several CDX files from Common Crawl to obtain a candidate URL set (more than 50M URLs). Since these URLs contain many duplicates from the same domain, we kept only one URL per domain to ensure diversity of websites and layouts. For example, if www.google.com/news has been selected, URLs like www.google.com/shopping will not be included. In the end, we obtained about 300k URLs, each from a distinct domain. Beyond this, we did not check the popularity or diversity of the web pages.
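The per-domain deduplication described above can be sketched roughly as follows (a minimal illustration of the idea, not the actual script used; the function name and input list are made up for this example):

```python
from urllib.parse import urlparse

def keep_one_url_per_domain(urls):
    """Keep only the first URL encountered for each domain (hostname)."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

candidates = [
    "https://www.google.com/news",
    "https://www.google.com/shopping",  # dropped: same domain as the URL above
    "https://example.org/index.html",
]
print(keep_one_url_per_domain(candidates))
```

Applied to the full candidate set, this kind of pass reduces the 50M+ URLs down to one representative page per domain.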
Let me know if you have any other questions.
Got it. Thank you very much for your reply! @njucckevin
@njucckevin Hello! Thank you for open-sourcing this great work.
I'm a bit curious about how you collected the training data samples from public data sources.
The paper mentions: "We collect approximately 300k web pages from the latest Common Crawl repository to serve as our training data for web UI." Could you please provide some hints about how you selected the ~300k pages from Common Crawl? Did you consider the popularity or complexity of the selected web pages?
I would appreciate it if you could provide some hints. Thanks!