Closed ZJULiHongxin closed 7 months ago
We first randomly selected several CDX files from Common Crawl to obtain a candidate URL set (more than 50M URLs). Since these URLs contain many duplicates from the same domain, we kept only one URL per domain to ensure diversity of websites and layouts. For example, if www.google.com/news has been selected, URLs like www.google.com/shopping will not be included. In the end, we obtained about 300k URLs, each from a distinct domain. Beyond this, we did not check the popularity or diversity of the web pages.
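The per-domain deduplication described above can be sketched roughly as follows (a minimal illustration of the idea, not the actual script used; the function name and input list are made up for this example):

```python
from urllib.parse import urlparse

def keep_one_url_per_domain(urls):
    """Keep only the first URL encountered for each domain (hostname)."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

candidates = [
    "https://www.google.com/news",
    "https://www.google.com/shopping",  # dropped: same domain as the URL above
    "https://example.org/index.html",
]
print(keep_one_url_per_domain(candidates))
```

Applied to the full candidate set, this kind of pass reduces the 50M+ URLs down to one representative page per domain.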
Let me know if you have any other questions.
Got it. Thank you very much for your reply! @njucckevin
@njucckevin Hello! Thank you for open-sourcing this great work.
I'm a bit curious about how you collected the training data samples from public data sources.
The paper mentions: "We collect approximately 300k web pages from the latest Common Crawl repository to serve as our training data for web UI." Could you please provide some hints about how you selected the ~300k pages from Common Crawl? Did you consider the popularity or complexity of the selected web pages?
I would appreciate it if you could provide some hints. Thanks!