mrjleo / boilernet

Boilerplate Removal using Deep Learning
MIT License

Could I get the URLs of the dataset in 'GoogleTrend-2017'? #11

Closed dreamwayjgs closed 3 years ago

dreamwayjgs commented 3 years ago

Hello! I'm trying to make an MHTML version of the Google dataset (to preserve the CSS data). To do that, I need to save each page as MHTML again, but I cannot find a list of the URLs.

I'd also like to create more Google datasets (GoogleTrend-2019, 2020, ...), in particular for the period after COVID-19.

Section 4.1, 'Dataset Preparation', of the paper says:

We obtained the HTML files by retrieving the first 100 results for each trending Google query from the year 2017. From the resulting pool of websites we randomly sampled a set of 180 documents and annotated them.

I read this as saying that the Google Trends page provides over 100 queries, but the page shows a dozen categories and the 'Searches' category has only 10 items. So I'm confused about what 'the first 100 results' refers to. My guesses:

  1. The Google Trends page has changed since then: there used to be over 100 queries in the 'Searches' category, but not anymore.
  2. It is the first 100 regardless of category (Searches, People, and so on).

Which one is right? And how can I get a list of the URLs in the dataset?

mrjleo commented 3 years ago

Hey,

I'm trying to make an MHTML version of the Google dataset (to preserve the CSS data). To do that, I need to save each page as MHTML again, but I cannot find a list of the URLs.

Unfortunately, the list of URLs is not available anymore. Sorry about that :(

I read this as saying that the Google Trends page provides over 100 queries, but the page shows a dozen categories and the 'Searches' category has only 10 items. So I'm confused about what 'the first 100 results' refers to. My guesses:

  1. The Google Trends page has changed since then: there used to be over 100 queries in the 'Searches' category, but not anymore.
  2. It is the first 100 regardless of category (Searches, People, and so on).

Which one is right? And how can I get a list of the URLs in the dataset?

Actually, this refers to the top-100 results, i.e. hits from Google, for each query. The trends page looks unchanged to me. If I remember correctly, we took the first query/queries from each category, retrieved the top-100 results using the Google API and then sampled from those pages.
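For anyone who wants to rebuild a similar dataset, the procedure described above could be sketched roughly as follows. This is not the original collection code: `fetch_top_results` is a hypothetical callable standing in for whatever search API you use, and the de-duplication of URLs and the fixed random seed are assumptions added for illustration. The defaults of 100 results per query and 180 sampled documents come from the paper excerpt quoted in this thread.

```python
import random
from typing import Callable, Iterable, List


def build_page_pool(
    queries: Iterable[str],
    fetch_top_results: Callable[[str, int], List[str]],
    results_per_query: int = 100,
) -> List[str]:
    """Collect the top results for every query into one pool of URLs.

    `fetch_top_results(query, limit)` is a hypothetical search function
    supplied by the caller; it should return result URLs in rank order.
    """
    pool: List[str] = []
    seen = set()
    for query in queries:
        # Keep at most `results_per_query` hits, even if more are returned.
        for url in fetch_top_results(query, results_per_query)[:results_per_query]:
            # Assumption: a URL returned for several queries is kept once.
            if url not in seen:
                seen.add(url)
                pool.append(url)
    return pool


def sample_documents(pool: List[str], sample_size: int = 180, seed: int = 0) -> List[str]:
    """Randomly sample documents from the pool (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))


def fake_search(query: str, limit: int) -> List[str]:
    """Stand-in search function that fabricates placeholder URLs for the demo."""
    return [f"https://example.com/{query}/{rank}" for rank in range(limit)]


pool = build_page_pool(["query-a", "query-b"], fake_search)
sample = sample_documents(pool)
print(len(pool), len(sample))  # 200 180
```

To run this against real data, replace `fake_search` with a function that queries your search API of choice, and pass in the first query or queries from each Google Trends category. Keeping the sampled URL list alongside the saved pages makes it possible to re-download them later, for example as MHTML.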

dreamwayjgs commented 3 years ago

Oh, results, not a query. I got it. Thank you! Much obliged.