rahular / varta

https://arxiv.org/abs/2305.05858
Apache License 2.0

Time required to download 41M data #3

Closed sbmaruf closed 1 year ago

sbmaruf commented 1 year ago

The source website is easily overloaded by many concurrent API requests. I was wondering:

  1. What exact values did you use for the following variables in the settings file when you scraped the data?
    CONCURRENT_REQUESTS = ?
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = ?
    # The download delay setting will honor only one of:
    CONCURRENT_REQUESTS_PER_DOMAIN = ?
    CONCURRENT_REQUESTS_PER_IP = ?
  2. How much time did it take to scrape the full data (41M samples)?
  3. Is there any way you can share the scraped data? That way, the source website won't have to respond to millions of API requests.
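
To put question 2 in perspective, here is a rough back-of-the-envelope estimate. It is only a sketch and is not taken from the repository: the function name `estimate_crawl_hours`, the assumption of one request per sample, and all numeric values are hypothetical. It models single-domain throughput as the smaller of two limits: the rate allowed by concurrency (in-flight requests divided by average latency) and the rate allowed by `DOWNLOAD_DELAY` (at most one request start per delay interval).

```python
def estimate_crawl_hours(n_requests, download_delay_s, concurrency, avg_latency_s):
    """Rough lower bound (in hours) on crawl time against a single domain.

    Assumes one request per sample and a steady request rate; retries,
    bans, and throttling by the server are ignored.
    """
    if concurrency <= 0 or avg_latency_s <= 0:
        raise ValueError("concurrency and avg_latency_s must be positive")
    if download_delay_s < 0:
        raise ValueError("download_delay_s must be non-negative")

    # Limit from concurrency: each in-flight slot finishes one request per latency.
    rate_from_concurrency = concurrency / avg_latency_s
    # Limit from the delay: at most one request is started per delay interval.
    rate_from_delay = 1.0 / download_delay_s if download_delay_s > 0 else float("inf")

    requests_per_second = min(rate_from_concurrency, rate_from_delay)
    return n_requests / requests_per_second / 3600.0


# Hypothetical values, NOT the settings used by the authors.
hours = estimate_crawl_hours(
    n_requests=41_000_000,
    download_delay_s=0.25,
    concurrency=8,
    avg_latency_s=0.5,
)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

Under these assumed numbers the delay is the bottleneck (4 requests per second), so the crawl would take months on a single machine, which is why a shared copy of the data matters.
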
rahular commented 1 year ago

The full data is now available: https://huggingface.co/datasets/rahular/varta
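
For anyone who wants to look at the data without downloading all of it, a minimal sketch using the Hugging Face `datasets` library in streaming mode might look like the following. The helper name `preview_varta` is hypothetical, and the split name `"train"` is an assumption; please check the dataset card for the actual splits and fields.

```python
from itertools import islice


def preview_varta(n=3, split="train"):
    """Return the first n records of the dataset without a full download.

    The split name is an assumption; see the dataset card for the real ones.
    """
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of fetching every file.
    stream = load_dataset("rahular/varta", split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for record in preview_varta():
        print(record)
```

Streaming avoids pulling the full 41M samples just to inspect a few records.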