philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0
1.02k stars 29 forks

Unable to scrape some site, even with headers - "Enable JavaScript and cookies to continue" #27

Closed foloey closed 7 months ago

foloey commented 8 months ago

Code example:

export const config = {
  url: "https://www.onlyluve.com/",
  headers: {
    "accept": "*/*",
    "accept-language": "zh-CN,zh;q=0.9",
    "content-type": "application/json",
    "sec-ch-ua": "\"Not_A Brand\";v=\"8\", \"Chromium\";v=\"120\", \"Google Chrome\";v=\"120\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Linux\"",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "sentry-trace": "f38b3e0798eb4bfc881bd4280c3317dd-bf7c17b9f0440896-1",
    "cookie": "n_u=765d84ff2e30f99437b93f1d8e010d12; f_ds_info=9qCtGm7tMKXrls/rpZy8ZTpzmpEEr7UgzcZ7AEngsur+KO8uiQzk2R1qRjHDyW5SYEZKj5kOO+OxU5eNBLt8HQ==; f_ds_info.sig=hSVHkomBcQ49RyOZqiaxWlWz0f6JRWWcd2OflhQEaOg; store_id=1644388412791; store_id.sig=SihCPk1oHJZcfOXIwR981spewYBTXy7dPXV4G3nBVEs; merchant_id=2000480784; merchant_id.sig=5ekNGZYgcZQisK1T6FQLVzzUSrcP0s0_YeKlqCfvE4Y; currency_code=USD; currency_code.sig=nEGddW1-E-8oJfI_Pm_5XNzC2sMi1n3aVzZ3v01csyY; localization=US; lang=en; lang.sig=HPZEXM6qRQA3fl9QF0Gl5KM_KZ7FwUtDpVV9UEUrrek; addressLang=en; addressLang.sig=fZhLaUxh_564Gt_Ygb8agf56cVb1lYYp6NMpk7wfgaM; userSelectLocale=en; userSelectLocale.sig=xaWhkiDLccJKOWtBx98z0KVVx7o_iP0WoEYPBrEqJCw; store_block_region_status=0; currency_code_userSetting=USD; currency_code_userSetting.sig=wreMdGqvcOcZfYXi-Fd1QDxl5OWoQm3s2QLyXkCpvxE; n_sess={\"session_id\":\"61bc25fc-95d9-4589-ac10-7d93c942e8fd\",\"created_at\":1703331616495,\"last_session_id\":\"\",\"session_create_type\":101}; _tracking_consent=%7B%22con%22%3A%7B%22GDPR%22%3A%22%22%7D%2C%22v%22%3A%221.0%22%2C%22lim%22%3A%5B%5D%2C%22reg%22%3A%22%22%7D; __cf_bm=JubuiVbchbq_eSuRevaeVRnd_k88MfnzhCsI0OX4YoQ-1703331616-1-AR5ysAgK3XgxsQkSAlVkcV/TTI/qUsZgecEDBuMldNNzsGyilF1VhlLX0dtjQ2KW/rXeqtLnYMSuHQLjWirtApc=; log_session_id=4050ebff-d155-434e-97d7-868fce33b989; lp_url={%22landingPageHtml%22:%22https://www.onlyluve.com/%22%2C%22occurredAt%22:1703331618674}; s_id=786D5D83CD1C0F48302A4480E427BB4D; s_id.sig=d278c6161c688c7cd1a83c98250419d9; t_cart=55a8716e8a6e4cde948ca2abf553918b; t_cart.sig=ccfb118238dc3e4e47036cb310d19188",
    "Referer": "https://www.onlyluve.com/products/mens-luxury-suede-leather-fur-coat",
    "Referrer-Policy": "strict-origin-when-cross-origin"
  }
}

export default function({ doc }) {
  // Read the <title> element of the fetched page.
  const title = doc.find("title");

  return {
    title: title.text(),
    // Full HTML of the response, included for debugging.
    document: doc.html(),
  };
}

The result contains something like <span id=\"challenge-error-text\">Enable JavaScript and cookies to continue</span> instead of the actual page.

philippta commented 8 months ago

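Hey, it looks like the site you're trying to scrape is protected by Cloudflare's bot protection. It serves a JavaScript challenge page instead of the real content, and a plain HTTP request cannot pass that challenge, even with headers and cookies copied from a browser.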
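
To make this kind of block easier to spot in the scraped output, here is a minimal sketch of a check for the challenge page. The helper names `isChallengePage` and `scrapeResult` are hypothetical and not part of flyscrape. The marker strings are taken from the output quoted in this issue, and it is an assumption that they cover every variant of the challenge page.

```javascript
// Markers taken from the blocked response quoted in this issue.
// Assumption: other variants of the challenge page may use different markup.
const CHALLENGE_MARKERS = [
  'id="challenge-error-text"',
  "Enable JavaScript and cookies to continue",
];

// Hypothetical helper: true when the HTML looks like a challenge page.
function isChallengePage(html) {
  return CHALLENGE_MARKERS.some((marker) => html.includes(marker));
}

// Hypothetical helper: builds the scrape result, flagging blocked responses
// instead of returning the challenge page as if it were real content.
function scrapeResult(html, title) {
  if (isChallengePage(html)) {
    return { blocked: true, reason: "challenge page returned" };
  }
  return { blocked: false, title: title };
}
```

Inside the script's default function this could be called as `scrapeResult(doc.html(), doc.find("title").text())`. It only detects the block; it does not get around it.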

philippta commented 8 months ago

Hey @foloey, I am currently working on a hosted proxy service that can render these websites in Chrome. It would also allow you to access this protected website. You can keep your script as it is; you only have to set the service as the proxy.

Let me know if you are still interested and I can share the details.
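
As a rough sketch of what "set it as the proxy" could look like: this assumes flyscrape reads proxy URLs from a `proxies` array in the config. The proxy address is a placeholder because the address of the hosted service is not given in this thread, and `withProxy` is a hypothetical helper.

```javascript
// Placeholder address; the real endpoint of the hosted service is not known here.
const PROXY_URL = "http://proxy.example.com:8080";

// Hypothetical helper: returns a copy of a config with the proxy appended.
// Assumption: flyscrape picks up proxy URLs from the `proxies` array.
function withProxy(config, proxyUrl) {
  return { ...config, proxies: [...(config.proxies || []), proxyUrl] };
}

// The rest of the config (url, headers) stays unchanged.
const config = withProxy({ url: "https://www.onlyluve.com/" }, PROXY_URL);
```

In an actual script the resulting object would be exported as `config`, and the scrape function would stay the same.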

philippta commented 7 months ago

Closing because of inactivity.