philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0

cache: failed to insert cache key #48

Open dynabler opened 8 months ago

dynabler commented 8 months ago

Got this error when running Flyscrape

cache: failed to insert cache key "GET https://example.com/shoes": UNIQUE constraint failed: cache.key

The error was caused by a change in the URL. When I wrote the script it was https://example.com/shoes, but in the meantime it changed to https://example.com/amazing-shoes.
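For context, the message has the shape of a database uniqueness error: a row with that key already exists when a second insert is attempted. The sketch below is only an illustration using a hypothetical in-memory class (`UniqueKeyCache`, `cacheKey`); it does not reflect how flyscrape actually stores its cache, and it does not establish why the same key ends up being inserted twice after a redirect.

```javascript
// Hypothetical in-memory stand-in for a cache table with a UNIQUE key column.
// This is NOT flyscrape's implementation; all names are illustrative.
class UniqueKeyCache {
  constructor() {
    this.rows = new Map();
  }

  // Plain insert: fails when the key already exists, like a UNIQUE constraint.
  insert(key, value) {
    if (this.rows.has(key)) {
      throw new Error(`UNIQUE constraint failed: cache.key (${key})`);
    }
    this.rows.set(key, value);
  }

  // Upsert: overwrites an existing row instead of failing.
  upsert(key, value) {
    this.rows.set(key, value);
  }
}

// Build a key in the "METHOD URL" shape seen in the error message.
function cacheKey(method, url) {
  return `${method} ${url}`;
}

const cache = new UniqueKeyCache();
const key = cacheKey("GET", "https://example.com/shoes");
cache.insert(key, "first response");

let insertError = null;
try {
  cache.insert(key, "second response"); // same key again
} catch (err) {
  insertError = err.message;
}
console.log(insertError);
```

With a plain insert the second write is rejected, while an upsert would silently replace the stored entry. Which behavior is preferable for a scraper cache is a design choice and is not decided here.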

Perhaps it could output a clearer message, along the lines of "WARNING": "Forgot to call text(), html() or attr()?"

philippta commented 8 months ago

Unfortunately I can not replicate this issue. Could you add a bit more detail or a possible script that produces this?

In any case, the message you're seeing is only a warning, so it does not interrupt the scraping process.

dynabler commented 8 months ago

Unfortunately I can not replicate this issue. Could you add a bit more detail or a possible script that produces this?

It's a very difficult warning to reproduce, because it only occurs when two specific things happen together: the original URL has changed AND the old URL redirects to the new one. If the original URL has changed but does not redirect, flyscrape just stops because of the 404. Here is an example:

export const config = {
  urls: [
    ...range("https://www.example.com/83743/category/amazing-movies.html?page={}", 1, 29),
  ], // the URL at the time the script was written
  follow: [
    ".item-title > a",
  ],
  cache: "file",
  depth: 1,
  rate: 60,
  output: {
    file: "amazing_movies.json",
    format: "json"
  },
  headers: {
    "User-Agent":""
  }
};

function range(url, from, to) {
  return Array.from({length: to - from + 1}).map((_, i) => url.replace("{}", i + from));
}

export default function({ doc, absoluteURL }) {
    const title = doc.find('h1');
    const price = doc.find('.product_main > .price_color')
    const stock = doc.find('.product_main > .availability')

    return {
        title: title.text(),
        price: price.text(),
        stock: stock.text().trim()
    }
}

A few days later the URL changed AND the old URL redirected to the new page:

export const config = {
  urls: [
    ...range("https://www.example.com/83743/category/stunning-movies.html?page={}", 1, 29),
  ], // the URL has changed; the old /amazing-movies/ now redirects to /stunning-movies/
  follow: [
    ".item-title > a",
  ],
  cache: "file",
  depth: 1,
  rate: 60,
  output: {
    file: "amazing-movies.json",
    format: "json"
  },
  headers: {
    "User-Agent":""
  }
};

function range(url, from, to) {
  return Array.from({length: to - from + 1}).map((_, i) => url.replace("{}", i + from));
}

export default function({ doc, absoluteURL }) {
    const title = doc.find('h1');
    const price = doc.find('.product_main > .price_color')
    const stock = doc.find('.product_main > .availability')

    return {
        title: title.text(),
        price: price.text(),
        stock: stock.text().trim()
    }
}

In any case, the message you're seeing is only a warning, so it does not interrupt the scraping process.

Confirmed, it doesn't interrupt the scraping process. But it keeps going and scrapes null. I was thinking of something like #34, where 301 and 302 (and maybe 307) responses are detected and the user is warned.
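To make the idea concrete, here is a minimal sketch of such a check as a standalone Node.js script, run outside flyscrape. The function names (`redirectWarning`, `checkForRedirect`) and the warning wording are made up for illustration, and nothing here is an existing flyscrape feature. It assumes Node.js 18+ for the built-in `fetch`, and that the server answers HEAD requests the same way as GET.

```javascript
// Status codes named in the comment above; treating exactly these as
// "worth a warning" is an assumption of this sketch.
const REDIRECT_STATUSES = new Set([301, 302, 307]);

// Return a warning string for a redirect response, or null otherwise.
// The arguments are plain values, so this works with any HTTP client.
function redirectWarning(requestedUrl, status, location) {
  if (!REDIRECT_STATUSES.has(status)) {
    return null;
  }
  // Resolve a relative Location header against the requested URL.
  const target = location
    ? new URL(location, requestedUrl).href
    : "(no Location header)";
  return `WARNING: ${requestedUrl} answered ${status} and redirects to ${target}`;
}

// Optional live check. In Node.js, redirect: "manual" keeps fetch from
// following the redirect, so the original status code and Location header
// stay visible to the caller.
async function checkForRedirect(url) {
  const res = await fetch(url, { method: "HEAD", redirect: "manual" });
  return redirectWarning(url, res.status, res.headers.get("location"));
}

// Offline example using the URLs from the scripts above.
console.log(redirectWarning(
  "https://www.example.com/83743/category/amazing-movies.html?page=1",
  301,
  "/83743/category/stunning-movies.html?page=1"
));
```

Running `checkForRedirect` on each entry of `urls` before starting a scrape would flag a renamed category page up front, instead of leaving the run to continue with null results.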