philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0
1.04k stars 31 forks source link

Expose SQLite3 API #16

Closed grimnight closed 8 months ago

grimnight commented 11 months ago

Flyscrape already comes with SQLite3 built-in, so it would be possible to expose an API so results could be stored to a SQLite3 database, instead of neing printed to stdout.

If an API gets added, I don't know whether it would be better to have a getenv function to get environment variables to create a path to the database file or would it be better to have a stdpath variable to point to $XDG_DATA_HOME/flyscrape. But I think one or the other would be necessary.

Thank you for making Flyscrape, it looks great.

philippta commented 11 months ago

I like the idea of being able to store results in a SQLIite database. However I would prefer a more generalized approach and not expose db APIs to the script.

Something I had in mind was to set the output format as part of the configuration. E.g.:

export const config = {
    output: {
        type: "sqlite", // or "json", "ndjson", "csv"
        file: "/path/to/db",
        sqlite: {
            // sqlite specific options
            table: "results",
        },
    }
}

I am still unsure what the table should look like as the scraped data could be deeply nested or not.

grimnight commented 11 months ago

You could add output type stdout (that being the default), should there be support for multiple outputs at the same time.

You could infer the keys from top level keys and add the rest of the information as JSON, SQLite haspretty good JSON support these days, including operators ->, ->>. It could also be possible for users to write their own schema in the configuration.

These are just my suggestions.

philippta commented 11 months ago

should there be support for multiple outputs at the same time.

I'm not sure about supporting multiple outputs at the same time. Do you have a compelling reason why someone would want that?

You could infer the keys from top level keys and add the rest of the information as JSON, SQLite haspretty good JSON support these days, including operators ->, ->>. It could also be possible for users to write their own schema in the configuration.

Good suggestion. I think this is the way.

grimnight commented 11 months ago

I'm not sure about supporting multiple outputs at the same time. Do you have a compelling reason why someone would want that?

Compelling reason, no, but keep in mind that tools such as tee exist, so someone might find it useful.

philippta commented 8 months ago

Closing because of inactivity. If this is a more widely requested feature we can open it up again under a new enhacement issue.

alucab commented 8 months ago

JSON, CSV, NDJSON would be very useful options for massive scrapes SQLite would be also a good option as it allows to manage data in a compact form and plenty of tools are available

choice between stdout and file too, possibly rotating.

I know that several of this thing are available as external tools but the value of flyscrape is for me the compactness

alucab commented 8 months ago

In addition consider that JSON is not the best format for very large volumes of data. I would consider at least a transition to NDJSON

philippta commented 8 months ago

@alucab Thanks for the reminder. It looks like I forgot to add the output options to the sample configuration.

You can already configure output files and chose between json and ndjson today.

export const config = { 
    // Specify the output options.
    output: {
        // Specify the output file.                        (default = stdout)
        file: "results.json",

        // Specify the output format.                      (default = json)
        // Options: "json" | "ndjson"
        format: "json",
    },
}

You can also specify these with a parameter like so:

# Set the output format to ndjson.
$ flyscrape run example.js --output.format ndjson

# Write the output to a file.
$ flyscrape run example.js --output.file results.json

Supporting CSV and SQLite is not that tivial considering the inherently nested data. There have to be a lot of decisions and trade offs to be made and I'm unsure what would be the right ones. Let me present a couple of JSON examples and we can figue out what they could look like in CSV/SQLite, from easy to hard:

Easy:

{
  "url": "example.com",
  "data": {
    "title": "Example",
    "body": "Lorem ipsum."
  },
}
url,title,body
example.com,Example,Lorem ipsum.

Medium:

{
  "url": "example.com/page-1",
  "data": [
    {
      "name": "Product A",
      "price": 10.50
    },
    {
      "name": "Product B",
      "price": 4.20
    }
  ]
}

Option 1

Expand top level arrays (values of "data") into individual rows.

url,name,price
example.com/page-1,Product A,10.50
example.com/page-1,Product B,4.20

Option 2

Treat it as JSON if the properties can't be mapped to columns.

url,data
example.com,"[{""name"":""Product A"",""price"":10.50},{""name"":""Product B"",""price"":4.20}]"

Hard:

{
  "url": "example.com/food",
  "data": {
    "title": "Food",
    "categories": [
      "Beverages",
      "Snacks",
      "Meals"
    ],
    "products": [
      {
        "name": "Coca Cola",
        "price": 1.50
      },
      {
        "name": "Fanta Orange",
        "price": 1.50
      }
    ]
  }
}
url,title,categories,products
example.com/food,Food,"[""Beverages"",""Snacks"",""Meals""]","[{""name"":"Coca Cola"",""price"":1.50.....

Impossible:

Given a extraction function like so:

export default function({ doc }) {
    const pagetype = doc.find("body").attr("data-page-type")

    if (pagetype == "category") {
        return extractCategoryPage(doc);
    } else if (pagetype == "product-detail") {
        return extractProductDetailPage(doc);
    }
}

That could give a JSON like so:

[
  {
    "url": "example.com/food",
    "data": {
      "pagetype": "Category",
      "products": [
        "Coca Cola",
        "Fanta Orange"
      ]
    }
  },
  {
    "url": "example.com/food",
    "data": {
      "pagetype": "Product Detail",
      "title": "Coca Cola",
      "price": 1.50,
      "description": "Very tasty"
    }
  }
]

Two problems with this:

If anyone has clever solutions to these problems I'm happy to hear them.

alucab commented 8 months ago

@philippta great job

i agree with you that if you have ndjson and files you are good !