Closed grimnight closed 8 months ago
I like the idea of being able to store results in a SQLite database. However, I would prefer a more generalized approach and not expose db APIs to the script.
Something I had in mind was to set the output format as part of the configuration. E.g.:
```javascript
export const config = {
  output: {
    type: "sqlite", // or "json", "ndjson", "csv"
    file: "/path/to/db",
    sqlite: {
      // sqlite specific options
      table: "results",
    },
  },
}
```
I am still unsure what the table should look like as the scraped data could be deeply nested or not.
You could add an output type stdout (that being the default), should there be support for multiple outputs at the same time.
You could infer the columns from the top-level keys and store the rest of the information as JSON. SQLite has pretty good JSON support these days, including the operators -> and ->>. It could also be possible for users to write their own schema in the configuration.
These are just my suggestions.
> should there be support for multiple outputs at the same time.
I'm not sure about supporting multiple outputs at the same time. Do you have a compelling reason why someone would want that?
> You could infer the columns from the top-level keys and store the rest of the information as JSON. SQLite has pretty good JSON support these days, including the operators -> and ->>. It could also be possible for users to write their own schema in the configuration.
Good suggestion. I think this is the way.
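As a sketch of that hybrid schema (table and column names are illustrative only, nothing flyscrape defines today):

```sql
-- Top-level scalar keys become columns; everything nested
-- stays in a JSON text column.
CREATE TABLE results (
  url  TEXT,
  data TEXT  -- nested fields stored as JSON
);

-- SQLite's JSON operators (3.38+) keep the nested part queryable:
-- -> returns a JSON representation, ->> returns a SQL value.
SELECT url, data ->> '$.title' AS title
FROM results
WHERE data ->> '$.price' > 1.0;
```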
> I'm not sure about supporting multiple outputs at the same time. Do you have a compelling reason why someone would want that?
Compelling reason, no, but keep in mind that tools such as tee exist, so someone might find it useful.
Closing because of inactivity. If this is a more widely requested feature we can open it up again under a new enhancement issue.
JSON, CSV and NDJSON would be very useful options for massive scrapes. SQLite would also be a good option, as it allows managing data in a compact form and plenty of tools are available.
A choice between stdout and file would be good too, possibly with rotating files.
I know that several of these things are available as external tools, but the value of flyscrape for me is its compactness.
In addition, consider that JSON is not the best format for very large volumes of data. I would consider at least a transition to NDJSON.
@alucab Thanks for the reminder. It looks like I forgot to add the output options to the sample configuration.
You can already configure output files and choose between json and ndjson today.
```javascript
export const config = {
  // Specify the output options.
  output: {
    // Specify the output file. (default = stdout)
    file: "results.json",

    // Specify the output format. (default = json)
    // Options: "json" | "ndjson"
    format: "json",
  },
}
```
You can also specify these with a parameter like so:
```shell
# Set the output format to ndjson.
$ flyscrape run example.js --output.format ndjson

# Write the output to a file.
$ flyscrape run example.js --output.file results.json
```
Supporting CSV and SQLite is not that trivial, considering the inherently nested data. There are a lot of decisions and trade-offs to be made, and I'm unsure which would be the right ones. Let me present a couple of JSON examples and we can figure out what they could look like in CSV/SQLite, from easy to hard:
```json
{
  "url": "example.com",
  "data": {
    "title": "Example",
    "body": "Lorem ipsum."
  }
}
```
```csv
url,title,body
example.com,Example,Lorem ipsum.
```
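For the flat case, the only real work is CSV escaping: a field needs quoting when it contains a comma, quote or newline, with inner quotes doubled. A minimal sketch (standard RFC 4180 quoting, not actual flyscrape code):

```javascript
// Quote a CSV field per RFC 4180: wrap in quotes if it contains
// a comma, quote or newline, and double any inner quotes.
function csvField(value) {
  const s = String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Turn one flat record into a CSV line.
function csvLine(record) {
  return Object.values(record).map(csvField).join(",");
}

csvLine({ url: "example.com", title: "Example", body: "Lorem ipsum." });
// → "example.com,Example,Lorem ipsum."
```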
```json
{
  "url": "example.com/page-1",
  "data": [
    {
      "name": "Product A",
      "price": 10.50
    },
    {
      "name": "Product B",
      "price": 4.20
    }
  ]
}
```
Expand top-level arrays (the values of "data") into individual rows:
```csv
url,name,price
example.com/page-1,Product A,10.50
example.com/page-1,Product B,4.20
```
Treat it as JSON if the properties can't be mapped to columns.
```csv
url,data
example.com,"[{""name"":""Product A"",""price"":10.50},{""name"":""Product B"",""price"":4.20}]"
```
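The two strategies could be sketched like this (plain JavaScript, not actual flyscrape internals): expand `data` into rows when it is an array of flat objects, otherwise fall back to a single JSON-encoded column.

```javascript
// True if every value in the object is a scalar (no nesting).
function isFlat(obj) {
  return obj !== null && typeof obj === "object" && !Array.isArray(obj) &&
    Object.values(obj).every(v => v === null || typeof v !== "object");
}

// Expand a scraped record into one or more flat rows.
function toRows(record) {
  if (Array.isArray(record.data) && record.data.every(isFlat)) {
    // One row per array element, repeating the url.
    return record.data.map(item => ({ url: record.url, ...item }));
  }
  // Fallback: keep the nested data as a single JSON column.
  return [{ url: record.url, data: JSON.stringify(record.data) }];
}

toRows({
  url: "example.com/page-1",
  data: [{ name: "Product A", price: 10.5 }, { name: "Product B", price: 4.2 }],
});
// → [
//   { url: "example.com/page-1", name: "Product A", price: 10.5 },
//   { url: "example.com/page-1", name: "Product B", price: 4.2 },
// ]
```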
```json
{
  "url": "example.com/food",
  "data": {
    "title": "Food",
    "categories": [
      "Beverages",
      "Snacks",
      "Meals"
    ],
    "products": [
      {
        "name": "Coca Cola",
        "price": 1.50
      },
      {
        "name": "Fanta Orange",
        "price": 1.50
      }
    ]
  }
}
```
```csv
url,title,categories,products
example.com/food,Food,"[""Beverages"",""Snacks"",""Meals""]","[{""name"":""Coca Cola"",""price"":1.50.....
```
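For this mixed case, one possible heuristic (again just a sketch) is to map top-level scalar values to columns and JSON-encode anything nested:

```javascript
// Top-level scalars become columns; nested values are JSON-encoded.
function toColumns(record) {
  const row = { url: record.url };
  for (const [key, value] of Object.entries(record.data)) {
    row[key] = value !== null && typeof value === "object"
      ? JSON.stringify(value)
      : value;
  }
  return row;
}

toColumns({
  url: "example.com/food",
  data: { title: "Food", categories: ["Beverages", "Snacks", "Meals"] },
});
// → { url: "example.com/food", title: "Food",
//     categories: '["Beverages","Snacks","Meals"]' }
```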
Given an extraction function like so:
```javascript
export default function ({ doc }) {
  const pagetype = doc.find("body").attr("data-page-type");
  if (pagetype == "category") {
    return extractCategoryPage(doc);
  } else if (pagetype == "product-detail") {
    return extractProductDetailPage(doc);
  }
}
```
That could give a JSON like so:
```json
[
  {
    "url": "example.com/food",
    "data": {
      "pagetype": "Category",
      "products": [
        "Coca Cola",
        "Fanta Orange"
      ]
    }
  },
  {
    "url": "example.com/food",
    "data": {
      "pagetype": "Product Detail",
      "title": "Coca Cola",
      "price": 1.50,
      "description": "Very tasty"
    }
  }
]
```
Two problems with this: the rows no longer share a common set of columns, since each page type produces a different shape; and the full set of columns is only known once the scrape has finished, so a CSV header (or table schema) can't be written up front.
If anyone has clever solutions to these problems I'm happy to hear them.
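One possible answer to heterogeneous results (a sketch only, with the cost of a sparse table) is to take the union of all top-level keys as the column set and leave missing values null:

```javascript
// Collect the union of top-level keys across all records.
function unionColumns(records) {
  const cols = new Set(["url"]);
  for (const r of records) Object.keys(r.data).forEach(k => cols.add(k));
  return [...cols];
}

// Fill each row, leaving columns absent from a record as null.
function toSparseRows(records) {
  const cols = unionColumns(records);
  return records.map(r => {
    const row = {};
    for (const c of cols) row[c] = c === "url" ? r.url : (r.data[c] ?? null);
    return row;
  });
}
```

This still requires buffering all records (or two passes) before the header can be emitted, which is the second problem above.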
@philippta great job
i agree with you that if you have ndjson and files you are good !
Flyscrape already comes with SQLite3 built-in, so it would be possible to expose an API so results could be stored in a SQLite3 database instead of being printed to stdout.
If an API gets added, I don't know whether it would be better to have a getenv function to get environment variables to create a path to the database file, or whether it would be better to have a stdpath variable pointing to $XDG_DATA_HOME/flyscrape. But I think one or the other would be necessary. Thank you for making Flyscrape, it looks great.
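Purely as an illustration of the two ideas, neither of which exists in flyscrape today (getenv and stdpath are hypothetical names):

```javascript
export const config = {
  output: {
    // Option 1: build the path from an environment variable.
    file: getenv("XDG_DATA_HOME") + "/flyscrape/results.db",

    // Option 2: a predefined stdpath pointing at $XDG_DATA_HOME/flyscrape.
    // file: stdpath + "/results.db",
  },
}
```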