vsoch / watchme

Reproducible watchers for research
https://vsoch.github.io/watchme/
Mozilla Public License 2.0

Exporter layer for watchme #30

Open SCHKN opened 5 years ago

SCHKN commented 5 years ago

Hello @vsoch!

Over the past few days, I have used watchme quite a lot for web scraping tasks and found the tool very handy. For context, I needed a way to export data to external data sources, in this case Prometheus (via Pushgateway).

I decided to develop a new layer on top of watchme to implement exporters. It could be used, for example, to export data to messaging queues or databases. With these changes, you can now run:

watchme create weather-watcher --exporter pushgateway

This creates an [exporter-pushgateway] section in the watchme configuration, following templates designed specifically for exporters.

[watcher]
active = true
type = urls

[exporter-pushgateway]
url = localhost:9091
type = pushgateway
active = true

Note: I am aware that there is already an export function, but I could not build on it, because it exports all of the content available in the repository.

I decided to export data in the run function of the task lifecycle.

        # Finally, finish and export the runs.
        if test is False:
            self.finish_runs(results)
            self.export_runs(results, exporters)
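As a rough illustration of the step above, here is a minimal sketch of how results might be handed to exporters at the end of a run. The `Watcher` and `RecordingExporter` classes and the `export(task_name, result)` method are hypothetical stand-ins, not watchme's actual classes; only the `finish_runs` / `export_runs` call order comes from the snippet above.

```python
class RecordingExporter:
    """Hypothetical exporter that only remembers what it was given."""

    def __init__(self, name):
        self.name = name
        self.exported = []

    def export(self, task_name, result):
        # A real exporter would send the value to an external service here.
        self.exported.append((task_name, result))


class Watcher:
    """Minimal stand-in for a watcher (assumption, not watchme's class)."""

    def __init__(self):
        self.finished = {}

    def finish_runs(self, results):
        # Placeholder for whatever persistence the watcher performs.
        self.finished.update(results)

    def export_runs(self, results, exporters):
        # Hand every task result to every exporter.
        for task_name, result in results.items():
            for exporter in exporters:
                exporter.export(task_name, result)

    def run(self, results, exporters, test=False):
        # Same order as the snippet above: finish first, then export.
        if test is False:
            self.finish_runs(results)
            self.export_runs(results, exporters)
        return results


watcher = Watcher()
gateway = RecordingExporter("exporter-pushgateway")
watcher.run({"task-temperature": "21"}, [gateway])
```

Exporting inside `run` means each value leaves the watcher as soon as it is recorded, which is the behaviour the use case here needs.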

I also added an option to specify a regex when scraping with a URL selection. It looks like this:

[task-temperature]
url = https://www.accuweather.com/en/lu/luxembourg/228714/weather-forecast/228714
selection = .local-temp
get_text = true
func = get_url_selection
active = true
type = urls
regex = [0-9]+
header_user-agent = Mozilla/5.0

This option comes in very handy for targeting only numbers when web scraping.
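To make the effect of the regex option concrete, here is a small sketch of filtering scraped text with the pattern from the configuration above. The helper name `apply_regex` and the sample strings are assumptions for illustration; how the fork applies the pattern internally is not shown in this thread.

```python
import re


def apply_regex(text, pattern):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


# Sample strings standing in for the text of a selected element.
single = apply_regex("21°C", r"[0-9]+")
several = apply_regex("Feels like 19, actual 21", r"[0-9]+")
```

With `[0-9]+`, units and surrounding words are dropped and only the digit runs remain, which suits a numeric metric.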

I developed quite a few functions to enable exporters and regexes, and all the modifications are available on my GitHub in the repo named watchme-prometheus. In the end, I was able to run scheduled tasks that scraped a weather website every two seconds and exported the data to Pushgateway: https://imgur.com/a/MJDuIUA
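For readers unfamiliar with Pushgateway, here is a minimal sketch of what pushing one scraped value could look like using only the standard library. It is not the code from watchme-prometheus. The function name, the job name `weather`, and the metric name `temperature` are assumptions, and the `http://` scheme is added because the configured url has none.

```python
import urllib.request


def build_push_request(gateway, job, metric, value):
    """Build (but do not send) a request that pushes one metric.

    Pushgateway accepts the Prometheus text format at /metrics/job/<job>.
    """
    body = f"{metric} {value}\n".encode("utf-8")
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain"},
    )


request = build_push_request("localhost:9091", "weather", "temperature", 21.0)
# To actually send it, a Pushgateway must be listening on that address:
# urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch usable without a running gateway.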

I am curious to know whether you would be interested in such modifications.

In any case, I had a ton of fun developing this, and the way the app is built made it very easy to iterate.

Thank you!

vsoch commented 5 years ago

Both of these additions are great! To summarize:

I'm so glad that this was useful to you, and I'm greatly looking forward to these next developments! Let me know your preference based on your availability.

vsoch commented 5 years ago

@SCHKN I want to double check that you saw the "How do I export data" section - you shouldn't need to save/push every file to some database, because the changes are saved in git (and are then exportable to a flat, temporal structure). https://vsoch.github.io/watchme/getting-started/index.html#how-do-i-export-data

SCHKN commented 5 years ago

I checked the export data section before the changes to see if they could be done in that part of the code.

Unfortunately, as the export function provides a "full" export (comparable to a "dump" in some ways), it didn't fit my needs: I needed the data exactly as it was recorded by the scraper.

To put things into context, I needed the data to be exported as soon as it was recorded so that it would be immediately visible in Grafana. I hope that makes sense.

If you are still interested in the two changes, I can open two PRs. Let me know :)

vsoch commented 5 years ago

Yes, I’d definitely be interested to see! I don’t want to waste your time so I’ll offer to take a look and PR if it’s generalizable enough.

SCHKN commented 5 years ago

Awesome! Sounds good to me.

vsoch commented 5 years ago

Going to add some notes here as I work on this:

$ watchme push <watcher> <exporter>

This seems like a needed function, in case a push is desired without (or separate from) running the watcher. I think it would also be intuitive to allow the user to push "all exported" data, something like:

$ watchme push --all <watcher> <exporter>

That way we could set up:

  1. A task that runs and pushes a single result via an exporter
  2. A task that runs, but is only exported / pushed selectively by the user
  3. A task that collects a large chunk of data, and then is pushed all at once.

I can see some future need to push a subset of data, but for now (until someone asks for it) I'll experiment with these ideas.

vsoch commented 5 years ago

For the exporters, I'm thinking that we would want the granularity to match exporters with watcher tasks. Currently, if we add an exporter and it's active, it is (implicitly) active for all tasks. We should be able to turn entire exporters on/off, but have the primary control come directly from the tasks. For example, here we have two pushgateway exporters, one for each task (and you could imagine one task having more than one exporter).

[watcher]
active = false

[task-air-oakland]
url = http://aqicn.org/city/california/alameda/oakland-west
exporters = [exporter-pushgateway]
func = get_url_selection
selection = #aqiwgtvalue
file_name = oakland.txt
get_text = true
active = true
type = urls

[task-air-boulder]
url = http://aqicn.org/city/usa/colorado/boulder-cu/athens/
func = get_url_selection
selection = #aqiwgtvalue
file_name = boulder.txt
get_text = true
active = true
type = urls

[exporter-pushgateway]
url = localhost:9091
type = pushgateway
active = true

[exporter-another-pushgateway]
url = localhost:9091
type = pushgateway
active = true
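One way to read the configuration above is as a mapping from each task to the exporters it lists. The sketch below parses a trimmed copy of that configuration with `configparser`. The bracketed, comma-separated format of the `exporters` value and the function name `exporters_for_tasks` are assumptions, since the thread only shows a single-item example.

```python
import configparser

# Trimmed copy of the configuration above, for illustration only.
CONFIG = """
[watcher]
active = false

[task-air-oakland]
exporters = [exporter-pushgateway]
active = true

[task-air-boulder]
active = true

[exporter-pushgateway]
url = localhost:9091
type = pushgateway
active = true
"""


def exporters_for_tasks(text):
    """Map each task section to the active exporters it lists."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    mapping = {}
    for section in parser.sections():
        if not section.startswith("task-"):
            continue
        raw = parser.get(section, "exporters", fallback="")
        # Assumed format: "[name-one, name-two]".
        names = [n.strip() for n in raw.strip("[] ").split(",") if n.strip()]
        # Keep only exporters that are defined and switched on.
        mapping[section] = [
            n for n in names
            if parser.has_section(n)
            and parser.getboolean(n, "active", fallback=False)
        ]
    return mapping


mapping = exporters_for_tasks(CONFIG)
```

Here the exporter's own `active` flag acts as a global switch, while the task's `exporters` entry decides where each result goes.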

@SCHKN could you point me in the right direction to set up the endpoint so I can test as I develop?

SCHKN commented 5 years ago

@vsoch,

I find the push idea interesting. Do you plan on adding it to the schedule function in order to push data as it comes in?

Sure! To set up a Pushgateway, you can head over to https://github.com/prometheus/pushgateway and read the "Run it" section. Depending on your OS, it should be as easy as launching the binary and letting it run. I'm not sure that you actually need Prometheus on the other end to see the results.

Let me know if I can help.

vsoch commented 5 years ago

@SCHKN yes - if a user has added an exporter and it's active and listed for a watcher task, it will run with schedule (this is how we achieve complete automation as you've done!). If the exporter is defined but not listed with any particular task, then it wouldn't be run with the scheduler. If the exporter is defined but not listed with a task and then manually requested with push, it would be run.
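The rules described above can be written down as a small decision function. This is only a sketch of the stated behaviour, not the implementation. The name `should_export` and its parameters are hypothetical, and treating an inactive exporter as never running (even on a manual push) is an assumption the thread does not settle.

```python
def should_export(exporter_active, listed_for_task, manual_push=False):
    """Decide whether an exporter runs for a task.

    exporter_active: the exporter section has active = true
    listed_for_task: the task names this exporter in its exporters list
    manual_push:     the user explicitly requested a push
    """
    if not exporter_active:
        # Assumption: switching an exporter off disables it everywhere.
        return False
    if manual_push:
        # A manual push runs the exporter even if no task lists it.
        return True
    # Under the scheduler, only exporters listed for the task run.
    return listed_for_task


scheduled_listed = should_export(True, True)
scheduled_unlisted = should_export(True, False)
manual_unlisted = should_export(True, False, manual_push=True)
```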

Thanks for the tips! I likely won't get this PR open today, but surely within the week. I'll keep you posted!

vsoch commented 5 years ago

@SCHKN do you think it would be more intuitive to have two separate commands, one to add a watcher task and one to add an exporter? For example:

$ watchme add-task watcher task-cpu func@cpu_task type@psutils
$ watchme add-exporter watcher exporter-pushgateway

or have a single add command that determines the addition based on the prefix of what is being added? E.g.,

$ watchme add watcher task-cpu func@cpu_task type@psutils
$ watchme add watcher exporter-pushgateway

I've been implementing the second, but I'm thinking it might be cleaner (for development down the line) to have them as separate entrypoints.
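For comparison, here is a minimal `argparse` sketch of the first option, with separate `add-task` and `add-exporter` subcommands. It is not watchme's actual client. The argument names and the `parse_params` helper for `key@value` pairs are assumptions based on the example commands above.

```python
import argparse


def parse_params(pairs):
    """Turn ["func@cpu_task", "type@psutils"] into a dictionary."""
    params = {}
    for pair in pairs:
        key, _, value = pair.partition("@")
        params[key] = value
    return params


def build_parser():
    parser = argparse.ArgumentParser(prog="watchme")
    commands = parser.add_subparsers(dest="command", required=True)
    # Each entrypoint gets its own subparser, so either one can grow
    # its own options later without affecting the other.
    for name in ("add-task", "add-exporter"):
        command = commands.add_parser(name)
        command.add_argument("watcher")
        command.add_argument("name")
        command.add_argument("params", nargs="*")
    return parser


args = build_parser().parse_args(
    ["add-task", "watcher", "task-cpu", "func@cpu_task", "type@psutils"]
)
params = parse_params(args.params)
```

With separate subcommands the command name states the intent, so nothing has to be inferred from a `task-` or `exporter-` prefix.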

SCHKN commented 5 years ago

When it comes to the actual usage of the application, I do believe that it is preferable to use the add-exporter method.

Even if it adds an extra step, I find it less confusing and more explicit. What are your thoughts on it?

vsoch commented 5 years ago

I totally agree! I'm glad you do too :)