okfn-brasil / querido-diario

📰 Diários oficiais brasileiros acessíveis a todos | 📰 Brazilian government gazettes, accessible to everyone.
https://queridodiario.ok.org.br/
MIT License

Make the gazettes available to the public #157

Closed. jvanz closed this issue 3 years ago.

jvanz commented 4 years ago

During the past few days I've been discussing with @sergiomario how to run the spiders in production and make the scraped files available on a central web page. The first version does not need to be too fancy. The idea is to run the spiders on a server/cluster, store the files, and build a simple web page allowing the user to search and read the scraped files.

As Serenata de Amor already runs on Digital Ocean, I think we can continue with the same provider. All we need for the first version is a server/k8s cluster, PostgreSQL, and file storage. We can address all of these needs with the DO products available.

To achieve this goal, we see the following issues that need to be addressed:

  1. Where to run the spiders, the API, and the web page: a simple server with a cron job, or something more sophisticated, like a Kubernetes cluster, to run the workloads.
  2. Avoid unnecessary requests: if we have already collected the gazettes up to 02/19/2020, the spider should start from 02/20/2020 (by the way, really cool date xD). See the sketch after this list.
  3. Automation: we are few people, so we should automate as much as possible.
  4. UX: find some UX wizard to build a cool web page.
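
As a rough illustration of item 2 (only a sketch; the spider name, URL pattern, and the `last_collected` argument are made up, not the project's actual code), a Scrapy spider could receive the last collected date and only request the days after it:

```python
from datetime import date, timedelta

import scrapy


class ExampleGazetteSpider(scrapy.Spider):
    """Illustrative spider that only crawls dates after the last collected one."""

    name = "example_gazette"

    def __init__(self, last_collected="2020-02-19", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # In production this date would come from wherever we track the
        # gazettes already stored (e.g. the PostgreSQL database).
        self.start_date = date.fromisoformat(last_collected) + timedelta(days=1)

    def start_requests(self):
        day = self.start_date
        while day <= date.today():
            # Hypothetical URL pattern, just to show the idea of skipping
            # dates that were already scraped.
            url = f"https://example.gov.br/gazettes/{day.isoformat()}"
            yield scrapy.Request(url, callback=self.parse)
            day += timedelta(days=1)

    def parse(self, response):
        yield {"date": response.url.split("/")[-1], "file_urls": [response.url]}
```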

@sergiomario, am I forgetting something?

jvanz commented 4 years ago

Further comments....

  1. Even though I liked the Kubernetes idea, I think we can start with the simplest approach and incrementally improve as we need to add more features. So let's use simple servers and improve as the needs show up.
  2. This is something I don't believe should be too difficult to achieve, but I need to investigate more.
  3. For server configuration, my first option is Ansible. It's simple to use and runs over ssh. I'm not sure if there is some kind of integration with Digital Ocean (to get the server info), but we can add this if needed.
  4. @sergiomario suggested doing something similar to the JusBrasil web site. I'm fine with that.

jvanz commented 4 years ago

By the way, there is an ongoing discussion about topic 2:

https://github.com/okfn-brasil/diario-oficial/issues/86

arturmesquitab commented 4 years ago

Hey guys, what is the status of this issue? Need any help with data vis? Would love to contribute!

jvanz commented 4 years ago

I'm configuring a production server. It's almost done, and I'm running the spiders to test whether it's working. For version 0.9, our first goal is to have a simple UI allowing users to download gazettes from any city from one single web page, making access to the data as easy as possible. But I'm not working on that now. I was wondering whether Digital Ocean Spaces already has some kind of simple interface; I need to check that. For the next version, I'm interested in creating an API to filter the documents you're interested in. But first we need to have these files somewhere...
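
As a rough idea of how such a filtering API could look (only a sketch; the framework choice, endpoint name, and fields are assumptions, not the final design):

```python
from datetime import date
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

# Hypothetical in-memory data; in production this would query PostgreSQL.
GAZETTES = [
    {"territory_id": "4205407", "date": "2020-02-20", "url": "https://example.com/a.pdf"},
]


@app.get("/gazettes")
def list_gazettes(territory_id: Optional[str] = None, since: Optional[date] = None):
    """Filter gazettes by city (territory) and publication date."""
    results = GAZETTES
    if territory_id:
        results = [g for g in results if g["territory_id"] == territory_id]
    if since:
        results = [g for g in results if date.fromisoformat(g["date"]) >= since]
    return {"total": len(results), "gazettes": results}
```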

We do not have any concrete ideas about a more complex data visualization tool, but any suggestion is welcome. I opened a very naive PR some time ago creating a very simple data processing pipeline with a graph database, but it's nothing that should run in production.

jvanz commented 4 years ago

@arturmesquitab I think you can start thinking about the UI to show the documents to the users. What do you think? It's just a suggestion. You can work on whatever you want. ;)

If you accept my suggestion, we can discuss this in issue #161.

alexandrevicenzi commented 4 years ago

@jvanz what about scrapinghub as a partner to run the spiders? We know a guy there :) but not sure if they would sponsor something.

Not fetching files already downloaded or even requesting pages that we already saw is doable on Scrapy.
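
For example (just a sketch; the paths are placeholders, not the project's real config), Scrapy's HTTP cache avoids re-requesting pages we have already seen, and the FilesPipeline skips files downloaded less than FILES_EXPIRES days ago:

```python
# settings.py sketch -- placeholder values, not the project's actual config.

# Cache responses on disk so pages we already saw are not requested again.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire

# FilesPipeline skips files downloaded less than FILES_EXPIRES days ago.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "/data/gazettes"
FILES_EXPIRES = 365  # days before a stored file is considered stale
```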

What needs to be automated?

What kind of UX wizard are you looking for? UX is not the same as UI :thinking: and I can do some tricks in both

jvanz commented 4 years ago

> @jvanz what about scrapinghub as a partner to run the spiders? We know a guy there :) but not sure if they would sponsor something.

Yes, we are working on that.

> Not fetching files already downloaded or even requesting pages that we already saw is doable on Scrapy.

Yes, there is an ongoing discussion. See #247

> What needs to be automated?

Right now, I'm considering automating the deployment. There are simple scripts that I've used to do that during my tests. I do not consider them production ready, but they save me a lot of time.
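
Just to illustrate what such a script could look like in Python (only a sketch, using Fabric; the host, paths, and service name are placeholders, not the real setup):

```python
# Illustrative deploy sketch with Fabric; host, paths, and service name are
# placeholders, not the actual production configuration.
from fabric import Connection


def deploy(host="spiders.example.org"):
    conn = Connection(host)
    # Update the code and its dependencies, then restart the scheduled jobs.
    with conn.cd("/srv/querido-diario"):
        conn.run("git pull --ff-only")
        conn.run("pip install -r requirements.txt")
    conn.sudo("systemctl restart querido-diario-spiders")


if __name__ == "__main__":
    deploy()
```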

> What kind of UX wizard are you looking for? UX is not the same as UI :thinking: and I can do some tricks in both

Nothing too complicated. I'm not spending time on that; I'm focused on the API now. But I know that OKBR (@sergiomario) is thinking about it.

ogecece commented 3 years ago

@jvanz do you think we could close this? Some cool stuff that is not implemented yet was being discussed here, but I think the main idea is done :)

jvanz commented 3 years ago

> @jvanz do you think we could close this? Some cool stuff that is not implemented yet was being discussed here, but I think the main idea is done :)

Sure. Closing now...