okfn-brasil / querido-diario

📰 Diários oficiais brasileiros acessíveis a todos | 📰 Brazilian government gazettes, accessible to everyone.
https://queridodiario.ok.org.br/
MIT License

Rio de Janeiro/RJ Craw #107

Closed Lucas-Armand closed 2 years ago

Lucas-Armand commented 6 years ago

Hello guys.

I'm having trouble understanding the crawler results for Rio de Janeiro...

If I test the Rio de Janeiro crawler (following the instructions in CONTRIBUTING.md):

sudo docker-compose run --rm processing bash -c "cd data_collection && scrapy crawl rj_rio_de_janeiro"

The result seems to be wrong:

[...]

2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864> referred in <None>
2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864> referred in <None>
2018-08-27 00:32:53 [scrapy.core.scraper] ERROR: Error processing {'date': datetime.date(2018, 8, 20),
 'file_urls': ['http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864'],
 'files': [{'checksum': '49228de889bf8edd753fad4b184adaa3',
            'path': 'full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok&edi_id=3864',
            'url': 'http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864'}],
 'is_extra_edition': True,
 'power': 'executive',
 'scraped_at': datetime.datetime(2018, 8, 27, 0, 32, 53, 7640),
 'territory_id': '3304557'}
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 14, in process_item
    item["source_text"] = self.pdf_source_text(item)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 29, in pdf_source_text
    with open(text_path) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok&edi_id=3864.txt'
I/O Error: Couldn't open file '/mnt/data/full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok': No such file or directory.
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=21/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=24/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=26/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3869> referred in <None>
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=25/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=28/07/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=30/07/2018> (referer: http://doweb.rio.rj.gov.br)
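Judging by the traceback, the downloaded file seems to be stored with the ".php?download=ok&edi_id=3864" suffix taken from the URL, so there is no ".pdf" on disk and the text-extraction step never finds the matching ".txt". A sketch of one possible workaround, assuming the project uses Scrapy's stock FilesPipeline (the class name PdfFilesPipeline below is hypothetical, not the project's actual code):

import hashlib

from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.python import to_bytes


class PdfFilesPipeline(FilesPipeline):
    # Hypothetical override: name the file after the URL hash, like the default
    # pipeline does, but always store it with a .pdf extension instead of the
    # ".php?download=ok&edi_id=..." suffix split off the URL.
    def file_path(self, request, response=None, info=None, *args, **kwargs):
        media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return "full/{}.pdf".format(media_guid)

The files pipeline entry in ITEM_PIPELINES would then have to point at this class instead of the stock one; whether that matches how the project wires its pipelines is an assumption.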

When I run the Porto Alegre crawler (for comparison), I get an intelligible result:

sudo docker-compose run --rm processing bash -c "cd data_collection && scrapy crawl rs_porto_alegre"

The result is: [...]

                'EXTRATO DE TERMO ADITIVO\n'
                '     PROCESSO: 009.003517.14.4\n'
                '     CONTRATANTE: Departamento Municipal de Previdência dos '
                'Servidores Públicos do Município de Porto Alegre.\n'
                '     CONTRATADA: Agência Estado S/A.\n'
                '     OBJETO: prorrogação do contrato n. 02/2015 de licença de '
                'uso do software AE Broadcast Profissional, 04 pontos de '
                'acesso, por 12 meses, a contar de 01.04.2018.\n'
                '     Valor Mensal: R$ 9.625,04.\n'
                '     BASE LEGAL: Artigo 57, inciso II, da Lei 8.666/93 e suas '
                'alterações.\n'
                '\n'
                '                                                                                  '
                'Porto Alegre, 24 de abril de 2018.\n'
                '\n'
                '\n'
                '                                                                             '
                'RENAN DA SILVA AGUIAR, Diretor-Geral.\n'
                '\n'
                '\n'
                '\n'
                '\n'
                '      EXPEDIENTE\n'
                '\n'
                '\n'
                '      PREFEITURA MUNICIPAL DE PORTO ALEGRE\n'
                '      Diário Oficial Eletrônico de Porto Alegre\n'
                '      Órgão de Divulgação Oficial do Município\n'
                '      Instituído pela Lei nº 11.029 de 3 de janeiro de 2011\n'
Lucas-Armand commented 6 years ago

If I try to access 'http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864' (for example), I get only the "first page" of the diário oficial...

I get very similar results for the other URLs...
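For reference, a quick way to inspect what the server actually returns for that URL (a sketch only; it assumes the requests package is available):

import requests

# Sketch: inspect what doweb.rio.rj.gov.br returns for one edi_id from the log above.
url = "http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864"
response = requests.get(url)

print(response.status_code)
print(response.headers.get("Content-Type"))
print(len(response.content), "bytes")
print(response.content[:8])  # a complete PDF starts with b"%PDF-"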

I don't know if I'm doing something wrong or if something changed on the DO server, but I would appreciate some guidance...

Thanks!

cuducos commented 6 years ago

Hi @Lucas-Armand,

From your messages I got the impression that you're trying to read Scrapy's output as the results of the crawler. Please correct me if I'm wrong.

Scrapy's output is just logs to give you an idea of what's going on. The actual results are stored in PostgreSQL. Have you checked that database too?
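A minimal sketch of such a check, assuming a locally reachable PostgreSQL instance; the connection parameters, table name, and column names below are guesses based on the item fields in the log, not the project's actual schema:

import psycopg2  # assumed to be installed (pip install psycopg2-binary)

# Hypothetical query: table and column names are assumptions, not the real schema.
conn = psycopg2.connect(
    host="localhost", dbname="querido_diario", user="postgres", password="postgres"
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT territory_id, date, scraped_at FROM gazettes "
        "ORDER BY scraped_at DESC LIMIT 10"
    )
    for row in cur.fetchall():
        print(row)
conn.close()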

rennerocha commented 2 years ago

Considering that we changed how the project is structured and how to run spiders locally, this can be closed.