urbanogilson / SICAR

This tool is designed for students, researchers, data scientists or anyone who would like to have access to SICAR files
https://urbanogilson.github.io/posts/sicar/
MIT License

Saving HTML instead of Shapefile #15

Closed iagomachadocs closed 1 year ago

iagomachadocs commented 1 year ago

Sometimes, when attempting to download a shapefile, the HTTP request fails and an HTML file is returned instead. However, the script saves the HTML content as if it were the requested shapefile.

This issue occurred when I was trying to download data from the 'PE' state. The file SHAPE_2607109.zip was saved normally, but it contained the following content:

<!DOCTYPE html>

<html lang="pt-br">
    <head>
        <title>Im&oacute;veis</title>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
        <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/font-awesome.css">
        <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/bootstrap.min.css">
        <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/bootstrap-theme.min.css">
        <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/main.css">
        <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/navbar.css">
            <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/leaflet.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/imoveis.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/easy-button.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/leaflet-custom.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/menu-layers.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/Leaflet.GraphicScale.min.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/leaflet-search.css">
    <link rel="stylesheet" media="screen" href="/publico/public/stylesheets/pace.css">
        <link rel="shortcut icon" type="image/png" href="/publico/public/images/favicon.ico">
</head>
<body>

    <nav class="navbar navbar-inverse navbar-fixed-top visible-sm visible-xs">
        <div class="container-fluid">
            <div class="navbar-header">
                <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
                    <span class="sr-only">Mostrar menu</span>
                    <span class="icon-bar"></span>
                    <span class="icon-bar"></span>
                    <span class="icon-bar"></span>
                </button>
                <a class="navbar-brand inicial-relatorio" href="#"><img src="/publico/public/images/logo_car.png" class="img-responsive"></a>
            </div>
            <div id="navbar" class="navbar-collapse collapse">
                <ul class="nav navbar-nav navbar-right">
                    <li class="active"><a href="/publico/imoveis/index">IMÓVEIS</a></li>
                    <!-- <li class=""><a href="/publico/tematicos/regularidade">REGULARIDADE</a></li> -->
                    <li class=""><a href="/publico/tematicos/restricoes">RESTRIÇÕES</a></li>
                    <li class=""><a href="/publico/municipios/downloads">BASE DE DOWNLOADS</a></li>
                </ul>
            </div>
        </div>
    </nav>

    <div class="container-fluid">
        <div class="row">

            <div class="col-md-3 col-lg-2 hidden-sm hidden-xs" id="list-acoes">

                <a href="/publico/imoveis/index"><img src="/publico/public/images/logo_car.png" class="img-responsive inicial-relatorio"></a>

                <ul class="nav nav-pills nav-stacked">
                    <li role="presentation" class="active"><a href="/publico/imoveis/index">IMÓVEIS</a></li>
                    <!-- <li role="presentation" class=""><a href="/publico/tematicos/regularidade">REGULARIDADE</a></li> -->
                    <li role="presentation" class=""><a href="/publico/tematicos/restricoes">RESTRIÇÕES</a></li>
                    <li role="presentation" class=""><a href="/publico/municipios/downloads">BASE DE DOWNLOADS</a></li>
                </ul>

                <div class="menu-footer col-md-3 col-lg-2">

                    <a href="http://www.florestal.gov.br/" target="_blank"><img src="/publico/public/images/logo_sfb.png" class="img-responsive"></a>

                                            <i class="small">Última atualização dos dados em :
11/04/2023
</i>

                    <div class="versao">Versão 1.0</div>
                </div>

            </div>

            <div class="col-md-9 col-lg-10 col-sm-12 col-xs-12 col-right">

<div id="mapa-imoveis"></div>

<div style="display:none" class="quadro-informativo">

    <div class="fundo-branco">

        <img src="/publico/public/images/bandeiras/br.png"/>
        <b class="regiao">Brasil</b>

        <div class="dados-brasil">

            <div class="quadro-dados">
                <h5>total de imóveis:</h5>
                <h3 class="total-de-imoveis">6.521.253</h3>
                <div class="progress">
                    <div class="progress-bar progress-bar-success ativo" style="width: 100.0%">

                    </div>
                    <div class="progress-bar progress-bar-warning pendente" style="width: 3.5094579206645933%">

                    </div>
                    <div class="progress-bar progress-bar-danger cancelado" style="width: 0.7577696337313062%">

                    </div>
                </div>
            </div>

            <div class="quadro-dados area-cadastrada">
                <!-- <div id="circle">
                    <div></div>
                    <b>100.0</b>
                </div> -->
                <h5 class="circle">Área cadastrada:</h5>
                <h4 class="circle area-cadastrada">651.771.041,48 ha</h4>
            </div>

            <div class="quadro-dados botao-download">

                <a href="../municipios/downloads">
                    <div class="fa fa-cloud-download"></div>
                    <div>Downloads</div>
                </a>
            </div>

        </div>

    </div>

</div>

            </div>

        </div>

    </div>

    <script src="/publico/public/javascripts/jquery-3.1.1.min.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/bootstrap.min.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/lodash.core.min.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/formatacao.js" type="text/javascript" charset="utf-8"></script>
    <script src="https://maps.googleapis.com/maps/api/js?v=3.2&key=AIzaSyAiU1s8eAgYTI09A6awKaPZfOomgAv74tU"
        async defer></script>
    <script src="/publico/public/javascripts/layers/ufs.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/layers/cidades.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/layers/cidade.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/layers/imovel.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/leaflet.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/FileSaver.min.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/leaflet.edgebuffer.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/Leaflet.GraphicScale.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/leaflet-search.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/easy-button.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/circle-progress.min.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/menuLayers.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/layers/wmsTile.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/layers/camadas.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/layers/arvoreCamadas.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/search-button.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/Leaflet.GoogleMutant.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/pace.min.js" type="text/javascript" charset="utf-8"></script>
    <script src="/publico/public/javascripts/imoveis.js" type="text/javascript" charset="utf-8"></script>
</body>
</html>
urbanogilson commented 1 year ago

It looks like the session timed out, but was this an isolated case or did all downloads fail after this one?

What did the progress bar show? Did it report 5.41M downloaded, as in the run below?

car.download_city_code('2607109', folder='PE', debug=True)

[25] - Invalid captcha 'pvhinu' to request city '2607109' in 'shapefile' format
[24] - Requesting city '2607109' in 'shapefile' format with captcha 'EWCPa'
[24] - Failed to download shapefile! When requesting city '2607109' in 'shapefile' format
[23] - Requesting city '2607109' in 'shapefile' format with captcha '6130F'
[23] - Failed to download shapefile! When requesting city '2607109' in 'shapefile' format
[22] - Invalid captcha 'xyAZ' to request city '2607109' in 'shapefile' format
[21] - Requesting city '2607109' in 'shapefile' format with captcha 'dfydT'
Downloading Shapefile for city with code '2607109': 100%|██████████| 5.41M/5.41M [00:02<00:00, 2.37MiB/s]
PosixPath('PE/SHAPE_2607109.zip')
iagomachadocs commented 1 year ago

After this error, some cities were downloaded normally but the same happened with 2 other files.
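For anyone who wants to find the affected files in a folder that has already been downloaded, here is a minimal sketch using only the standard library. It assumes the bad files are exactly the ones that are not valid ZIP archives (an HTML page saved under a .zip name fails that check). The function name find_invalid_zips is hypothetical and not part of SICAR.

```python
import zipfile
from pathlib import Path


def find_invalid_zips(folder):
    """Return the .zip files in `folder` that are not real ZIP archives.

    Hypothetical helper, not part of SICAR: an HTML page saved with a
    .zip name has no ZIP directory, so zipfile.is_zipfile() rejects it.
    """
    return sorted(
        path
        for path in Path(folder).glob("*.zip")
        if not zipfile.is_zipfile(path)
    )


if __name__ == "__main__":
    # "PE" is the download folder used in this thread.
    for path in find_invalid_zips("PE"):
        print(f"Not a ZIP archive, needs re-download: {path}")
```

This only detects the problem after the fact; it does not prevent the HTML from being saved in the first place.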

iagomachadocs commented 1 year ago

I no longer have the progress bar output, but I'll try to reproduce the error again and check this.

iagomachadocs commented 1 year ago

The same error occurred again with another file. It only downloaded 7.58kB of the HTML content for the city with code 2610806.


urbanogilson commented 1 year ago

Perfect reproduction. It looks like we can try to mitigate it by retrying the download.

Comparing the response for a normal page (HTML) with the response for a Shapefile, there is a clear difference. The Shapefile and CSV responses have a different Content-Type and include fields that the HTML response lacks: Content-Length, Content-Transfer-Encoding, Content-Disposition, Accept-Ranges, etc.

Instead of total_size = int(response.headers.get("Content-Length", 0)) we can check that Content-Length exists and is greater than zero, and also check that the other fields that only appear in file responses are present. Otherwise, raise FailedToDownloadShapefileException or FailedToDownloadCsvException and the download will be retried automatically.
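Roughly, the check could look like the sketch below. It is not the final implementation. It assumes headers is any mapping with a .get method (such as response.headers), the function name validated_content_length is made up for illustration, and the exception class is redefined here only so the snippet runs on its own.

```python
class FailedToDownloadShapefileException(Exception):
    """Stand-in for the exception named above, defined so the sketch runs alone."""


def validated_content_length(headers, expected_type="application/zip"):
    """Return Content-Length if the headers look like a file download.

    Hypothetical helper: raises FailedToDownloadShapefileException when the
    response looks like the HTML page instead of the requested file.
    """
    content_type = headers.get("Content-Type", "")
    # The fallback page is text/html; files are application/zip or text/csv.
    if not content_type.lower().startswith(expected_type):
        raise FailedToDownloadShapefileException(
            f"unexpected Content-Type {content_type!r}"
        )

    # File responses carry Content-Length; the chunked HTML page has none.
    try:
        total_size = int(headers.get("Content-Length", 0))
    except ValueError:
        total_size = 0
    if total_size <= 0:
        raise FailedToDownloadShapefileException("missing or zero Content-Length")

    # File responses are also sent as attachments.
    if "attachment" not in headers.get("Content-Disposition", ""):
        raise FailedToDownloadShapefileException("missing Content-Disposition")

    return total_size
```

For CSV the same function could be called with expected_type="text/csv", raising the CSV exception instead. Checking Content-Type first means the HTML page is rejected even if a server ever sends it with a Content-Length.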

What do you think, and would you like to implement it @iagomachadocs?

HTML

HTTP/1.1 200 OK
Server: nginx/1.12.2
Date: Mon, 03 Jul 2023 17:54:12 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: PLAY_FLASH=; Max-Age=0; Expires=Mon, 03 Jul 2023 17:54:12 GMT; Path=/publico/
Set-Cookie: PLAY_ERRORS=; Max-Age=0; Expires=Mon, 03 Jul 2023 17:54:12 GMT; Path=/publico/
Set-Cookie: PLAY_SESSION=7d07242ac33ae6249acd11a24a0a30c67fedd411-___ID=e8ae6dc5-1ff8-4170-bf97-4b10af056e9f; Path=/publico/
Cache-Control: no-cache
Content-Security-Policy: upgrade-insecure-requests
Access-Control-Allow-Origin: *
Content-Security-Policy: upgrade-insecure-requests
Content-Encoding: gzip

CSV

HTTP/1.1 200 OK
Server: nginx/1.12.2
Date: Mon, 03 Jul 2023 17:57:29 GMT
Content-Type: text/csv; charset=utf-8
Content-Length: 193737
Connection: keep-alive
Content-Transfer-Encoding: binary
Content-Disposition: attachment; filename=1200252.csv
Set-Cookie: PLAY_FLASH=; Max-Age=0; Expires=Mon, 03 Jul 2023 17:57:29 GMT; Path=/publico/
Set-Cookie: PLAY_ERRORS=; Max-Age=0; Expires=Mon, 03 Jul 2023 17:57:29 GMT; Path=/publico/
Set-Cookie: PLAY_SESSION=7d07242ac33ae6249acd11a24a0a30c67fedd411-___ID=e8ae6dc5-1ff8-4170-bf97-4b10af056e9f; Path=/publico/
Cache-Control: max-age=3600
Last-Modified: Thu, 06 Apr 2023 21:00:17 GMT
ETag: "1680814817000--1269809650"
Accept-Ranges: bytes
Content-Security-Policy: upgrade-insecure-requests
Access-Control-Allow-Origin: *
Content-Security-Policy: upgrade-insecure-requests

Shapefile

HTTP/1.1 200 OK
Server: nginx/1.12.2
Date: Mon, 03 Jul 2023 17
Content-Type: application/zip
Content-Length: 20357265
Connection: keep-alive
Content-Transfer-Encoding: binary
Content-Disposition: attachment; filename=SHAPE_1200104.zip
Set-Cookie: PLAY_FLASH=; Max-Age=0; Expires=Mon, 03 Jul 2023 17:54:34 GMT; Path=/publico/
Set-Cookie: PLAY_ERRORS=; Max-Age=0; Expires=Mon, 03 Jul 2023 17:54:34 GMT; Path=/publico/
Set-Cookie: PLAY_SESSION=7d07242ac33ae6249acd11a24a0a30c67fedd411-___ID=e8ae6dc5-1ff8-4170-bf97-4b10af056e9f; Path=/publico/
Cache-Control: max-age=3600
Last-Modified: Thu, 06 Apr 2023 20:58:09 GMT
ETag: "1680814689000-44873666"
Accept-Ranges: bytes
Content-Security-Policy: upgrade-insecure-requests
Access-Control-Allow-Origin: *
Content-Security-Policy: upgrade-insecure-requests
iagomachadocs commented 1 year ago

Exactly what I was thinking. Yes, I would like to implement it.

iagomachadocs commented 1 year ago

I just opened PR #16 with this implementation.