turicas / covid19-br

Most recent daily coronavirus data by Brazilian municipality
https://brasil.io/dataset/covid19
GNU Lesser General Public License v3.0

Inconsistencies in the database #180

Closed danielfsbarreto closed 4 years ago

danielfsbarreto commented 4 years ago

There was a recurring problem running the project's goodtables action, which was fixed in #178. Now all the inconsistencies that accumulated in the meantime need to be resolved.

Job: https://github.com/turicas/covid19-br/runs/814979053?check_suite_focus=true#step:3:911

2020-06-28T01:10:34.8665625Z DATASET
2020-06-28T01:10:34.8667294Z =======
2020-06-28T01:10:34.8668860Z {'error-count': 35,
2020-06-28T01:10:34.8670649Z  'preset': 'nested',
2020-06-28T01:10:34.8671296Z  'table-count': 10,
2020-06-28T01:10:34.8671754Z  'time': 54.346,
2020-06-28T01:10:34.8672417Z  'valid': False}
2020-06-28T01:10:34.8672565Z 
2020-06-28T01:10:34.8672771Z TABLE [1]
2020-06-28T01:10:34.8672978Z =========
2020-06-28T01:10:34.8673469Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8673924Z  'encoding': 'no',
2020-06-28T01:10:34.8674388Z  'error-count': 3,
2020-06-28T01:10:34.8674832Z  'format': 'inline',
2020-06-28T01:10:34.8675342Z  'headers': ['date', 'notes', 'state', 'url'],
2020-06-28T01:10:34.8675807Z  'resource-name': 'boletim',
2020-06-28T01:10:34.8676259Z  'row-count': 3310,
2020-06-28T01:10:34.8676716Z  'schema': 'table-schema',
2020-06-28T01:10:34.8677166Z  'scheme': 'inline',
2020-06-28T01:10:34.8677658Z  'source': '/app/data/output/boletim.csv',
2020-06-28T01:10:34.8678134Z  'time': 2.043,
2020-06-28T01:10:34.8678569Z  'valid': False}
2020-06-28T01:10:34.8678974Z ---------
2020-06-28T01:10:34.8679588Z [-,2] [non-matching-header] Header in column 2 doesn't match field name "state" in the schema
2020-06-28T01:10:34.8680249Z [-,3] [non-matching-header] Header in column 3 doesn't match field name "url" in the schema
2020-06-28T01:10:34.8680894Z [-,4] [non-matching-header] Header in column 4 doesn't match field name "notes" in the schema
2020-06-28T01:10:34.8681058Z 
2020-06-28T01:10:34.8681247Z TABLE [2]
2020-06-28T01:10:34.8681715Z =========
2020-06-28T01:10:34.8682221Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8684061Z  'encoding': 'no',
2020-06-28T01:10:34.8684569Z  'error-count': 0,
2020-06-28T01:10:34.8685012Z  'format': 'inline',
2020-06-28T01:10:34.8685457Z  'headers': ['date',
2020-06-28T01:10:34.8685887Z              'state',
2020-06-28T01:10:34.8686329Z              'city',
2020-06-28T01:10:34.8686785Z              'place_type',
2020-06-28T01:10:34.8687478Z              'confirmed',
2020-06-28T01:10:34.8687928Z              'deaths',
2020-06-28T01:10:34.8688404Z              'order_for_place',
2020-06-28T01:10:34.8688861Z              'is_last',
2020-06-28T01:10:34.8689345Z              'estimated_population_2019',
2020-06-28T01:10:34.8689826Z              'city_ibge_code',
2020-06-28T01:10:34.8690342Z              'confirmed_per_100k_inhabitants',
2020-06-28T01:10:34.8690812Z              'death_rate'],
2020-06-28T01:10:34.8691271Z  'resource-name': 'caso',
2020-06-28T01:10:34.8691732Z  'row-count': 263945,
2020-06-28T01:10:34.8692198Z  'schema': 'table-schema',
2020-06-28T01:10:34.8692627Z  'scheme': 'inline',
2020-06-28T01:10:34.8693119Z  'source': '/app/data/output/caso.csv',
2020-06-28T01:10:34.8693577Z  'time': 54.154,
2020-06-28T01:10:34.8694012Z  'valid': True}
2020-06-28T01:10:34.8694133Z 
2020-06-28T01:10:34.8694335Z TABLE [3]
2020-06-28T01:10:34.8694537Z =========
2020-06-28T01:10:34.8695006Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8695470Z  'encoding': 'no',
2020-06-28T01:10:34.8695908Z  'error-count': 0,
2020-06-28T01:10:34.8696350Z  'format': 'inline',
2020-06-28T01:10:34.8696779Z  'headers': ['state',
2020-06-28T01:10:34.8697247Z              'state_ibge_code',
2020-06-28T01:10:34.8697721Z              'city_ibge_code',
2020-06-28T01:10:34.8698169Z              'city',
2020-06-28T01:10:34.8698647Z              'estimated_population'],
2020-06-28T01:10:34.8699143Z  'resource-name': 'populacao-estimada',
2020-06-28T01:10:34.8699600Z  'row-count': 5571,
2020-06-28T01:10:34.8700050Z  'schema': 'table-schema',
2020-06-28T01:10:34.8700496Z  'scheme': 'inline',
2020-06-28T01:10:34.8701313Z  'source': '/app/data/populacao-estimada-2019.csv',
2020-06-28T01:10:34.8701799Z  'time': 0.907,
2020-06-28T01:10:34.8702229Z  'valid': True}
2020-06-28T01:10:34.8702344Z 
2020-06-28T01:10:34.8702547Z TABLE [4]
2020-06-28T01:10:34.8702750Z =========
2020-06-28T01:10:34.8703220Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8703681Z  'encoding': 'no',
2020-06-28T01:10:34.8704125Z  'error-count': 0,
2020-06-28T01:10:34.8704550Z  'format': 'inline',
2020-06-28T01:10:34.8705041Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8705528Z  'resource-name': 'schema-boletim',
2020-06-28T01:10:34.8706201Z  'row-count': 5,
2020-06-28T01:10:34.8706674Z  'schema': 'table-schema',
2020-06-28T01:10:34.8707120Z  'scheme': 'inline',
2020-06-28T01:10:34.8707606Z  'source': '/app/schema/boletim.csv',
2020-06-28T01:10:34.8708043Z  'time': 0.013,
2020-06-28T01:10:34.8708479Z  'valid': True}
2020-06-28T01:10:34.8708618Z 
2020-06-28T01:10:34.8708820Z TABLE [5]
2020-06-28T01:10:34.8709008Z =========
2020-06-28T01:10:34.8709479Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8709927Z  'encoding': 'no',
2020-06-28T01:10:34.8710364Z  'error-count': 0,
2020-06-28T01:10:34.8710806Z  'format': 'inline',
2020-06-28T01:10:34.8711296Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8711836Z  'resource-name': 'schema-caso',
2020-06-28T01:10:34.8712292Z  'row-count': 13,
2020-06-28T01:10:34.8712747Z  'schema': 'table-schema',
2020-06-28T01:10:34.8713192Z  'scheme': 'inline',
2020-06-28T01:10:34.8713667Z  'source': '/app/schema/caso.csv',
2020-06-28T01:10:34.8714116Z  'time': 0.091,
2020-06-28T01:10:34.8714551Z  'valid': True}
2020-06-28T01:10:34.8714663Z 
2020-06-28T01:10:34.8714862Z TABLE [6]
2020-06-28T01:10:34.8715063Z =========
2020-06-28T01:10:34.8715465Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8715672Z  'encoding': 'no',
2020-06-28T01:10:34.8715863Z  'error-count': 0,
2020-06-28T01:10:34.8716162Z  'format': 'inline',
2020-06-28T01:10:34.8716405Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8716645Z  'resource-name': 'schema-populacao-estimada',
2020-06-28T01:10:34.8716856Z  'row-count': 6,
2020-06-28T01:10:34.8717069Z  'schema': 'table-schema',
2020-06-28T01:10:34.8717278Z  'scheme': 'inline',
2020-06-28T01:10:34.8717510Z  'source': '/app/schema/populacao-estimada-2019.csv',
2020-06-28T01:10:34.8717796Z  'time': 0.075,
2020-06-28T01:10:34.8718000Z  'valid': True}
2020-06-28T01:10:34.8718064Z 
2020-06-28T01:10:34.8718145Z TABLE [7]
2020-06-28T01:10:34.8718242Z =========
2020-06-28T01:10:34.8718461Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8718671Z  'encoding': 'no',
2020-06-28T01:10:34.8718875Z  'error-count': 0,
2020-06-28T01:10:34.8719080Z  'format': 'inline',
2020-06-28T01:10:34.8719436Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8719633Z  'resource-name': 'schema-epidemiological-week',
2020-06-28T01:10:34.8719815Z  'row-count': 4,
2020-06-28T01:10:34.8720001Z  'schema': 'table-schema',
2020-06-28T01:10:34.8720184Z  'scheme': 'inline',
2020-06-28T01:10:34.8720388Z  'source': '/app/schema/epidemiological-week.csv',
2020-06-28T01:10:34.8720572Z  'time': 0.138,
2020-06-28T01:10:34.8720745Z  'valid': True}
2020-06-28T01:10:34.8720790Z 
2020-06-28T01:10:34.8720872Z TABLE [8]
2020-06-28T01:10:34.8720959Z =========
2020-06-28T01:10:34.8721148Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8721319Z  'encoding': 'no',
2020-06-28T01:10:34.8721497Z  'error-count': 0,
2020-06-28T01:10:34.8721676Z  'format': 'inline',
2020-06-28T01:10:34.8721872Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8722073Z  'resource-name': 'schema-obito_cartorio',
2020-06-28T01:10:34.8722254Z  'row-count': 35,
2020-06-28T01:10:34.8722624Z  'schema': 'table-schema',
2020-06-28T01:10:34.8722819Z  'scheme': 'inline',
2020-06-28T01:10:34.8723051Z  'source': '/app/schema/obito_cartorio.csv',
2020-06-28T01:10:34.8723261Z  'time': 0.03,
2020-06-28T01:10:34.8723468Z  'valid': True}
2020-06-28T01:10:34.8723520Z 
2020-06-28T01:10:34.8723615Z TABLE [9]
2020-06-28T01:10:34.8723712Z =========
2020-06-28T01:10:34.8723931Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8724136Z  'encoding': 'no',
2020-06-28T01:10:34.8724340Z  'error-count': 0,
2020-06-28T01:10:34.8724544Z  'format': 'inline',
2020-06-28T01:10:34.8724797Z  'headers': ['date', 'epidemiological_year', 'epidemiological_week'],
2020-06-28T01:10:34.8725048Z  'resource-name': 'epidemiological-week',
2020-06-28T01:10:34.8725261Z  'row-count': 3289,
2020-06-28T01:10:34.8725475Z  'schema': 'table-schema',
2020-06-28T01:10:34.8725681Z  'scheme': 'inline',
2020-06-28T01:10:34.8725918Z  'source': '/app/data/epidemiological-week.csv',
2020-06-28T01:10:34.8726129Z  'time': 2.168,
2020-06-28T01:10:34.8726318Z  'valid': True}
2020-06-28T01:10:34.8726385Z 
2020-06-28T01:10:34.8726482Z TABLE [10]
2020-06-28T01:10:34.8726582Z =========
2020-06-28T01:10:34.8726787Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8727572Z  'encoding': 'no',
2020-06-28T01:10:34.8727833Z  'error-count': 32,
2020-06-28T01:10:34.8728359Z  'format': 'inline',
2020-06-28T01:10:34.8728625Z  'headers': ['date',
2020-06-28T01:10:34.8728880Z              'state',
2020-06-28T01:10:34.8729220Z              'epidemiological_week_2019',
2020-06-28T01:10:34.8729509Z              'epidemiological_week_2020',
2020-06-28T01:10:34.8729789Z              'new_deaths_sars_2019',
2020-06-28T01:10:34.8730065Z              'new_deaths_pneumonia_2019',
2020-06-28T01:10:34.8730363Z              'new_deaths_respiratory_failure_2019',
2020-06-28T01:10:34.8730657Z              'new_deaths_septicemia_2019',
2020-06-28T01:10:34.8730947Z              'new_deaths_indeterminate_2019',
2020-06-28T01:10:34.8731229Z              'new_deaths_others_2019',
2020-06-28T01:10:34.8731511Z              'new_deaths_sars_2020',
2020-06-28T01:10:34.8731796Z              'new_deaths_pneumonia_2020',
2020-06-28T01:10:34.8732205Z              'new_deaths_respiratory_failure_2020',
2020-06-28T01:10:34.8732492Z              'new_deaths_septicemia_2020',
2020-06-28T01:10:34.8732782Z              'new_deaths_indeterminate_2020',
2020-06-28T01:10:34.8733062Z              'new_deaths_others_2020',
2020-06-28T01:10:34.8733340Z              'new_deaths_covid19',
2020-06-28T01:10:34.8733613Z              'deaths_sars_2019',
2020-06-28T01:10:34.8733892Z              'deaths_pneumonia_2019',
2020-06-28T01:10:34.8734260Z              'deaths_respiratory_failure_2019',
2020-06-28T01:10:34.8734530Z              'deaths_septicemia_2019',
2020-06-28T01:10:34.8734814Z              'deaths_indeterminate_2019',
2020-06-28T01:10:34.8735091Z              'deaths_others_2019',
2020-06-28T01:10:34.8735363Z              'deaths_sars_2020',
2020-06-28T01:10:34.8735641Z              'deaths_pneumonia_2020',
2020-06-28T01:10:34.8735932Z              'deaths_respiratory_failure_2020',
2020-06-28T01:10:34.8736216Z              'deaths_septicemia_2020',
2020-06-28T01:10:34.8736505Z              'deaths_indeterminate_2020',
2020-06-28T01:10:34.8736767Z              'deaths_others_2020',
2020-06-28T01:10:34.8737038Z              'deaths_covid19',
2020-06-28T01:10:34.8737315Z              'new_deaths_total_2019',
2020-06-28T01:10:34.8737595Z              'new_deaths_total_2020',
2020-06-28T01:10:34.8737866Z              'deaths_total_2019',
2020-06-28T01:10:34.8738138Z              'deaths_total_2020'],
2020-06-28T01:10:34.8738418Z  'resource-name': 'obito_cartorio',
2020-06-28T01:10:34.8738667Z  'row-count': 9882,
2020-06-28T01:10:34.8739041Z  'schema': 'table-schema',
2020-06-28T01:10:34.8739279Z  'scheme': 'inline',
2020-06-28T01:10:34.8739647Z  'source': '/app/data/output/obito_cartorio.csv',
2020-06-28T01:10:34.8739856Z  'time': 4.42,
2020-06-28T01:10:34.8740058Z  'valid': False}
2020-06-28T01:10:34.8740251Z ---------
2020-06-28T01:10:34.8740547Z [-,3] [non-matching-header] Header in column 3 doesn't match field name "new_deaths_pneumonia_2019" in the schema
2020-06-28T01:10:34.8740884Z [-,4] [non-matching-header] Header in column 4 doesn't match field name "new_deaths_pneumonia_2020" in the schema
2020-06-28T01:10:34.8741371Z [-,5] [non-matching-header] Header in column 5 doesn't match field name "new_deaths_respiratory_failure_2019" in the schema
2020-06-28T01:10:34.8741714Z [-,6] [non-matching-header] Header in column 6 doesn't match field name "new_deaths_respiratory_failure_2020" in the schema
2020-06-28T01:10:34.8742042Z [-,7] [non-matching-header] Header in column 7 doesn't match field name "new_deaths_covid19" in the schema
2020-06-28T01:10:34.8742369Z [-,8] [non-matching-header] Header in column 8 doesn't match field name "epidemiological_week_2019" in the schema
2020-06-28T01:10:34.8742691Z [-,9] [non-matching-header] Header in column 9 doesn't match field name "epidemiological_week_2020" in the schema
2020-06-28T01:10:34.8743001Z [-,10] [non-matching-header] Header in column 10 doesn't match field name "deaths_covid19" in the schema
2020-06-28T01:10:34.8743341Z [-,11] [non-matching-header] Header in column 11 doesn't match field name "deaths_respiratory_failure_2019" in the schema
2020-06-28T01:10:34.8743680Z [-,12] [non-matching-header] Header in column 12 doesn't match field name "deaths_respiratory_failure_2020" in the schema
2020-06-28T01:10:34.8744004Z [-,13] [non-matching-header] Header in column 13 doesn't match field name "deaths_pneumonia_2019" in the schema
2020-06-28T01:10:34.8744321Z [-,14] [non-matching-header] Header in column 14 doesn't match field name "deaths_pneumonia_2020" in the schema
2020-06-28T01:10:34.8744592Z [-,15] [extra-header] There is an extra header in column 15
2020-06-28T01:10:34.8744836Z [-,16] [extra-header] There is an extra header in column 16
2020-06-28T01:10:34.8745094Z [-,17] [extra-header] There is an extra header in column 17
2020-06-28T01:10:34.8745349Z [-,18] [extra-header] There is an extra header in column 18
2020-06-28T01:10:34.8745603Z [-,19] [extra-header] There is an extra header in column 19
2020-06-28T01:10:34.8745926Z [-,20] [extra-header] There is an extra header in column 20
2020-06-28T01:10:34.8746190Z [-,21] [extra-header] There is an extra header in column 21
2020-06-28T01:10:34.8746566Z [-,22] [extra-header] There is an extra header in column 22
2020-06-28T01:10:34.8746786Z [-,23] [extra-header] There is an extra header in column 23
2020-06-28T01:10:34.8747006Z [-,24] [extra-header] There is an extra header in column 24
2020-06-28T01:10:34.8747212Z [-,25] [extra-header] There is an extra header in column 25
2020-06-28T01:10:34.8747487Z [-,26] [extra-header] There is an extra header in column 26
2020-06-28T01:10:34.8747705Z [-,27] [extra-header] There is an extra header in column 27
2020-06-28T01:10:34.8747923Z [-,28] [extra-header] There is an extra header in column 28
2020-06-28T01:10:34.8748139Z [-,29] [extra-header] There is an extra header in column 29
2020-06-28T01:10:34.8748356Z [-,30] [extra-header] There is an extra header in column 30
2020-06-28T01:10:34.8748573Z [-,31] [extra-header] There is an extra header in column 31
2020-06-28T01:10:34.8748796Z [-,32] [extra-header] There is an extra header in column 32
2020-06-28T01:10:34.8749004Z [-,33] [extra-header] There is an extra header in column 33
2020-06-28T01:10:34.8749221Z [-,34] [extra-header] There is an extra header in column 34
endersonmaia commented 4 years ago

The errors are in the following tables:

The schemas used by the project are maintained here, which ends up creating duplication: whenever they are changed there, datapackage.json has to be updated as well.

/cc @augusto-herrmann

augusto-herrmann commented 4 years ago

Right, ideally those schemas would be generated from datapackage.json, and not the other way around.

endersonmaia commented 4 years ago

Right, ideally those schemas would be generated from datapackage.json, and not the other way around.

That is worth an issue or a PR: identify where the code references schemas/*.csv. datapackage is already among the project's dependencies, so this could certainly be automated.

endersonmaia commented 4 years ago

Right, ideally those schemas would be generated from datapackage.json, and not the other way around.

https://github.com/turicas/brasil.io/issues/204

augusto-herrmann commented 4 years ago

That issue is yet another path, different from what we are suggesting here.

Here:

There:

The whole question comes down to the development process. Today the developer is @turicas, and he seems to prefer to start defining the schema from the database. As long as that continues, the database would have to be the starting point.

endersonmaia commented 4 years ago

Right, so the idea would be to develop a script that generates datapackage.json from this metadata, correct?
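
A rough sketch of what such a script might look like, assuming the metadata is available in the same `field_name`/`field_type` layout that the validation log above shows for the schema CSVs. The function name `resource_from_schema_csv`, the sample rows, and the resource path are hypothetical. Type names are written through unchanged, so they may not match Table Schema types without a mapping step.

```python
import csv
import io
import json


def resource_from_schema_csv(name, path, schema_csv_text):
    """Build one data package resource descriptor from schema CSV text.

    Expects the two columns seen in the log: field_name and field_type.
    """
    reader = csv.DictReader(io.StringIO(schema_csv_text))
    fields = [
        {"name": row["field_name"], "type": row["field_type"]}
        for row in reader
    ]
    return {"name": name, "path": path, "schema": {"fields": fields}}


# Hypothetical sample rows; the real schema files may use other type names.
sample_schema = "field_name,field_type\ndate,date\nstate,string\n"

resource = resource_from_schema_csv(
    "boletim", "data/output/boletim.csv", sample_schema
)
print(json.dumps({"resources": [resource]}, indent=2))
```

Field order in the generated descriptor follows row order in the schema CSV, which matters because the header checks in the log are positional.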

turicas commented 4 years ago

If datapackage.json meets our current requirements (explained below), then I think the ideal would be to keep only datapackage.json in the repository. That way Brasil.IO could consume that file, and the schema/*.csv files could be generated automatically from datapackage.json (or, once rows supports pgimport and csv2sqlite with data packages, they could be deleted).

The current requirements are:

I don't know the datapackage specification well, but if there is a way to embed custom metadata (the Brasil.IO-specific ones), then we can start a migration process (it will be much better if it is standardized this way :).
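
To make the "generate schema/*.csv from datapackage.json" direction concrete, here is a minimal sketch using only the standard library. It assumes the standard Data Package layout (`resources[].schema.fields[]` with `name` and `type`) and the `field_name,field_type` header seen in the log above. The helper names and the sample descriptor are hypothetical, and any translation between Table Schema type names and the ones the existing schema CSVs use is left out.

```python
import csv
import io


def schema_rows(datapackage, resource_name):
    """Return (field_name, field_type) pairs for one resource."""
    for resource in datapackage["resources"]:
        if resource["name"] == resource_name:
            return [
                # Table Schema treats a missing type as "string".
                (field["name"], field.get("type", "string"))
                for field in resource["schema"]["fields"]
            ]
    raise KeyError(resource_name)


def write_schema_csv(rows, fobj):
    """Write the pairs in the field_name,field_type layout."""
    writer = csv.writer(fobj, lineterminator="\n")
    writer.writerow(["field_name", "field_type"])
    writer.writerows(rows)


# Hypothetical descriptor; in practice this would come from
# json.load() on the repository's datapackage.json.
package = {
    "resources": [
        {
            "name": "boletim",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "state", "type": "string"},
                ]
            },
        }
    ]
}

buffer = io.StringIO()
write_schema_csv(schema_rows(package, "boletim"), buffer)
print(buffer.getvalue())
```

With a single source of truth like this, the column order in the CSV and in the package cannot drift apart, which is the duplication problem raised earlier in the thread.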

turicas commented 4 years ago

@augusto-herrmann, since you know the data package specification better, do you think it meets the needs above? If so, shall we open an issue in the Brasil.IO repository to handle this?

On generating the API documentation: the metadata needs to be stored in the Brasil.IO database (and will not be exactly the same as this datapackage.json I proposed, because the dataset will not always be fully up to date with the repository). So it makes sense for the API documentation to be generated automatically from the Brasil.IO database and not from the (future) datapackage.json.

endersonmaia commented 4 years ago

I think we should be discussing this over in the issue https://github.com/turicas/brasil.io/issues/204

turicas commented 4 years ago

I think we should be discussing this over in the issue turicas/brasil.io#204

Agreed. I pasted these comments of mine there.

augusto-herrmann commented 4 years ago

The tests are failing again. Reopen this issue or create a new one?

endersonmaia commented 4 years ago

The tests are failing again. Reopen this issue or create a new one?

create a new one

fields must have been added or their order changed
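
For anyone wanting to check this locally, a simplified sketch of a positional header comparison follows. It is not goodtables' implementation, only an imitation of the error labels seen in the log at the top of this issue, and `compare_headers` is a hypothetical helper. The sample uses the `boletim` headers from that log; the expected schema order is inferred from its error messages, and column 1 is assumed to be `date` because it raised no error.

```python
def compare_headers(csv_headers, schema_fields):
    """Compare headers to schema field names by position.

    Returns (column, problem) tuples with 1-based column numbers.
    """
    problems = []
    for position, header in enumerate(csv_headers, start=1):
        if position > len(schema_fields):
            # The CSV has more columns than the schema declares.
            problems.append((position, "extra-header"))
        elif header != schema_fields[position - 1]:
            # Same position, different name: renamed or reordered.
            problems.append((position, "non-matching-header"))
    # Schema fields with no corresponding CSV column.
    for position in range(len(csv_headers) + 1, len(schema_fields) + 1):
        problems.append((position, "missing-header"))
    return problems


csv_headers = ["date", "notes", "state", "url"]
schema_fields = ["date", "state", "url", "notes"]

for column, problem in compare_headers(csv_headers, schema_fields):
    print(f"[-,{column}] [{problem}]")
```

Because the comparison is positional, moving a single column is enough to flag every column after it, which is consistent with the long run of errors reported for `obito_cartorio`.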

augusto-herrmann commented 4 years ago

Created #193.