thegraphnetwork / epigraphhub_py

Epigraphhub Python package
GNU General Public License v3.0

chore(DAG-SINAN): proposal to split DAGs by disease #211

Closed luabida closed 1 year ago

luabida commented 1 year ago

follows https://github.com/thegraphnetwork/EpiGraphHub/pull/170

depends on https://github.com/AlertaDengue/PySUS/pull/117

luabida commented 1 year ago

2f84a99 fixes:

Traceback (most recent call last):
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1471, in _run_raw_task
    self._execute_task_with_callbacks(context, test_mode)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1618, in _execute_task_with_callbacks
    result = self._execute_task(context, task_orig)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1679, in _execute_task
    result = execute_callable(context=context)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/airflow/decorators/base.py", line 179, in execute
    return_value = super().execute(context)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/brasil/sinan.py", line 77, in upload
    raise e
  File "/opt/airflow/dags/brasil/sinan.py", line 74, in upload
    loading.upload(parquet_dirs)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/epigraphhub/data/brasil/sinan/loading.py", line 79, in upload
    upsert_df_in_chunks(df)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/epigraphhub/data/brasil/sinan/loading.py", line 77, in upsert_df_in_chunks
    raise e
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/epigraphhub/data/brasil/sinan/loading.py", line 55, in upsert_df_in_chunks
    upsert(
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/pangres/core.py", line 302, in upsert
    executor.execute(connectable=con, if_row_exists=if_row_exists, chunksize=chunksize)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/pangres/executor.py", line 87, in execute
    pse.upsert(if_row_exists=if_row_exists, chunksize=chunksize)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/pangres/engine.py", line 551, in upsert
    upq.execute(db_type=self._db_type, values=chunk, if_row_exists=if_row_exists)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/pangres/upsert_query.py", line 231, in execute
    return self.connection.execute(query)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1380, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2128, in _handle_dbapi_exception
    util.raise_(exc_info[1], with_traceback=exc_info[2])
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/opt/conda/envs/epigraphhub/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    try:
ValueError: A string literal cannot contain NUL (0x00) characters.
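
For context, this error is raised because PostgreSQL text values cannot contain NUL bytes. The thread does not show what 2f84a99 changes, so the following is only a minimal sketch of one common workaround, assuming the data is a pandas DataFrame that is cleaned before `upsert_df_in_chunks` is called. The helper name `strip_nul_chars` is hypothetical and is not part of epigraphhub or pangres.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def strip_nul_chars(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NUL (0x00) characters removed from string values.

    Hypothetical helper: not taken from the epigraphhub code base.
    """
    cleaned = df.copy()
    for column in cleaned.columns:
        series = cleaned[column]
        # Only text-like columns can carry NUL characters.
        if is_object_dtype(series) or is_string_dtype(series):
            cleaned[column] = series.map(
                # Leave non-string values (None, NaN, numbers) untouched.
                lambda value: value.replace("\x00", "")
                if isinstance(value, str)
                else value
            )
    return cleaned
```

With a helper like this, the cleaning would run once per DataFrame, just before the chunks are sent to the database, and the original DataFrame is left unmodified.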

luabida commented 1 year ago

Ready for review & merge.

luabida commented 1 year ago

@fccoelho I've reduced the PDFs to xlsx sheets. Do you think it is OK to keep these files in the repo? I'm starting a module to extract these sheets into dataframes. Note that not every disease was included in the tar file that was sent to me.

luabida commented 1 year ago

2023-02-23 15:16:54.801 | ERROR    | __main__:metadata_df:49 - Metadata not available for Cancer
2023-02-23 15:16:57.439 | ERROR    | __main__:metadata_df:49 - Metadata not available for Contact Communicable Disease
2023-02-23 15:16:57.439 | ERROR    | __main__:metadata_df:49 - Metadata not available for Acidentes de Trabalho
2023-02-23 15:17:09.622 | ERROR    | __main__:metadata_df:49 - Metadata not available for Poliomielite
2023-02-23 15:17:10.480 | ERROR    | __main__:metadata_df:49 - Metadata not available for Sífilis Adquirida
2023-02-23 15:17:14.665 | ERROR    | __main__:metadata_df:49 - Metadata not available for Violência Domestica
2023-02-23 15:17:14.666 | ERROR    | __main__:metadata_df:49 - Metadata not available for Zika

luabida commented 1 year ago

Comparison of the metadata columns against Animais Peçonhentos (columns in the parquet file that are missing from the metadata dataframe):

In [27]: for column in ANIM_parquet.columns:
    ...:     if column not in metadata_dataframe.columns:
    ...:         print(column)
TP_NOT
ID_AGRAVO
DT_NOTIFIC
SEM_NOT
NU_ANO
SG_UF_NOT
ID_MUNICIP
ID_REGIONA
DT_SIN_PRI
SEM_PRI
DT_NASC
NU_IDADE_N
CS_SEXO
CS_GESTANT
CS_RACA
CS_ESCOL_N
SG_UF
ID_MN_RESI
ID_RG_RESI
ID_PAIS
NU_AMPO_7
NU_AMPO_5
COM_COMPOR
DT_DIGITA
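
The same check can be wrapped in a small function so it can be repeated for other diseases. This is a sketch only: the function name `columns_missing_from` and the toy data in the usage example are hypothetical, and it assumes both objects are pandas DataFrames as in the session above.

```python
import pandas as pd


def columns_missing_from(data: pd.DataFrame, metadata: pd.DataFrame) -> list[str]:
    """List the columns of `data` that `metadata` lacks, in `data` order.

    Hypothetical helper mirroring the interactive loop above.
    """
    known = set(metadata.columns)  # set lookup instead of scanning the Index
    return [column for column in data.columns if column not in known]


# Toy frames for illustration; the real inputs would be the parquet
# DataFrame and the metadata DataFrame.
example_data = pd.DataFrame(columns=["TP_NOT", "DT_INVEST", "DT_DIGITA"])
example_metadata = pd.DataFrame(columns=["DT_INVEST"])
print(columns_missing_from(example_data, example_metadata))
```

Preserving the order of the data columns keeps the output directly comparable with the full column list of the parquet file.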

luabida commented 1 year ago

In [28]: list(ANIM_parquet.columns)
Out[28]: 
['TP_NOT',
 'ID_AGRAVO',
 'DT_NOTIFIC',
 'SEM_NOT',
 'NU_ANO',
 'SG_UF_NOT',
 'ID_MUNICIP',
 'ID_REGIONA',
 'DT_SIN_PRI',
 'SEM_PRI',
 'DT_NASC',
 'NU_IDADE_N',
 'CS_SEXO',
 'CS_GESTANT',
 'CS_RACA',
 'CS_ESCOL_N',
 'SG_UF',
 'ID_MN_RESI',
 'ID_RG_RESI',
 'ID_PAIS',
 'DT_INVEST',
 'ID_OCUPA_N',
 'ANT_DT_ACI',
 'ANT_UF',
 'ANT_MUNIC_',
 'ANT_LOCALI',
 'ANT_ZONA',
 'ANT_TEMPO_',
 'ANT_LOCA_1',
 'MCLI_LOCAL',
 'CLI_DOR',
 'CLI_EDEMA',
 'CLI_EQUIMO',
 'CLI_NECROS',
 'CLI_LOCAL_',
 'CLI_LOCA_1',
 'MCLI_SIST',
 'CLI_NEURO',
 'CLI_HEMORR',
 'CLI_VAGAIS',
 'CLI_MIOLIT',
 'CLI_RENAL',
 'CLI_OUTR_2',
 'CLI_OUTR_3',
 'CLI_TEMPO_',
 'TP_ACIDENT',
 'ANI_TIPO_1',
 'ANI_SERPEN',
 'ANI_ARANHA',
 'ANI_LAGART',
 'TRA_CLASSI',
 'CON_SOROTE',
 'NU_AMPOLAS',
 'NU_AMPOL_1',
 'NU_AMPOL_8',
 'NU_AMPOL_6',
 'NU_AMPOL_4',
 'NU_AMPO_7',
 'NU_AMPO_5',
 'NU_AMPOL_9',
 'NU_AMPOL_3',
 'COM_LOC',
 'COM_SECUND',
 'COM_NECROS',
 'COM_COMPOR',
 'COM_DEFICT',
 'COM_APUTAC',
 'COM_SISTEM',
 'COM_RENAL',
 'COM_EDEMA',
 'COM_SEPTIC',
 'COM_CHOQUE',
 'DOENCA_TRA',
 'EVOLUCAO',
 'DT_OBITO',
 'DT_ENCERRA',
 'DT_DIGITA']

fccoelho commented 1 year ago

@luabida it is not worth investing too much in this issue beyond the type casting and metadata, because DATASUS will migrate all of SINAN to a new platform in the near future.

luabida commented 1 year ago

@fccoelho it is not possible to change the method that extracts the data from SINAN DBCs, because it is also used for other PySUS data: https://github.com/AlertaDengue/PySUS/blob/master/pysus/online_data/__init__.py#L108. For that reason, I'm rewriting some of the methods so that the change can be made without breaking the code for other data in PySUS.

luabida commented 1 year ago

https://github.com/thegraphnetwork/epigraphhub_py/actions/runs/4341044635 depends on PySUS PR merge & release

github-actions[bot] commented 1 year ago

:tada: This PR is included in version 2.0.4 :tada:

The release is available on:

Your semantic-release bot :package::rocket: