thegraphnetwork / epigraphhub_py

Epigraphhub Python package
GNU General Public License v3.0
2 stars 9 forks

Create separate directories for each disease in sinan_fetch script #189

Closed esloch closed 1 year ago

esloch commented 1 year ago

🚀 Feature Request

Improve the structure of the downloaded files by grouping them per disease, so that the Airflow DAGs can read them more easily.

luabida commented 1 year ago

Notes: As it stands, Airflow will try to load all 150+ parquet files into the database at the same time, which causes too much memory overhead.

Possible solution: Pre-fetch the years for each disease, so that tasks differ only by disease, not by year. Example: download('zika')

The download method should be able to retrieve all the years available for each disease in SINAN's database and insert them into the database. Another function could remove the directory of a given disease; it could be run by a task at the end of all tasks, or right after the data is inserted into postgres, called by load_to_db.

This way Airflow will handle the asynchronicity better (even if it takes more time to finish a disease).
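
A minimal sketch of the proposal above, under stated assumptions: `fetch_year` is a hypothetical callable standing in for whatever actually downloads one year of SINAN data (it is not a PySUS function), and the directory layout `base_dir/<disease>/` is only one possible choice. The point is that the unit of work is the disease, and that the cleanup step is a separate function a final task could call.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, List


def download(
    disease: str,
    years: Iterable[int],
    fetch_year: Callable[[str, int, Path], Path],
    base_dir: str,
) -> List[Path]:
    """Fetch every year of one disease into its own directory.

    `fetch_year` is a hypothetical downloader: it receives the disease,
    the year and the target directory, and returns the written file.
    """
    disease_dir = Path(base_dir) / disease.lower()
    disease_dir.mkdir(parents=True, exist_ok=True)
    # Years are processed one after another inside a single task.
    return [fetch_year(disease, year, disease_dir) for year in years]


def remove_disease_dir(disease: str, base_dir: str) -> None:
    """Delete the directory of one disease once its data is in the database."""
    shutil.rmtree(Path(base_dir) / disease.lower(), ignore_errors=True)
```

With this shape, an Airflow task would only need the disease name, and `remove_disease_dir` could run either as a last task or right after the load step.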

fccoelho commented 1 year ago

@luabida do you mean having one task per disease, with each task iterating over the multiple years? If so, I agree.

luabida commented 1 year ago

Yes, I was able to do it this way, but during the tests a single decode error in PySUS was enough to ruin the whole process. I will need to modify the SINAN script to make this work.

NOTE: 141 decode errors during the download test, maybe we will have to do it differently
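
One way the SINAN script could keep a single bad file from stopping a whole disease is to catch the error per year and report the failures at the end. This is only a sketch: `fetch_year` is a hypothetical callable, and it is an assumption that the decode errors surface as Python's `UnicodeDecodeError`; the real exception type would have to be checked against PySUS.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def fetch_all_years(
    disease: str,
    years: Iterable[int],
    fetch_year: Callable[[str, int], Any],
) -> Tuple[List[Tuple[int, Any]], Dict[int, str]]:
    """Fetch each year independently and collect failures instead of aborting.

    Returns the successful (year, result) pairs and a mapping of
    failed year -> error message.
    """
    fetched: List[Tuple[int, Any]] = []
    failed: Dict[int, str] = {}
    for year in years:
        try:
            fetched.append((year, fetch_year(disease, year)))
        except UnicodeDecodeError as exc:  # assumed error type, see above
            # Record the failure and move on to the next year.
            failed[year] = str(exc)
    return fetched, failed
```

The `failed` mapping can then be logged or retried separately, so the years that decode correctly still reach the database.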

fccoelho commented 1 year ago

NOTE: 141 decode errors during the download test, maybe we will have to do it differently

This is an issue that should be fixed and tested in PySUS. Please open an issue there, @luabida.

luabida commented 1 year ago

Funnily enough, I've found a package for reading DBC files inspired by PySUS 😂 https://github.com/gbletsch/pydbc

Trying to see if this package can help us out here

xmnlab commented 1 year ago

Interesting :) is it better than pyreaddbc?

On Thu, Nov 3, 2022, 07:37 Luã Bida Vacaro @.***> wrote:

Funnily enough, I've found a package for reading DBC files inspired by PySUS 😂 https://github.com/gbletsch/pydbc

Trying to see if this package can help us out here


luabida commented 1 year ago

is it better than pyreaddbc?

I wasn't aware of this package, thanks for sharing. As for pydbc, I couldn't make it work, maybe because of some C dependencies that I have no experience with.

luabida commented 1 year ago

Converting DBC files to DBF is very quick, but converting DBC or DBF files to a dataframe consumes tons of memory (at least on my machine). Do you know if there is a way to skip this intermediate step, sending the DBF file directly to the SQL server and getting the dataframes via SQL queries?
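
One possible direction for the question above is to never build the full dataframe: read the records one at a time and insert them into the database in fixed-size batches, so memory use is bounded by the batch size. The sketch below is an illustration under assumptions, not a tested solution for SINAN files. It takes any iterable of dict records (a streaming DBF reader would be one source, untested here), and it uses the standard-library `sqlite3` as a stand-in for the postgres server; `batched` and `insert_records` are hypothetical helper names.

```python
import sqlite3
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Sequence


def batched(records: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `size` records without materializing the input."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_records(
    conn: sqlite3.Connection,
    table: str,
    columns: Sequence[str],
    records: Iterable[Dict[str, Any]],
    batch_size: int = 10_000,
) -> int:
    """Insert records batch by batch and return the number of rows written."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    sql = f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})'
    total = 0
    for batch in batched(records, batch_size):
        # Only one batch of rows is held in memory at a time.
        conn.executemany(sql, [tuple(rec.get(c) for c in columns) for rec in batch])
        conn.commit()
        total += len(batch)
    return total
```

Once the rows are in the database, dataframes can be built from SQL queries that select only the columns and years actually needed.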

github-actions[bot] commented 1 year ago

Stale issue message