esloch closed this issue 1 year ago
Notes: As it stands, Airflow will try to fetch the 150+ parquet files into the database at the same time, causing too much memory overhead.
Possible solution: pre-fetch the years for each disease, so that tasks are differentiated only by disease, not by year. Example: download('zika')
The download method should retrieve all the years each disease provides in SINAN's database and insert them into the database. Another function could remove the directory of a given disease; it could run as a task at the end of all other tasks, or right after the data is inserted into Postgres, called by load_to_db.
This way Airflow will handle the asynchronicity better (even if it takes more time to finish a disease).
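The proposal above could be sketched roughly as below. This is only an illustration: `download_year` and `load_to_db` are hypothetical stand-ins for the real PySUS/epigraphhub_py calls, not the actual API.

```python
import shutil
from pathlib import Path

# Placeholder stubs for the real logic (names are assumptions, not the
# actual epigraphhub_py API):
def download_year(disease: str, year: int, dest: Path) -> Path:
    path = dest / f"{disease.upper()}{year}.parquet"
    path.write_text("stub")  # the real code would fetch from SINAN via PySUS
    return path

def load_to_db(parquet_path: Path) -> None:
    pass  # the real code would insert the parquet file into Postgres

def download(disease: str, years: list, data_dir: str = "/tmp/pysus") -> None:
    """Fetch every year for one disease, insert each into the database,
    then remove the disease directory to free disk space.

    One Airflow task would call this once per disease, so only one
    disease's files sit on disk at a time.
    """
    disease_dir = Path(data_dir) / disease
    disease_dir.mkdir(parents=True, exist_ok=True)
    try:
        for year in years:
            load_to_db(download_year(disease, year, disease_dir))
    finally:
        shutil.rmtree(disease_dir)  # clean up even if a year fails
```

The cleanup in `finally` plays the role of the separate "remove the directory" task mentioned above, but runs inside the same per-disease task.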
@luabida you mean having one task per disease, with each task iterating over the multiple years? If so, I agree.
Yes, I was able to do it this way, but during the tests a single decode error in PySUS was enough to ruin the whole process. I will need to modify the SINAN script to make this work.
NOTE: 141 decode errors during the download test, maybe we will have to do it differently
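One way to keep a single decode failure from aborting a whole disease is to catch errors per year and report them at the end. A minimal sketch, assuming `fetch_year` is whatever callable actually downloads one year (e.g. a PySUS wrapper); the name is hypothetical:

```python
def download_all_years(disease, years, fetch_year):
    """Try every year; collect failures instead of aborting on the first.

    Returns (successful_results, [(year, exception), ...]) so the caller
    can log or retry the failed years separately.
    """
    ok, failed = [], []
    for year in years:
        try:
            ok.append(fetch_year(disease, year))
        except Exception as exc:  # e.g. UnicodeDecodeError from a bad DBC file
            failed.append((year, exc))
    return ok, failed
```

With this shape, the 141 decode errors would surface as a list to report upstream instead of killing the Airflow task.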
That is an issue that needs to be fixed.
> NOTE: 141 decode errors during the download test, maybe we will have to do it differently
This is an issue that should be fixed and tested in PySUS. Please open an issue there, @luabida.
Funny enough, I've found a package for reading DBC files inspired by PySUS 😂 https://github.com/gbletsch/pydbc
Trying to see if this package can help us out here.
Interesting :) is it better than pyreaddbc?
> is it better than pyreaddbc?
I wasn't aware of this package, thanks for sharing. As for pydbc, I couldn't make it work; maybe it has some C dependencies that I have no experience with.
Converting DBC files to DBF is very quick, but converting DBC or DBF files to a dataframe consumes a lot of memory (at least on my machine). Do you know if there is a way to skip this intermediate step, sending the DBF file directly into the SQL server and getting the dataframes back via SQL queries?
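A possible middle ground for the memory problem is to stream the DBF records in batches instead of materializing one giant dataframe. A sketch, assuming the records arrive one at a time from an iterator (as `dbfread`'s `DBF` reader provides); the table name and `engine` in the commented loop are assumptions:

```python
from itertools import islice

def iter_batches(records, size=50_000):
    """Yield lists of up to `size` records from any iterator, so only one
    batch is held in memory at a time."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

# Sketch of the loading loop (not runnable as-is; `engine` and the DBF
# filename are placeholders):
#
# import pandas as pd
# from dbfread import DBF
# for batch in iter_batches(DBF("DISEASE_YEAR.dbf"), size=50_000):
#     pd.DataFrame(batch).to_sql("sinan_table", engine,
#                                if_exists="append", index=False)
```

Each `to_sql("…", if_exists="append")` call then inserts one batch, so peak memory is bounded by the batch size rather than the whole file.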
🚀 Feature Request
Improve the structure of the files downloaded per disease so that they can be read by DAGs in Airflow.