monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Delete incomplete cached file during ingest download? #600

Closed ptgolden closed 2 weeks ago

ptgolden commented 3 weeks ago

While running ingest download --all, I hit a couple of errors: one on my end from ending the process prematurely, and one from a network disruption. Running the command again resumed the ingest, but it treated the file that was being downloaded when the error occurred as cached instead of attempting to re-download it.

To reproduce, run ingest download --all and press ^C to send an interrupt to the program. If the script was in the middle of downloading a file, that file will appear in the data/ directory as an empty file.

An easy fix would be to delete the incomplete file when an error occurs here: https://github.com/monarch-initiative/monarch-ingest/blob/24f9de3047e9c762d9d2fb3f757858c948c0b162/src/monarch_ingest/main.py#L44-L52
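A rough sketch of what that clean-up could look like, not taken from the linked code. The names download_or_clean_up and fetch are hypothetical; fetch stands in for whatever call actually writes the file.

```python
from pathlib import Path
from typing import Callable


def download_or_clean_up(fetch: Callable[[Path], None], outfile: Path) -> None:
    """Run fetch(outfile); if it fails or is interrupted, delete the partial file.

    `fetch` is a hypothetical stand-in for the real download call.
    """
    try:
        fetch(outfile)
    except BaseException:
        # Catch BaseException rather than Exception: KeyboardInterrupt (^C)
        # is not an Exception subclass and would otherwise skip the clean-up.
        outfile.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
        raise  # re-raise so the caller still sees the original error
```

With this in place, the next run would not find a leftover empty file and so would not count it as cached.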

A (much) more complicated fix would involve supporting partial downloads in monarch-initiative/kghub-downloader.

ptgolden commented 3 weeks ago

Actually, I think the issue is here: https://github.com/monarch-initiative/kghub-downloader/blob/2c0f3d2b2d262e986f4c764bc69516fc5825c260/kghub_downloader/download_utils.py#L209

Instead of opening the output file at the same time as the request is made, it should be opened only once there is data ready to be written. I'm happy to open a PR if that sounds okay.
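A minimal sketch of the ordering I have in mind, not the actual kghub-downloader code. save_stream and open_stream are hypothetical names; open_stream is any callable that makes the request and returns a readable binary stream, for example lambda: urllib.request.urlopen(url).

```python
import shutil
from pathlib import Path
from typing import BinaryIO, Callable


def save_stream(open_stream: Callable[[], BinaryIO], outfile: Path) -> None:
    """Open the destination file only after the request has succeeded.

    `open_stream` is a hypothetical stand-in for the real request call.
    """
    # Make the request first. If it raises, outfile is never created.
    with open_stream() as response:
        # Only now, with a response in hand, open the destination for writing.
        with outfile.open("wb") as fh:
            shutil.copyfileobj(response, fh)
```

This only covers a request that fails before any data arrives. An interruption partway through the transfer would still leave a truncated file, so some clean-up on error would still be needed for that case.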