okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License

Dataset checksum and avoiding duplicate downloads #197

Closed willianpaixao closed 6 years ago

willianpaixao commented 6 years ago

What is the problem? The process of downloading datasets takes quite a long time, and the downloader class is not very effective (Ref #87). Plus, if retriggered, the script will re-download the files even if the local version exists and is not corrupted. This way, running all tests takes a long time even if you already have the datasets (Ref #122).

How can this be addressed?
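For concreteness, a minimal sketch of the kind of check being proposed, assuming a known SHA-256 per dataset (the function names are hypothetical; the toolbox does not do this today):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the local file in chunks so large datasets don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """Download only when the file is missing or its checksum doesn't match."""
    return not path.exists() or sha256_of(path) != expected_sha256
```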

Labels [enhancement]

cuducos commented 6 years ago

I think there's a little bit of confusion here and there.

The process of downloading datasets takes quite a long time.

Yes, it deals with a lot of data, so this is basically normal. I wouldn't worry about that. It downloads roughly 10 datasets with 300k lines each, cleans them, and translates them, so it's just normal for it to take a few minutes as it does. To sum up: this is not an issue IMHO.

The downloader class is not very effective. (Ref #87)

In my opening post I don't recall questioning the effectiveness of the downloader class. My main point there was semantics, architecture, and design – not performance.

Plus, if retriggered, the script will re-download the files even if the local version exists and is not corrupted.

This is on purpose: we want to always run with fresh official data. It happens that official institutions change their datasets, so if you run from scratch, it will redo all the work to make sure your data is official and up to date.

Alternatively, you can skip this and download some versioned datasets as explained in the README.md.

This way, running all tests takes a long time, even if you already have the datasets.

As the discussion followed, journey tests are basically meant to be run on the CI, to ensure everything is working. So this long runtime should not be an issue for the majority of developers around here.


That said, I do disagree with the tone of your opening post. However, I do agree with parts of your proposed TODO list:

Rewrite fetch, translate, and clean into more atomic methods (Ref #87)

Yes, but we don't need an issue for that. That's what #87 is about.
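As a rough illustration only (the class and column mapping below are hypothetical, not the toolbox's actual API), splitting the pipeline into more atomic methods could look like this:

```python
from urllib.request import urlretrieve

import pandas as pd


class Dataset:
    """Hypothetical split of the pipeline into small, independently testable steps."""

    TRANSLATIONS = {"dataEmissao": "issue_date"}  # illustrative column mapping

    def __init__(self, url: str, path: str):
        self.url, self.path = url, path

    def fetch(self) -> str:
        """Download the raw file and return its local path."""
        urlretrieve(self.url, self.path)
        return self.path

    def translate(self, df: pd.DataFrame) -> pd.DataFrame:
        """Rename Portuguese column names to their English equivalents."""
        return df.rename(columns=self.TRANSLATIONS)

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        """Drop rows that carry no data at all."""
        return df.dropna(how="all")
```

Each step could then be unit-tested in isolation instead of only through the full fetch-translate-clean journey.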

Migration to Dask (Suggested by @cuducos)

That would be wonderful. Can we have an issue for that?
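For reference, a minimal sketch of what Dask could buy here: dask.dataframe mirrors the pandas API but reads the CSVs lazily and in parallel (the file and column names below are illustrative):

```python
import dask.dataframe as dd

# Lazily read a large CSV in partitions instead of loading it into memory at once
df = dd.read_csv("reimbursements.csv", dtype={"document_id": "object"})

# Operations build a task graph; nothing runs until .compute() is called
totals = df.groupby("congressperson_name").total_net_value.sum()
print(totals.compute())
```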

Use a better library for downloading and syncing datasets (tox was a suggestion by @cuducos).

Sorry, but I think you got me wrong here. tox is about tests, not about syncing datasets. No idea how tox can help in downloading. Would you mind clarifying?

A checksum file could be added to the bucket to ensure we have downloaded the files correctly, most importantly the "verified" file.

Finally, something that matches the issue title. But with such a long description not mentioning checksums at all, my opinion is as follows:

All the rest seems to be already covered by #87 and #122. What do you think, @willianpaixao? Was I too harsh? I'm just trying to keep things organized to make it easier for everyone ; )

willianpaixao commented 6 years ago

Yes, it deals with a lot of data, so this is basically normal. I wouldn't worry about that. It downloads roughly 10 datasets with 300k lines each, cleans them, and translates them, so it's just normal for it to take a few minutes as it does. To sum up: this is not an issue IMHO.

This could still be sped up with async tasks and maybe the files' "last modification date". There's no need to re-download a file that has been untouched for years. The checksum can also help with that.
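A sketch of the "last modification date" idea as an HTTP conditional GET (using requests here as an assumption; the actual downloader may work differently):

```python
import os
from email.utils import formatdate

import requests


def download_if_modified(url: str, path: str) -> bool:
    """Re-download only when the server reports a newer version."""
    headers = {}
    if os.path.exists(path):
        # Ask the server to skip the body if the remote file is unchanged
        headers["If-Modified-Since"] = formatdate(os.path.getmtime(path), usegmt=True)

    response = requests.get(url, headers=headers, stream=True)
    if response.status_code == 304:  # Not Modified: keep the local copy
        return False

    with open(path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=1 << 20):
            handle.write(chunk)
    return True
```

This only works when the server honors If-Modified-Since, which is why the checksum remains the more reliable signal.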

In my opening post I don't recall questioning the effectiveness of the downloader class. My main point there was semantics, architecture, and design – not performance.

That was my summary of the whole issue discussion and code review. I didn't say you said that.

That said I do disagree with the tone of your opening post.

My apologies, I didn't mean to sound rude. 😭

Sorry, but I think you got me wrong here. tox is about tests, not about syncing datasets. No idea how tox can help in downloading. Would you mind clarifying?

Yes, I mixed things up. async or another library can parallelize the bigger tasks; tox can improve the tests.

Was I too harsh?

Yes, I'm now crying under the bed. 😭 💔

As for the suggested actions, I'll follow all of them. But I'll wait some time before closing this ticket so you can add something.

cuducos commented 6 years ago

This could still be sped up with async tasks and maybe the files' "last modification date". There's no need to re-download a file that has been untouched for years. The checksum can also help with that.

From our experience, we cannot trust the file date. But again: the checksum and asyncio (as we did in downloader.py) are indeed great ideas!
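For reference, a minimal sketch of concurrent downloads with asyncio, assuming aiohttp (which may differ from what downloader.py actually does):

```python
import asyncio

import aiohttp


async def fetch(session, url, path):
    """Stream one dataset to disk without blocking the other downloads."""
    async with session.get(url) as response:
        with open(path, "wb") as handle:
            async for chunk in response.content.iter_chunked(1 << 20):
                handle.write(chunk)


async def fetch_all(datasets):
    """Download every dataset concurrently instead of one at a time."""
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, url, path) for url, path in datasets))


# Example: asyncio.run(fetch_all([("https://example.com/dataset.xz", "dataset.xz")]))
```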

My apologies, I didn't mean to sound rude. 😭

By no means… I disagree with the arguments, not with your manners. I'm sorry too ; )

Hahaha… so let's go for a few atomic issues: refactoring downloads using asyncio, implementing checksums (not sure how to do it on the Spaces side, but anyway), and migrating to Dask ; )