Refactor data sync part of data release

int-brain-lab / iblalyx

MIT License

Refactor data sync part of data release #48

Closed juhuntenburg closed 5 months ago

juhuntenburg commented 6 months ago

https://github.com/int-brain-lab/iblalyx/blob/main/releases/03_copy_data.sh

This takes too long. Also, the aggregates sync doesn't seem to run, and it's unclear why.

juhuntenburg commented 5 months ago

Aggregates didn't run because the previous sync had warnings. I fixed those and made a note in the recipe to look out for that.

juhuntenburg commented 5 months ago

When no new datasets need syncing, the AWS sync is pretty OK. What takes very long is the symlink creation. Check if we can do something about that.

2024-01-03 17:24:06 Creating symlinks
creating symlinks 20000/70338
creating symlinks 40000/70338
creating symlinks 60000/70338
2024-01-03 20:25:25 Syncing to public S3 bucket data/
2024-01-03 21:07:54 Syncing to public S3 bucket aggregates/
2024-01-03 21:08:07 Finished
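To put numbers on the log above, here is a small sketch that computes how long each stage took from its timestamps. The timestamps and stage names are copied from the log; the `stage_durations` helper is a hypothetical name used only for illustration, and it assumes each stage runs until the next timestamped line.

```python
from datetime import datetime

# Timestamped lines copied from the log above (progress lines omitted).
log = [
    ("2024-01-03 17:24:06", "Creating symlinks"),
    ("2024-01-03 20:25:25", "Syncing to public S3 bucket data/"),
    ("2024-01-03 21:07:54", "Syncing to public S3 bucket aggregates/"),
    ("2024-01-03 21:08:07", "Finished"),
]


def stage_durations(entries):
    """Return (stage, duration) pairs, assuming a stage lasts until the next entry."""
    times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _ in entries]
    return [(entries[i][1], times[i + 1] - times[i]) for i in range(len(entries) - 1)]


for stage, duration in stage_durations(log):
    print(f"{stage}: {duration}")
```

Under that assumption, symlink creation accounts for 3:01:19 of the run, the data/ sync for 0:42:29, and the aggregates/ sync for 0:00:13.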

juhuntenburg commented 5 months ago

Tried to optimize the symlink part, but for now it doesn't seem like we can save a lot of time, as we need to loop through all datasets to create the paths. I don't see much of a way forward other than only creating symlinks for newly released datasets, but I kind of like that we check everything and make sure it's clean every release. Will leave it like this for now.
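For reference, one possible middle ground between "recreate everything" and "only new datasets" is to still visit every dataset but only touch the filesystem when a link is missing or points to the wrong place. The sketch below is not the code in the release script. `ensure_symlink` and `sync_symlinks` are hypothetical helpers, and it assumes the (link, target) path pairs have already been computed. Whether it saves any time depends on whether the cost is in writing links or in building the paths, which the log does not show.

```python
import os
from pathlib import Path


def ensure_symlink(link: Path, target: Path) -> bool:
    """Make `link` point at `target`; return True only if something was written."""
    if link.is_symlink():
        if os.readlink(link) == str(target):
            return False  # already correct, nothing to do
        link.unlink()  # stale link pointing somewhere else
    elif link.exists():
        # A real file or directory is in the way: refuse to overwrite it.
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(target)
    return True


def sync_symlinks(pairs):
    """Check every (link, target) pair; return how many links were created or fixed."""
    return sum(ensure_symlink(Path(link), Path(target)) for link, target in pairs)
```

Every pair is still checked on each release, so stale or wrong links get repaired, but a run with no new datasets does no writes.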

juhuntenburg commented 5 months ago

commit 6ec6c83