steffenfritz / FileTrove

FileTrove indexes files and creates metadata from them.
https://filetrove.fritz.wtf
GNU Affero General Public License v3.0

[CHANGE] Investigate ways of optimizing the NSRL database download #46

Closed ross-spencer closed 3 months ago

ross-spencer commented 3 months ago

Is your feature request related to a problem? Please describe.

The NSRL database is currently over 4 GB and takes over an hour to download. I have investigated using a WaitGroup to download chunks in multiple goroutines here: https://github.com/ross-spencer/FileTrove/pull/1. This takes between 15 and 30 minutes off the download. That said, it can still take up to an hour to get this file.
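
For readers unfamiliar with the approach, here is a minimal sketch of a chunked download using a WaitGroup and HTTP Range requests. It is not the code from the pull request. The names `downloadChunks`, `fetchRange` and `demo` are hypothetical, and the sketch assumes the server reports a Content-Length and honours Range requests. It also assumes Go 1.20 or later for `io.NewOffsetWriter`. The demo serves a small in-memory payload from a local test server instead of the real database.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"strings"
	"sync"
	"time"
)

// fetchRange requests bytes start..end (inclusive) and writes them to out at the same offset.
func fetchRange(url string, out *os.File, start, end int64) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("range request not honoured: %s", resp.Status)
	}
	_, err = io.Copy(io.NewOffsetWriter(out, start), resp.Body)
	return err
}

// downloadChunks (hypothetical helper) fetches url in up to `parts` ranges concurrently.
func downloadChunks(url, dst string, parts int) error {
	head, err := http.Head(url)
	if err != nil {
		return err
	}
	head.Body.Close()
	size := head.ContentLength
	if size <= 0 || parts < 1 {
		return fmt.Errorf("need a known content length and parts >= 1")
	}
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	chunk := (size + int64(parts) - 1) / int64(parts) // ceiling division
	errs := make([]error, parts)                      // one slot per goroutine, no locking needed
	var wg sync.WaitGroup
	for i := 0; i < parts; i++ {
		start := int64(i) * chunk
		if start >= size {
			break
		}
		end := start + chunk - 1
		if end >= size {
			end = size - 1
		}
		wg.Add(1)
		go func(i int, start, end int64) {
			defer wg.Done()
			errs[i] = fetchRange(url, out, start, end)
		}(i, start, end)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// demo serves a small payload locally, downloads it in 4 chunks and compares the result.
func demo() (bool, error) {
	payload := strings.Repeat("nsrl", 1000)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "nsrl.db", time.Time{}, strings.NewReader(payload))
	}))
	defer srv.Close()

	dst := filepath.Join(os.TempDir(), "nsrl-chunk-demo.db")
	defer os.Remove(dst)
	if err := downloadChunks(srv.URL, dst, 4); err != nil {
		return false, err
	}
	got, err := os.ReadFile(dst)
	return err == nil && string(got) == payload, err
}

func main() {
	ok, err := demo()
	fmt.Println("download intact:", ok, "error:", err)
}
```

Each goroutine writes to its own offset in the output file, so the chunks never overlap. Any speed-up depends on the server and network, so the timings above would need to be measured against the real host.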

Describe the solution you'd like

One option is to consider merging https://github.com/ross-spencer/FileTrove/pull/1.

Another is to compress the Bolt database, as it compresses fairly efficiently:

4.0G    nsrl.db
1.3G    nsrl.tar.xz

This would also reduce download times, but would require a client-side decompression function. It might still be complemented by other download options such as https://github.com/melbahja/got.

Describe alternatives you've considered

I also considered enabling gzip compression in nginx for application/octet-stream, but this would prevent chunking (I believe). It may simply be quicker to host a compressed file and have the app extract it.

steffenfritz commented 3 months ago

Thanks for the pull request!

Thinking about your proposed ways to optimize the download, I prefer the compressed file, as it reduces the size on the wire.

  1. Compression takes time only when creating a new version from RDS, so there is no impact for users.
  2. Decompression takes time once, after the download with ftrove, but it was more than 4 times faster than the download.

So, if I'm not missing something, gzip would be suitable:

1,5G  1 Apr 15:22 nsrl.db.gz
4,0G  1 Apr 15:22 nsrl.db
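
The client-side extraction step could look roughly like the sketch below, which uses only Go's standard `compress/gzip` package. This is an illustration, not FileTrove's implementation. The names `gunzipFile` and `roundTrip` are hypothetical, and the demo uses a small throwaway payload in a temporary directory instead of the real database.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// gunzipFile (hypothetical helper) streams the gzip file at src into dst,
// so the whole database never has to fit in memory.
func gunzipFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	zr, err := gzip.NewReader(in)
	if err != nil {
		return err
	}
	defer zr.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, zr); err != nil {
		out.Close()
		return err
	}
	// Report close errors too, since they can signal an incomplete write.
	return out.Close()
}

// roundTrip writes data as a gzip file, extracts it with gunzipFile and
// reports whether the extracted bytes match the input.
func roundTrip(data []byte) (bool, error) {
	dir, err := os.MkdirTemp("", "nsrl-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "nsrl.db.gz")
	dst := filepath.Join(dir, "nsrl.db")

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return false, err
	}
	if err := zw.Close(); err != nil {
		return false, err
	}
	if err := os.WriteFile(src, buf.Bytes(), 0o600); err != nil {
		return false, err
	}

	if err := gunzipFile(src, dst); err != nil {
		return false, err
	}
	got, err := os.ReadFile(dst)
	if err != nil {
		return false, err
	}
	return bytes.Equal(got, data), nil
}

func main() {
	ok, err := roundTrip(bytes.Repeat([]byte("nsrl"), 1000))
	fmt.Println("round trip intact:", ok, "error:", err)
}
```

One reason gzip fits here is that the decoder ships with the Go standard library, whereas the xz format mentioned earlier would need a third-party package.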

ross-spencer commented 3 months ago

That seems like a pretty good compression rate!