WIP: Optimize catalog queries

Erotemic commented 3 years ago

This builds on the reformatting done in #57, but this time with functionality changes.

I noticed that the simple act of querying the csv files was O(N), and took about 1 minute per query. This was causing me a huge bottleneck because I'm downloading a bunch of LC and S2 data, so often I run queries that return no results. But each of those queries takes over a minute!

To mitigate this issue, I created an sqlite3 cache of the CSV file. When a query happens, it checks whether a corresponding sqlite file exists and whether its creation timestamp is later than the CSV file's. If it does not exist, or the CSV was updated, the sqlite file is deleted and rebuilt. This does 1 pass through the CSV and takes about a minute, the same as one query in the previous code.
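For reference, the staleness check and rebuild described above can be sketched roughly as follows. This is a minimal illustration, not the exact code in this PR: the function name `ensure_sqlite_cache`, the table name `catalog`, and storing every column as TEXT are assumptions, and it compares modification times via `os.path.getmtime` rather than creation times.

```python
import csv
import os
import sqlite3


def ensure_sqlite_cache(csv_path, sqlite_path, table='catalog'):
    """Rebuild the SQLite cache if it is missing or older than the CSV.

    Hypothetical helper: the names and the all-TEXT schema are assumptions.
    """
    is_stale = (
        not os.path.exists(sqlite_path)
        or os.path.getmtime(sqlite_path) < os.path.getmtime(csv_path)
    )
    if is_stale:
        # Delete any outdated cache before rebuilding it from scratch
        if os.path.exists(sqlite_path):
            os.remove(sqlite_path)
        conn = sqlite3.connect(sqlite_path)
        try:
            with open(csv_path, newline='') as file:
                reader = csv.reader(file)
                header = next(reader)
                columns = ', '.join('"{}" TEXT'.format(c) for c in header)
                placeholders = ', '.join('?' * len(header))
                conn.execute('CREATE TABLE {} ({})'.format(table, columns))
                # Single pass over the remaining CSV rows
                conn.executemany(
                    'INSERT INTO {} VALUES ({})'.format(table, placeholders),
                    reader,
                )
            conn.commit()
        finally:
            conn.close()
    return sqlite_path
```

Calling the function a second time with an unchanged CSV skips the rebuild, so only the first query pays the one-minute cost.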

The benefit is that if you do more than one query, and the sqlite file does not need to be rebuilt, the queries are an order of magnitude faster. Each one takes about 1 second on my machine.

This code is currently pretty dirty. But if #57 is accepted, I'll rebase and clean this up.

Erotemic commented 3 years ago

Note, I also added a CI script, but it requires approval to run. But I think the tests might be too heavy duty atm to run on CI efficiently.

Erotemic commented 3 years ago

Ok, this should be ready. The tests are running and passing. Here is a list of things I did:

- I normalized strings so triple double-quotes are used for docs and single-quotes are used for code. (I can switch it to be the other way if you'd like; I wasn't sure what style you prefer, but I wanted to make it consistent across the repo.)
- I made the SQLite3 caching step 1.5x faster by replacing the csv module with some manual code.
- I used the ubelt.download utility instead of the one in utils to simplify the logic.
- I tweaked some of the progress bars.
- I made a default "fels" cache dir in a standard location for the default download location (i.e. ~/.cache/fels, $XDG_DATA_HOME/fels, %APPDATA%/fels, or ~/Library/Caches/fels).
- I removed print / debugging text.
- I added some docs.
- I expanded the setup.py and requirements.txt files.
- I added CI.
- I added a "fels" main CLI (if you pip install -e . in the repo you get a fels CLI tool now).

There is a small outstanding issue that I wasn't able to figure out how to resolve. SQLite3 seems to make date queries inclusive when you use BETWEEN date(?) AND date(?), and I wasn't sure how to default to exclusive to maintain functional equivalence. You can still use --use-csv to query the tile database the old O(N) way (but it's minutes slower per query; the new O(log(N)) sql query is very quick).
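One possible way to get exclusive bounds is to replace BETWEEN with explicit strict comparisons. The sketch below only illustrates the difference; the table `catalog`, the columns `scene_id` and `sensing_time`, and the helper `query_dates` are made up for the example and are not the schema used in this PR. It also assumes the dates are stored as ISO-8601 strings and that "exclusive" means both ends are excluded.

```python
import sqlite3

# Toy in-memory table standing in for the real catalog (hypothetical schema)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE catalog (scene_id TEXT, sensing_time TEXT)')
conn.executemany('INSERT INTO catalog VALUES (?, ?)', [
    ('a', '2020-01-01'),
    ('b', '2020-01-02'),
    ('c', '2020-01-03'),
])


def query_dates(conn, start, end, inclusive=True):
    """Return scene ids between two dates, with or without the endpoints."""
    if inclusive:
        # BETWEEN includes both endpoints
        where = 'date(sensing_time) BETWEEN date(?) AND date(?)'
    else:
        # Strict comparisons exclude both endpoints
        where = 'date(sensing_time) > date(?) AND date(sensing_time) < date(?)'
    sql = 'SELECT scene_id FROM catalog WHERE {} ORDER BY sensing_time'.format(where)
    return [row[0] for row in conn.execute(sql, (start, end))]


print(query_dates(conn, '2020-01-01', '2020-01-03', inclusive=True))   # ['a', 'b', 'c']
print(query_dates(conn, '2020-01-01', '2020-01-03', inclusive=False))  # ['b']
```

If the CSV path turns out to exclude only one end, the same idea applies with >= or <= on the side that should stay inclusive.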

Other than that I can change anything you'd like, but IMO this PR is looking pretty good.

vascobnunes / fetchLandsatSentinelFromGoogleCloud

WIP: Optimize catalog queries #58