vascobnunes / fetchLandsatSentinelFromGoogleCloud

Find and download Landsat and Sentinel-2 data from the public Google Cloud
https://vascobnunes.github.io/fetchLandsatSentinelFromGoogleCloud/
MIT License
52 stars 23 forks source link

WIP: Optimize catalog queries #58

Closed Erotemic closed 3 years ago

Erotemic commented 3 years ago

This build on the reformatting done in #57 but this time with functionality changes.

I noticed that the simple act of querying the csv files was O(N), and took about 1 minute per query. This was causing me a huge bottleneck because I'm downloading a bunch of LC and S2 data, so often I run queries that return no results. But each of those queries takes over a minute!

To mitigate this issue, I created an sqlite3 cache of the CSV file. When a query happens it checks to see if a corresponding sqlite file exists and if its creation timestamp is after the csv file. If it does not exist, or the CSV was updated, the sqlite file deleted and rebuilt. This does 1 pass through the CSV and takes about a minute. The same as one query in the previous code.

The benefit is now if you do more than one query, and the sqlite file does not need to rebuilt the queries are an order of magnitude faster. It takes about 1 second on my machine.

This code is currently pretty dirty. But if #57 is accepted, I'll rebase and clean this up.

Erotemic commented 3 years ago

Note, I also added a CI script, but it requires approval to run. But I think the tests might be too heavy duty atm to run on CI efficiently.

Erotemic commented 3 years ago

Ok, this should be ready. The tests are running and passing. Here is a list of things I did:

There is a small outstanding issue, that I wasn't able to figure out how to resolve. SQLite3 seems to make data queries inclusive when you use BETWEEN date(?) AND date(?) and I wasn't sure how to default to exclusive to maintain functional equivalence. You can still use --use-csv to query the tile database the old O(N) way (but it's minutes slower per query, the new O(log(N)) sql query is very quick).

Other than that I can change anything you'd like, but IMO this PR is looking pretty good.