rhayes777 / PyAutoFit

PyAutoFit: Classy Probabilistic Programming
https://pyautofit.readthedocs.io/
MIT License

Database Use Still Suspiciously Slow #927

Closed Jammy2211 closed 9 months ago

Jammy2211 commented 9 months ago

I have uploaded a database from lensing analysis here:

https://drive.google.com/file/d/1AYB6j1jBjEgHNa0g1fvVInY0Accy5dk4/view?usp=sharing

I have tried to do a few simple operations:

"""
Database: Samples
=================

In the script `autolens_workspace/*/advanced/database/start_here.py` we performed a fit to 3
datasets and stored the results in a sqlite database.

In this example, we'll load results from this database and show how to manipulate the non-linear search's samples,
for example to inspect the maximum log likelihood models or get errors on parameters.

__Samples via Result__

A fraction of this example repeats the API for manipulating samples given in the
`autogalaxy_workspace/*/results/examples/samples.py` example.

This is done so users can directly copy and paste Python code which loads results from the database and manipulates
the samples.
"""
# %matplotlib inline
# from pyprojroot import here
# workspace_path = str(here())
# %cd $workspace_path
# print(f"Working Directory has been set to `{workspace_path}`")

from os import path
import autofit as af
import autolens.plot as aplt

"""
__Files__

In the `start_here.py` script, we discussed the `files` that are output by the non-linear search. The 
following files correspond to the information loaded when loading the non-linear search samples from the database:

 - `model`: The `model` defined above and used in the model-fit (`model.json`).
 - `samples`: The non-linear search samples (`samples.csv`).
 - `samples_info`: Additional information about the samples (`samples_info.json`).
 - `samples_summary`: A summary of key results of the samples (`samples_summary.json`).
 - `covariance`: The inferred covariance matrix (`covariance.csv`).

The `samples` and `samples_summary` results contain a lot of repeated information. The `samples` result contains
the full non-linear search samples, for example every parameter sample and its log likelihood. The `samples_summary`
contains a summary of the results, for example the maximum log likelihood model and error estimates on parameters
at 1 and 3 sigma confidence.

Accessing results via the `samples_summary` is much faster, because it does not re-perform calculations using the full 
list of samples. Therefore, if the result you want is accessible via the `samples_summary` you should use it,
but if not you can revert to the `samples`.

__Database File__

The results are not contained in the `output` folder after each search completes. Instead, they are
contained in the `database.sqlite` file, which we can load using the `Aggregator`.
"""
database_file = "base.sqlite"
agg = af.Aggregator.from_database(filename=database_file)

agg_no_subhalo = agg.query(agg.search.name == "light[1]_light[lp]")

"""
__Max LH__
"""
ml_instances = [samps.max_log_likelihood() for samps in agg.values("samples")]
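
"""
__Samples Summary__

As noted above, accessing results via the `samples_summary` should be much faster. A hedged sketch of the same
maximum log likelihood lookup via the summary (assuming `samples_summary` is exposed as an aggregator key and
mirrors the `max_log_likelihood()` API of `samples`, which I have not verified against this database):
"""
ml_instances_via_summary = [
    summary.max_log_likelihood() for summary in agg.values("samples_summary")
]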

"""
__Bayesian Evidence__

"""
print("Log Evidences: \n")
print([samps.log_evidence for samps in agg.values("samples")])

However, this took over 20 minutes to run!

I suspect this is because the model is a list-based model with many components (e.g. https://github.com/rhayes777/PyAutoFit/issues/925), as I think run times have been okay for much simpler models.

rhayes777 commented 9 months ago

This is in part due to the number of objects in the database. There are 432259 objects. We create a graph where an object is joined to its parent (e.g. a mass profile to a galaxy) using a foreign key. Normally these joins are made efficient by adding an index on the foreign key and the primary key (e.g. the mass profile's parent id and the galaxy's id).
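
For reference, here is a minimal sketch of how to list which indexes the database file actually defines, using Python's built-in sqlite3 module (assuming only that the file is the `base.sqlite` uploaded above):

import sqlite3

# Open the uploaded database file.
connection = sqlite3.connect("base.sqlite")

# sqlite_master records every index in the database. If the foreign-key columns
# have no entries here, every parent/child join falls back to a full table scan.
rows = connection.execute(
    "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"
).fetchall()

for name, table in rows:
    print(f"index {name!r} on table {table!r}")

connection.close()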

When using Postgres, SQLAlchemy creates these indexes implicitly. However, in the database file you sent these indexes are not defined. It looks like they can be force-created by changing the SQLAlchemy code that creates the column.
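
As a sketch of what that change might look like (the class and column names here are assumptions inferred from the table names in the commands below, not the actual PyAutoFit model code), SQLAlchemy's `index=True` flag makes it emit a `CREATE INDEX` when the table is created:

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Object(Base):
    __tablename__ = "object"

    id = Column(Integer, primary_key=True)

    # index=True tells SQLAlchemy to emit CREATE INDEX for this foreign key when
    # the tables are created, so parent/child joins can use an index lookup.
    parent_id = Column(Integer, ForeignKey("object.id"), index=True)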

In the meantime, you can test what difference this makes by creating the indices yourself:

sqlite3 base.sqlite

-- Index the foreign key used to join child objects to their parents.
CREATE INDEX idx_parent_id ON object(parent_id);
-- Index the id columns used on the other side of those joins.
CREATE INDEX idx_object_id ON object(id);
CREATE INDEX idx_value_id ON value(id);
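
The same indices can also be created from Python before loading the aggregator, which may be easier to drop into the script above (a sketch using the standard library; `IF NOT EXISTS` makes it safe to run repeatedly):

import sqlite3

connection = sqlite3.connect("base.sqlite")

# The same indices as the sqlite3 shell commands above; IF NOT EXISTS makes
# re-running this a no-op once the indices exist.
connection.executescript(
    """
    CREATE INDEX IF NOT EXISTS idx_parent_id ON object(parent_id);
    CREATE INDEX IF NOT EXISTS idx_object_id ON object(id);
    CREATE INDEX IF NOT EXISTS idx_value_id ON value(id);
    """
)

connection.close()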

I didn't do a before test, but with these indices in place the script finishes in a reasonable length of time (albeit with an unrelated error).