Regarding Catalog Optimal Output File Format

JPGlaser commented 9 years ago

Hey Guys,

So I am noticing that as I generate larger datasets from larger FoVs, CatSim's write_catalog command seems to be running rather slow over large sets of observations. I have begun creating a catalog writer that stores each observation's instance catalog into a table in a SQLite3 database. However, the thought occurred to me that I should probably ask to see if such a command already existed within the InstanceCatalog class. Is this the case?

~ Joe

rbiswas4 commented 9 years ago

Hi Joe,

So I am noticing that as I generate larger datasets from larger FoVs, CatSim's write_catalog command seems to be running rather slow over large sets of observations.

yes, this is something I run into as well, and I think is expected to be there. So it is helpful to do a number smaller FoVs covering the desired FoV.

I have begun creating a catalog writer that stores each observation's instance catalog into a table in a SQLite3 database. However, the thought occurred to me that I should probably ask to see if such a command already existed within the InstanceCatalog class. Is this the case?

Do you mean you are trying to do just that by storing your instance catalogs for smaller fields of view in a sqlite3 database statement for convenience of using them? I don't think there is such a function, but @danielsf would know better.

danielsf commented 9 years ago

Hi all,

1) The photometry mixins are slow. This is something we are going to try to fix in the next month or so.

2) There is not currently a method to output InstanceCatalogs as sqlite databases. @rbiswas4, do I remember correctly that you put something together for your SNIa catalogs? Should we try to make it an "official" part of InstanceCatalog?

-- Scott

JPGlaser commented 9 years ago

Hey @rbiswas4,

This might give you a better idea of what I am up to: https://github.com/JPGlaser/LSST_CatGen/blob/master/Tests/genLSST_AGNData.ipynb

Basically, I want to automate the process of generating instance catalogs for observed AGN in a selected set of unique pointings over the 10 year period of LSST. The end result is about 8338 observations for 9 unique pointings, which contain a total of 624 AGNs.

It takes about 10 hours to create all of these files. I suspect this is due to the fact that opening the files and writing them via python is the bottleneck, which is why I was curious to see if there were any quicker methods out there. Considering I will be increasing the FoV to at least a radius of 1 deg from 0.05 deg very soon, longterm storage of both observation meta-data and the instance catalogs is something of a concern.

~ Joe

EDIT: Thanks for the comment @danielsf. So the likely bottleneck lies in the photometry mixins?

rbiswas4 commented 9 years ago

@danielsf

2) There is not currently a method to output InstanceCatalogs as sqlite databases. @rbiswas4, do I remember correctly that you put something together for your SNIa catalogs? Should we try to make it an "official" part of InstanceCatalog?

I had forgotten that :) . It is so unlikely that I would be writing to a database that I did not think I might have done it.

https://github.com/rbiswas4/SNIacatalogs/blob/cadence_notebook/examples/FindLCsFromOpSim.ipynb

rbiswas4 commented 9 years ago

I don't know how much of the bottleneck is in 'python'. Even though the time is taken by the write method of the instance catalog class, this is really where the database is being queried. And increasing the field of view simply means querying through more rows.

danielsf commented 9 years ago

I agree with Rahul. I just did a quick test with a field of view of radius 0.05 degrees (which took 4 seconds) and a field of view of radius 0.1 degrees (which took 60 seconds). The amount of time spent in the mixins quadrupled from 1 second to 4 seconds. The amount of time querying the database when from 1 second to 57 seconds. The galaxies in the database are pretty dense, so you're pulling in a lot of them for any field of view.

There might be a way to speed up the database query by cutting out objects that do not have AGN filenames specified at the SQL level. It will require some patches to the code, though. Give me a day or two.

danielsf commented 9 years ago

Actually, I was wrong. You might be able to speed it up now. Try changing your write_catalog statement so that it says:

yourCatalog.write_catalog('yourCatalogName.txt', constraint='sedname_agn is not NULL')

The constraint kwarg adds an extra constraint to the SQL query. sedname_agn is what the agn SED name is called on fatboy. Applying the "it must have an AGN attached to it" cut at the SQL level, rather than at the python level with the cannot_be_null variable in InstanceCatalog, ought to be faster.

Let me know if that does not work.

JPGlaser commented 9 years ago

Thanks @rbiswas4, I will look into your ipynb for a possible way to store the set of instance catalogs in a database for long-term storage.

Thanks to you as well @danielsf. I will try that out with the next run later today to see if it does in fact speed stuff up. Also, is there a way run a loop with write_catalog inside it in parallel? Might be a good thing for us to look into to speed up writing large groups of instance catalogs for independent observations.

~ Joe

JPGlaser commented 9 years ago

Hey Scott @danielsf, Was constraint a recent addition to the lsst_sims package, because I am currently returning this error:

SQLRules = '(sedname_agn IS NOT NULL) AND (agn_sfu < 24.0)' variableAgn.write_catalog(WorkingDir+'/'+CatFileName, write_mode='a', constraint=SQLRules)

TypeError: write_catalog() got an unexpected keyword argument 'constraint'

~ Joe

danielsf commented 9 years ago

sorry. My mistake. 'constraint' is a kwarg for InstanceCatalog's init. So you need to pass it in when you instantiate the catalog.

JPGlaser commented 9 years ago

Yeah, I just found it. Also it seems running that the catalog with: SQLRules = '(sedname_agn IS NOT NULL) AND (magnorm_agn < 24.0)'

Makes a very good cut and reduces the time dramatically. On a 0.1 radius aperture, one catalog of 15 objects is created in 0.38 seconds with the full SQLRules. Getting rid of the magnitude cut, means that the catalog of 272 objects is created in 1.74 seconds. Which means there is a time-saver of 0.0189 seconds per object written. Given that I am writing several thousand files, this may be helpful in when we increase the FoV.

~ Joe

JPGlaser commented 9 years ago

Just ran the code with the SQLRules value described above for a FoV of 1.75 degrees in radius (aka: LSST's full FoV). It took 34.082017 seconds to generate one observation of 5589 AGNs with mags under 24.0. So, it should take 3.291 days running on this iMac to generate a dataset of 9 unique pointings cover a 10x10 square degree swatch of the night-sky for the entire 10-year span of observations.

This is a significant improvement, so it seems that SQL querying is likely the best long-term way of dealing with large data-sets. Maybe we should look into storing the catalogs in a similar manner. My current idea is to make a database of object classes, within which there are time-series tables for each object observed. Each row then is an observation of an object which includes observation-meta data as well as observed magnitudes, errors, etc.

This sounds reasonable, yes?

~ Joe

connolly commented 9 years ago

Have you looked at what the sql is doing on the msqlserver side. If I understand your numbers a 4-fold increase in area is a 60-fold increase in database access time. This seems odd as I wouldnt have expected the database to scale much worse than the raw IO. Is it the query or the manipulation of the galaxy tiling (i.e.if you took a pointing within a single tile of the galaxies is it any faster)

cheers Andy

On Tue, Jul 21, 2015 at 12:43 AM, Joseph Glaser notifications@github.com wrote:

Just ran the code with the SQLRules value described above for a FoV of 1.75 degrees in radius (aka: LSST's full FoV). It took 34.082017 seconds to generate one observation of 5589 AGNs with mags under 24.0. So, it should take 3.291 days running on this iMac to generate a dataset of 9 unique pointings cover a 10x10 square degree swatch of the night-sky for the entire 10-year span of observations.

This is a significant improvement, so it seems that SQL querying is likely the best long-term way of dealing with large data-sets. Maybe we should look into storing the catalogs in a similar manner. My current idea is to make a database of object classes, within which there are time-series tables for each object observed. Each row then is an observation of an object which includes observation-meta data as well as observed magnitudes, errors, etc.

This sounds reasonable, yes?

~ Joe

— Reply to this email directly or view it on GitHub https://github.com/lsst/sims_catUtils/issues/16#issuecomment-123095278.

danielsf commented 9 years ago

I ran the test again and, while the bigger field of view was still slower, it was not 60 times slower. Sometimes things are slower when you run them for the first time. I'm not sure if this is python initializing itself or some extra overhead when first connecting to fatboy.

Regardless, it is not as bad as I made it sound.

connolly commented 9 years ago

That is probably because the database is caching the outputs.

On Tue, Jul 21, 2015 at 5:12 PM, danielsf notifications@github.com wrote:

I ran the test again and, while the bigger field of view was still slower, it was not 60 times slower. Sometimes things are slower when you run them for the first time. I'm not sure if this is python initializing itself or some extra overhead when first connecting to fatboy.

Regardless, it is not as bad as I made it sound.

— Reply to this email directly or view it on GitHub https://github.com/lsst/sims_catUtils/issues/16#issuecomment-123388023.

lsst-sims / legacy_sims_catUtils

Regarding Catalog Optimal Output File Format #16