transientskp / tkp

A transients-discovery pipeline for astronomical image-based surveys
http://docs.transientskp.org/
BSD 2-Clause "Simplified" License

Trap NG design discussion #567

Open gijzelaerr opened 5 years ago

gijzelaerr commented 5 years ago

Originally reported here:

https://github.com/transientskp/trap-ng/issues/1#issuecomment-538239319

Hi @gijzelaerr I solved myself, but thank you for the quick response! Really appreciated.

you're welcome.

I'm in the process of taking over the tkp pipeline and banana webapp, and making them more usable for the astrophysics groups here at the uni of Sydney.

That sounds amazing. Radio only or also other freqs? Do you want to fork it or contribute to the code? Are you in contact with any other astronomers who have been involved?

Please understand that I'm not deep into the details of the whole trap process, but I have a general understanding of the whole pipeline and the steps involved (e.g. source extraction, associations (one-to-one, etc.), forced extraction, etc.). Also consider that we are dealing with very big images (min size of 700 MB, approx 30k x 30k pixels).

Take into account that I am familiar with Dask parallelisation (I built a pipeline before) and with web dev in Django (front end, back end, ORM, etc.).

Great. That is what is needed.

so here are the questions:

1 - it seems that the tkp database schema (ORM) has been designed with a Django app in mind, is that right?

Yes and no. The Django app came after, and is just made to visualise the results.

2 - wouldn't it be better to separate the pipeline from the web app?

It is already.

3 - why not run the pipeline and store the end result in a flat table? What about a pandas-dataframe-friendly storage solution (e.g. parquet files)?

If you store everything in a flat table you will get a lot of data duplication.
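To illustrate the duplication point, here is a minimal sketch (not TraP's actual schema; table and column names are made up) contrasting a flat table, which repeats per-source metadata on every measurement row, with a normalised split that stores it once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- flat: source metadata (ra, dec, ...) repeated for every epoch
    CREATE TABLE flat (source_id INT, ra REAL, dec REAL, epoch INT, flux REAL);

    -- normalised: metadata once, measurements reference it by key
    CREATE TABLE source (id INTEGER PRIMARY KEY, ra REAL, dec REAL);
    CREATE TABLE flux (source_id INT REFERENCES source(id), epoch INT, flux REAL);
""")

# one source observed over three epochs
for epoch in range(3):
    conn.execute("INSERT INTO flat VALUES (1, 10.5, -45.2, ?, 0.1)", (epoch,))
conn.execute("INSERT INTO source VALUES (1, 10.5, -45.2)")
for epoch in range(3):
    conn.execute("INSERT INTO flux VALUES (1, ?, 0.1)", (epoch,))

# the flat table stores the position three times, the normalised one once
n_flat = conn.execute("SELECT COUNT(*) FROM flat WHERE ra = 10.5").fetchone()[0]
n_norm = conn.execute("SELECT COUNT(*) FROM source WHERE ra = 10.5").fetchone()[0]
print(n_flat, n_norm)  # 3 1
```

The same trade-off applies to parquet: a single wide file duplicates metadata per row, though columnar compression softens the cost.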

4 - how are the user permissions on the database handled? TraP uses a different database per project, why is that? Why not keep everything in one database and let Django administer the user permissions? (e.g. Django has a user table with per-project permissions, set in the admin page)

This architecture is the result of evolution. I think the biggest reason is speed. A database becomes extremely slow after processing 10K images or so, so you need to make a new database. For this reason and others, I'm of the opinion a new version of trap should not use the database for source association but keep an in-memory model for this, and only keep the final light curve in the permanent storage (database).
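The in-memory model described above could look roughly like this (a toy sketch, not TraP's implementation: a naive nearest-neighbour match with a made-up association radius, where only `lightcurves` would be written to permanent storage at the end of the run):

```python
import math

ASSOC_RADIUS_DEG = 0.01  # hypothetical association radius

catalog = []      # running in-memory source list: (ra, dec)
lightcurves = {}  # source index -> list of (epoch, flux)

def associate(ra, dec, epoch, flux):
    """Match one extracted source against the in-memory catalog,
    appending to an existing light curve or starting a new one."""
    for i, (cra, cdec) in enumerate(catalog):
        dra = (ra - cra) * math.cos(math.radians(cdec))
        if math.hypot(dra, dec - cdec) < ASSOC_RADIUS_DEG:
            lightcurves[i].append((epoch, flux))
            return i
    catalog.append((ra, dec))
    lightcurves[len(catalog) - 1] = [(epoch, flux)]
    return len(catalog) - 1

# two epochs of the "same" source, plus one new source
associate(10.000, 20.000, epoch=0, flux=1.2)
associate(10.001, 20.000, epoch=1, flux=1.4)
associate(50.000, -30.000, epoch=1, flux=0.3)
print(len(catalog), len(lightcurves[0]))  # 2 2
```

A real implementation would need a spatial index (e.g. a k-d tree) rather than a linear scan, but the point is that nothing touches the database until the light curves are final.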

5 - can the source association mechanism be simplified, based on the trap-ng (next gen trap) Jupyter notebook using Astropy?

Maybe. I've been playing around with that during the last week I worked for the uni of Amsterdam. That source finder is optimised for optical data though, while pyse is oriented towards radio data.

6 - do you have any suggestions about the pipeline architecture for a re-implementation of the pipeline for batch processing of images?

Did you read my TODO document?

https://github.com/transientskp/tkp/blob/master/TODO.md

In short what I think needs to happen is:

If I was you I would start over with a fresh Python 3 project and cherry-pick the elements that look usable. There is a lot of evolutionary code in the TKP repo that is not required anymore.

gijzelaerr commented 5 years ago

Note that @bartscheers probably disagrees and still thinks source association and lightcurve building should happen in the database, since the skymodel doesn't fit in memory. I disagree, since doing everything in the database is the single biggest bottleneck and slowdown in the whole design.

How many sources do you expect to extract from your images? How many frequency bands will you process? What is your integration interval?

AntoniaR commented 5 years ago

The Amsterdam Transient team are currently planning the next stable release and are working on our long term plans. Indeed one open issue is speeding up source association for real time systems and we are exploring two options:

  1. Migrating to source association in Python
  2. Returning to supporting MonetDB, which has been substantially sped up for BlackGEM and MeerLICHT and would have multi-wavelength search benefits.

As I advised Adam, please do get in touch with the Amsterdam team so we can coordinate our efforts and ensure we are not duplicating work. Thanks!

gijzelaerr commented 5 years ago

One of the issues with TraP and banana was also that setting up, configuring and maintaining a database adds a lot of complexity to the workflow for an astronomer without a computer-science background. Since the databases are mostly single-user, a client-server model doesn't make much sense. I would investigate a file-based database system.

A fork from MonetDB has this architecture: https://github.com/cwida/duckdb

but I don't know how stable it is.

If the architecture is right, a lot of speedups can be gained from parallelisation and pipeline optimisation. The database platform choice then becomes less important and you might even consider some more proven technology like pytables:

http://www.pytables.org/
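To make the file-based idea concrete, here is a sketch using stdlib `sqlite3` as a stand-in for the embedded model that DuckDB (and pytables) follow; `duckdb.connect()` looks almost identical. The point is that there is no server to install, configure, or administer: the "database" is just a file next to the pipeline run.

```python
import os
import sqlite3
import tempfile

dbfile = os.path.join(tempfile.mkdtemp(), "trap_run.db")

# no server process: opening the file IS the database
conn = sqlite3.connect(dbfile)  # duckdb equivalent: duckdb.connect(dbfile)
conn.execute("CREATE TABLE lightcurve (source_id INT, epoch INT, flux REAL)")
conn.execute("INSERT INTO lightcurve VALUES (1, 0, 0.12)")
conn.commit()
conn.close()

# reopening later (or from another script) just needs the path
rows = sqlite3.connect(dbfile).execute("SELECT * FROM lightcurve").fetchall()
print(rows)  # [(1, 0, 0.12)]
```

An astronomer can copy, archive, or delete a run by handling one file, which is exactly the workflow simplification being argued for here.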

gijzelaerr commented 5 years ago

@timstaley do you have some more fresh post-trap-developer thoughts now you have been away from the project for a couple of years?

AntoniaR commented 5 years ago

While single-user operation is how it has been run in the past, this is not the future design. With ASTRON, we are developing a publicly available database for processed datasets over the next few years, and a fully interactive system by the time of the SKA. Therefore, for future-proofing, we do need to keep the client-server design.

gijzelaerr commented 5 years ago

That sounds very ambitious, and hard to scale and maintain. I would keep things smaller: collect small datasets and integrate them later into bigger databases. That way the project is much more likely to succeed. But hey, I'm not involved with the design anymore! Good luck.

srggrs commented 5 years ago

@gijzelaerr thank you for the quick answer. Here are my answers:

That sounds amazing. Radio only or also other freqs?

Honestly I don't know, I need to check with the astro guys.

Do you want to fork it or contribute to the code?

I'd prefer this to be a group effort, since we could help each other with our respective skills. That would also give the project a much bigger impact.

Are you in contact with any other astronomers who have been involved?

Nope.

Did you read my TODO document?

Yep, and I came to the same conclusion as you:

If I was you I would start over with a fresh Python 3 project and cherry-pick the elements that look usable. There is a lot of evolutionary code in the TKP repo that is not required anymore.

Restructure database setup since it is too complex now.

A complete restructure seems necessary to me, because such a structure does not marry well with that of a web app. Specifically, the web app has to manage multiple databases (created per user or project) and their login details, which is not practical at all. I have a few ideas that might make the marriage between the pipeline and the web app more manageable, including running the pipeline from the app itself.

How many sources do you expect to extract from your images? How many frequency bands will you process? what is your integration interval?

I don't have an answer to that; I need to speak with my astro guys.

Hi @AntoniaR, thank you for reaching out. Here are my thoughts on your post:

Migrating to source association in python

I agree, that would make more sense, and then store the end result (flux curve or whatever).

Return to supporting MonetDB which has been substantially sped up for BlackGEM and MeerLICHT and would have multi wavelength search benefits.

Not sure about this. MonetDB seems to give fast queries due to its column-oriented architecture, but I don't think a database is the future, at least for the computed data. My idea is that projects, jobs and images are all stored in tables in a single database with only the details of the path where they sit (e.g. Project X, /PATH/TO/PROJECTX, description, tags, other metadata), and the actual data (including the image thumbnails/cuts/croppings, possibly compressed, e.g. gz) in a file (e.g. HDF5 or parquet) inside the job folder. Additional tables that track the sources etc. will also be present, with details of the job paths, so the pipeline knows where the actual data reside in order to extract them and run future jobs (not sure if I made myself clear).

Another detail of our images is that the sources are already extracted and stored in the header of the TIFF files, so we want a flag to turn the extraction off for that particular run of the project. I don't see this as a big deal, as it's just an extra feature.

In general I'm of the idea of one database with multiple tables that point to the actual files/images on disk/cloud storage (e.g. Amazon S3), so the whole architecture is scalable and web-app friendly: the web app ORM can be incorporated into the pipeline database and in the future become one fully integrated solution for both pipeline jobs and result analysis.
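The proposed layout above could be sketched like this (names are made up; a small relational database holds only metadata and paths, while the bulky per-job results live as files on disk: parquet/HDF5 in the proposal, JSON here so the example needs only the stdlib):

```python
import json
import sqlite3
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# the heavy data: written by the pipeline into the job folder
job_dir = root / "projectX" / "job_001"
job_dir.mkdir(parents=True)
(job_dir / "sources.json").write_text(json.dumps([{"ra": 10.5, "flux": 0.1}]))

# the metadata database: just names, descriptions, and paths
meta = sqlite3.connect(":memory:")
meta.execute("CREATE TABLE project (name TEXT, path TEXT, description TEXT)")
meta.execute("INSERT INTO project VALUES (?, ?, ?)",
             ("Project X", str(job_dir), "test run"))

# a later job (or the web app) looks up the path and loads the data
path, = meta.execute(
    "SELECT path FROM project WHERE name = 'Project X'").fetchone()
sources = json.loads((Path(path) / "sources.json").read_text())
print(sources)  # [{'ra': 10.5, 'flux': 0.1}]
```

Swapping the local root for an S3 bucket prefix only changes how the payload files are read, not the metadata schema, which is the scalability argument being made.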

Not sure what your thoughts are. Open to discussion. Cheers, Serg

srggrs commented 5 years ago

@AntoniaR btw, if you could send me an email at sergio.pintaldi@sydney.edu.au and put me in contact with the rest of the guys, I'd appreciate it!

srggrs commented 5 years ago

@AntoniaR have you guys moved, or are you moving/planning to move, to Python 3?

timstaley commented 5 years ago

Hi all!

If I had one piece of advice, it would be this: Don't try to build one mega-monolithic project that satisfies all use-cases. You will never even come close.

Instead, attempt to carefully consider and isolate the standalone 'units-of-complexity' that can be turned into libraries for separate testing and re-use. This in turn will force you to consider and design some sensible 'interface standards' for shuffling data around between disparate pieces of code.

From memory, probably the biggest candidate for that here is the 'source-associate via DeRuiter distances' code, since you already have PySE separated out somewhat (although having PySE as a standalone package would be even better). You could also potentially split out the 'lightcurve is constant / transient' routines. But honestly, those blocks may be too large or too small, I can't remember.
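For context, the DeRuiter measure that code computes is essentially the positional offset between two detections weighted by their combined positional uncertainties, so it is dimensionless. A rough sketch (toy values and a simplified flat-sky treatment, not TraP's implementation):

```python
import math

def deruiter(ra1, dec1, sig_ra1, sig_dec1,
             ra2, dec2, sig_ra2, sig_dec2):
    """Dimensionless DeRuiter distance between two detections:
    offset in each coordinate divided by the quadrature sum of the
    positional uncertainties. All angles in degrees."""
    mean_dec = math.radians((dec1 + dec2) / 2.0)
    dra = (ra1 - ra2) * math.cos(mean_dec)   # RA offset on the sky
    ddec = dec1 - dec2
    return math.sqrt(dra**2 / (sig_ra1**2 + sig_ra2**2)
                     + ddec**2 / (sig_dec1**2 + sig_dec2**2))

# two detections about one combined sigma apart: r close to 1,
# so a plausible association under a cut of a few
r = deruiter(10.0000, 20.0, 0.0010, 0.0010,
             10.0015, 20.0, 0.0010, 0.0010)
print(round(r, 2))  # 1.0
```

Isolating exactly this kind of function into its own tested library is the "unit-of-complexity" separation being advocated here.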

In short, my 2p is that you don't need a Trap-NG. You need a bunch of documented, tested libraries that can be arranged together in fairly short order to produce a 'Trap-NG UvA' or 'Trap-NG Sydney' or 'Trap-NG my_phd_peculiarities'.

Unfortunately, making code clean, tested, packaged and reusable is a lot more up-front work than simply throwing it all together to get to your end-goal as fast as possible. It pays for itself in the long run, however. You can also set more achievable goals for a single PhD student / postdoc / staff software engineer, since they don't necessarily have to support the entire mega-project, just their own little library (cf. voevent-parse!).

As ever, there are no easy answers. Science is hard. Software is hard. Scientific software is fractally complex - which is why you should only attempt to tackle it in small, bite-size pieces. Good luck!

gijzelaerr commented 5 years ago

pyse is a separate package already! :)

https://github.com/transientskp/pyse

And yes I agree with Tim, it makes a lot of sense. Thanks Tim!

timstaley commented 5 years ago

Ah right, I had a suspicion that might be the case but memory flipped a bit somewhere ;)

gijzelaerr commented 4 years ago

Meanwhile, MonetDB has launched a new product called MonetDBe-Python!

https://github.com/MonetDBSolutions/MonetDBe-Python

This software moves away from the client-server model and is similar to the previously mentioned DuckDB.

AntoniaR commented 2 months ago

Maybe there is some useful discussion in here for the TraP redesign in R7.