gijzelaerr opened 5 years ago
Note that @bartscheers probably disagrees and still thinks source association and lightcurve building should happen in the database, since the skymodel doesn't fit in memory. I disagree, since doing everything in the database is the single biggest bottleneck and slowdown in the whole design.
How many sources do you expect to extract from your images? How many frequency bands will you process? What is your integration interval?
The Amsterdam Transient team are currently planning the next stable release and are working on our long-term plans. Indeed, one open issue is speeding up source association for real-time systems, and we are exploring two options:
As I advised Adam, please do get in touch with the Amsterdam team so we can coordinate our efforts and ensure we are not duplicating work. Thanks!
One of the issues with TraP and banana was also that setting up, configuring and maintaining a database adds a lot of complexity to the workflow for a non-computer-science astronomer. Since the databases are mostly single-user, a client-server model doesn't make much sense. I would investigate a file-based database system.
A fork from MonetDB has this architecture: https://github.com/cwida/duckdb
but I don't know how stable it is.
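For illustration, this is roughly what the embedded, file-based model looks like with DuckDB's Python bindings: the whole database is a single file and there is no server process to set up or maintain. The table and filename here are made up, just a sketch.

```python
import duckdb

# The entire database lives in one file; no server, no configuration.
con = duckdb.connect("trap.duckdb")  # hypothetical filename
con.execute(
    "CREATE TABLE IF NOT EXISTS lightcurve (src INTEGER, mjd DOUBLE, flux DOUBLE)"
)
con.execute("INSERT INTO lightcurve VALUES (1, 58849.5, 0.012)")
print(con.execute("SELECT * FROM lightcurve").fetchall())
con.close()
```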
If the architecture is right, a lot of speedup can be gained from parallelisation and pipeline optimisation. The database platform choice then becomes less important, and you might even consider some more proven technology like pytables (https://www.pytables.org/).
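As a sketch of the pytables route (with a made-up schema): the same tabular data goes straight into an HDF5 file, with no database engine involved at all.

```python
import tables

# Illustrative schema: one row per flux measurement.
class Measurement(tables.IsDescription):
    src = tables.Int64Col()
    mjd = tables.Float64Col()
    flux = tables.Float64Col()

with tables.open_file("lightcurves.h5", mode="w") as h5:
    table = h5.create_table("/", "measurements", Measurement)
    row = table.row
    row["src"], row["mjd"], row["flux"] = 1, 58849.5, 0.012
    row.append()   # stage the row
    table.flush()  # write it to disk
```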
@timstaley do you have some more fresh post-trap-developer thoughts now you have been away from the project for a couple of years?
While single-user operation is how it has been run in the past, that is not the future design. With ASTRON, we are developing a publicly available database of processed datasets over the next few years, and a fully interactive system by the time of the SKA. Therefore, for future-proofing, we do need to keep the client-server design.
That sounds very ambitious, and hard to scale and maintain. I would keep things smaller: collect small datasets and integrate them later into bigger databases. That way the project is much more likely to succeed. But hey, I'm not involved with the design anymore! Good luck.
@gijzelaerr thank you for the quick answer. Here are my answers:
That sounds amazing. Radio only or also other freqs?
Honestly I don't know, I need to check with the astro guys.
Do you want to fork it or contribute to the code?
Possibly; I'd prefer this to be a group effort, since we could help each other with our respective skills. That would also give the project a much, much bigger impact.
Are you in contact with any other astronomers who have been involved?
Nope.
Did you read my TODO document?
Yep, and I came to the same conclusion as you:
If I were you I would start over with a fresh Python 3 project and cherry-pick the elements that look usable. There is a lot of evolutionary code in the TKP repo that is not required anymore.
Restructure database setup since it is too complex now.
A complete restructure seems necessary to me, because the current structure does not marry well with that of a web app. Specifically, the web app has to manage multiple databases (created per user or project) and their login details, which is not practical at all. I have a few ideas that might make the marriage between the pipeline and the web app more manageable, including running the pipeline from the app itself.
How many sources do you expect to extract from your images? How many frequency bands will you process? what is your integration interval?
I don't have an answer to that; I need to speak with my astro guys.

Hi @AntoniaR, thank you for reaching out. Here are my thoughts about your post:
Migrating source association into Python
I agree, that would make more sense; then store only the end result (flux curve or whatever).
Return to supporting MonetDB, which has been substantially sped up for BlackGEM and MeerLICHT and would have multi-wavelength search benefits.
Not sure about this. MonetDB seems to give fast queries due to its column-oriented architecture, but I don't think a database is the future, at least for the computed data. My idea is that projects, jobs and images are all stored in tables in a single database holding only the details of the path where they sit (e.g. Project X, /PATH/TO/PROJECTX, description, tags, other metadata), while the actual data (including image thumbnails/cuts/croppings, possibly compressed, e.g. gz) lives in a file (e.g. HDF5 or parquet) inside the job folder. Additional tables that track the sources etc. will also be present, with the job paths, so the pipeline knows where the actual data reside in order to extract them and run future jobs on them (not sure if I made myself clear).
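To make that concrete, here is a minimal sketch of the idea, with a hypothetical jobs table and column layout; sqlite3 stands in for whatever database is chosen, and the parquet write assumes pyarrow or fastparquet is installed:

```python
import sqlite3
import pandas as pd  # to_parquet assumes pyarrow or fastparquet

# The database only holds metadata: which job, and where its data lives.
registry = sqlite3.connect("projects.db")
registry.execute(
    """CREATE TABLE IF NOT EXISTS jobs (
           project TEXT, job_id TEXT, data_path TEXT, description TEXT)"""
)

# The bulk data goes into a file inside the job folder...
measurements = pd.DataFrame({"ra": [123.4], "dec": [-45.6], "flux": [0.012]})
measurements.to_parquet("/PATH/TO/PROJECTX/job42/sources.parquet")

# ...and the database just records where to find it.
registry.execute(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    ("Project X", "job42", "/PATH/TO/PROJECTX/job42/sources.parquet", "demo run"),
)
registry.commit()
```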
Another detail of our images is that the sources are already extracted and stored in the header of the TIFF files, so we want a flag to turn the extraction off for that particular run of the project. I don't see this as a big deal, as it's just an extra feature.
In general I'm of the opinion that one database with multiple tables pointing to the actual files/images on disk or cloud storage (e.g. Amazon S3) makes the whole architecture scalable and web-app friendly, as the web app's ORM can be incorporated into the pipeline database and in the future become one fully integrated solution for both pipeline jobs and result analysis.
Not sure what your thoughts are. Open to discussion. Cheers, Serg
@AntoniaR btw, if you could send me an email at sergio.pintaldi@sydney.edu.au and put me in contact with the rest of the guys, I'd appreciate it!
@AntoniaR have you guys moved, or are you moving/planning to move, to Python 3?
Hi all!
If I had one piece of advice, it would be this: Don't try to build one mega-monolithic project that satisfies all use-cases. You will never even come close.
Instead, attempt to carefully consider and isolate the standalone 'units-of-complexity' that can be turned into libraries for separate testing and re-use. This in turn will force you to consider and design some sensible 'interface standards' for shuffling data around between disparate pieces of code.
From memory, probably the biggest candidate for that here is the 'source association via De Ruiter distances' code, since you already have PySE separated out somewhat (although having PySE as a standalone package would be even better). You could also potentially split out the 'lightcurve is constant / transient' routines. But honestly, those blocks may be too large or too small; I can't remember.
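For readers who haven't met it, the De Ruiter radius is the dimensionless, error-normalised angular distance used to decide whether two detections are the same source. A textbook sketch, not TraP's actual implementation (function and argument names are my own):

```python
import math

def de_ruiter_radius(ra1, dec1, ra2, dec2,
                     ra_err1, dec_err1, ra_err2, dec_err2):
    """Dimensionless, error-normalised distance between two positions.

    Positions and their 1-sigma uncertainties are all in degrees.
    """
    # RA offsets shrink towards the poles, hence the cos(dec) factor.
    cos_dec = math.cos(math.radians(0.5 * (dec1 + dec2)))
    dra = (ra1 - ra2) * cos_dec
    ddec = dec1 - dec2
    return math.sqrt(
        dra ** 2 / (ra_err1 ** 2 + ra_err2 ** 2)
        + ddec ** 2 / (dec_err1 ** 2 + dec_err2 ** 2)
    )

# Pairs with a radius below a chosen threshold count as associations.
```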
In short, my 2p is that you don't need a Trap-NG. You need a bunch of documented, tested libraries that can be arranged together in fairly short order to produce a 'Trap-NG UvA' or 'Trap-NG Sydney' or 'Trap-NG my_phd_peculiarities'.
Unfortunately, making code clean, tested, packaged and reusable is a lot more up-front work than simply throwing it all together to get to your end goal as fast as possible. It pays for itself in the long run, however. You can also set more achievable goals for a single PhD student / postdoc / staff software engineer, since they don't necessarily have to support the entire mega-project, just their own little library (cf. voevent-parse!).
As ever, there are no easy answers. Science is hard. Software is hard. Scientific software is fractally complex - which is why you should only attempt to tackle it in small, bite-size pieces. Good luck!
pyse is a separate package already! :)
https://github.com/transientskp/pyse
And yes I agree with Tim, it makes a lot of sense. Thanks Tim!
Ah right, I had a suspicion that might be the case but memory flipped a bit somewhere ;)
Meanwhile, MonetDB launched a new product called MonetDBe-Python!
https://github.com/MonetDBSolutions/MonetDBe-Python
This software moves away from the client-server model and is similar to the previously mentioned DuckDB.
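If it follows the standard Python DB-API (which it advertises), usage would look something like this; the table and path are made up, and I haven't verified this against a specific release:

```python
import monetdbe

# Embedded: the path is just a local database, no server process needed.
conn = monetdbe.connect("trap.mdbe")
cur = conn.cursor()
cur.execute("CREATE TABLE lightcurve (src INT, mjd DOUBLE, flux DOUBLE)")
cur.execute("INSERT INTO lightcurve VALUES (1, 58849.5, 0.012)")
cur.execute("SELECT * FROM lightcurve")
print(cur.fetchall())
conn.close()
```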
Maybe some useful discussion in here for the TraP redesign in R7
Originally reported here:
https://github.com/transientskp/trap-ng/issues/1#issuecomment-538239319
You're welcome.
That sounds amazing. Radio only or also other freqs? Do you want to fork it or contribute to the code? Are you in contact with any other astronomers who have been involved?
Great. That is what is needed.
So here are the questions:
Yes and no. The Django app came after, and is just made to visualise the results.
It is already.
If you store everything in a flat table you will get a lot of data duplication.
This architecture is the result of evolution. I think the biggest reason is speed: a database becomes extremely slow after processing 10K images or so, so you need to make a new database. For this reason and others, I'm of the opinion that a new version of TraP should not use the database for source association, but keep an in-memory model for this, and only keep the final light curve in permanent storage (the database).
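As a toy illustration of that split, with made-up names and a naive linear scan (a real implementation would need a spatial index and proper error-based matching):

```python
import math

ASSOC_RADIUS = 0.01  # degrees; illustrative threshold only

catalogue = []    # running sky model held in RAM: [(ra, dec), ...]
lightcurves = {}  # source index -> [(mjd, flux), ...]

def associate(ra, dec, mjd, flux):
    """Match a measurement to the nearest known source, or create a new one."""
    for idx, (cra, cdec) in enumerate(catalogue):
        dra = (ra - cra) * math.cos(math.radians(dec))
        if math.hypot(dra, dec - cdec) < ASSOC_RADIUS:
            lightcurves[idx].append((mjd, flux))
            return idx
    catalogue.append((ra, dec))
    lightcurves[len(catalogue) - 1] = [(mjd, flux)]
    return len(catalogue) - 1

# Process every image in memory like this, and only write the finished
# light curves to the database at the end of the run.
```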
Maybe. I've been playing around with that during the last week I worked for the University of Amsterdam. That source finder is optimised for optical data, though, and pyse is oriented towards radio data.
Did you read my TODO document?
https://github.com/transientskp/tkp/blob/master/TODO.md
In short what I think needs to happen is:
If I were you I would start over with a fresh Python 3 project and cherry-pick the elements that look usable. There is a lot of evolutionary code in the TKP repo that is not required anymore.