Open Jan-Willem opened 3 months ago
Here is an IPython notebook that compares the performance of casa-formats-io and python-casacore:
IPython notebook: casa_formats_io_vs_python_casacore.ipynb
Dataset (3.36 GB): VLASS3.2.sb45755730.eb46170641.60480.16266136574_spw10_split.ms.zip
On my Mac M3, casa-formats-io takes approximately 11 seconds and python-casacore takes approximately 3 seconds to read all of the main table data. Initial tests show that the time taken to read the data and perform reshaping (using np.fromfile and casa_formats_io._casa_chunking._combine_chunks) is comparable between the two libraries. Therefore, the performance difference is likely related to how the data gets organized.
Thanks, this is very useful! I haven't really done any performance optimisation in casa-formats-io at this point so I am sure there is a lot of low hanging fruit. I will have a think about the requirements and will follow up soon.
@astrofrog, any update on your thoughts about the requirements?
Sorry for not getting back to you sooner, I was off work for a significant fraction of the summer. I have had a chance to think about the requirements you mention, and have a few follow-up questions/comments.
First, do you need to be able to access just part of the data, or would you always load an entire column into memory?
Second, you mention 'Single-threaded (no Dask)' - note that it is possible to use dask in single-threaded mode, so just to make sure we are on the same page, do you object in general to making use of the dask API (specifically the fact that the astropy table we currently return has dask arrays that require .compute()
to be called) or are you ok with this as long as the computation is single-threaded? The main purpose of using dask is to be able to lazily load part of the data with a nice API.
The high-level API I was striving for here aims to completely hide away the details of a table to a user, and they would inspect the table using e.g. len(table)
and use the standard astropy Table API to get column information and so on. However, that high-level API may not be suitable for all.
If the use of the dask API is a deal breaker, maybe we could agree on a public lower-level API that both you and the dask interface could use.
@astrofrog, we now have some developer time available and plan to start looking into this. Have you had a chance to consider it further?
@Jan-Willem - sorry for the delay, I'll try and reply tonight! I'm sure we can find a way forward to avoid duplicating efforts, I'll write up some thoughts/suggestions this evening.
In any case, one thing that definitely needs doing is documenting the format, so that would be worthwhile starting if you have immediate time available.
I spoke with our developers and we agree that having a clear specification of the layout of the components of casacore tables would be very useful. While significant documentation exists for the table system, it falls short by being neither formal enough nor specific enough for a new implementation to be created based solely on its description. We will begin work on a specification. Our plan is to use Kaitai Struct to create a specification of casacore tables.
Kaitai has a number of advantages:
@Jan-Willem - Kaitai Struct seems nice, I didn't know about it! Just to understand, would this also be used to generate the actual parsing code? Is the idea that this parsing code would then be wrapped by a slightly higher level API that would then be the public API, or would the generated code be the public API? I suspect it would be the former as for example I doubt the generated parsing code would understand what to do with say incremental columns and so on, and so we'd need a layer that transforms the very low level parts of the file into meaningful e.g. numpy arrays and so on?
Would it make sense to collaborate on this under this present repository or a separate one in radio-astro-tools since ultimately the specification is language-independent and could potentially be used in parsers for other languages? For instance, we could make another repository called casa-formats-specification
which would be for the specification file and generated docs, and then once we have something working for simple cases we can have a discussion about how to proceed for the API on top of that specification? If you are happy with a separate repository, I'm happy to set it up, and can potentially add some of your developers as maintainers. I think it would also be good to make sure we follow best practices from the start and set up a rudimentary CI, and make contributions via pull requests (I'm happy to help review these in a timely fashion)
I do think we should use Kaitai to generate the parsing code since that will validate that we have recorded the schema correctly and reduce some work we would have to do.
A separate repository casa-formats-specification sounds like a good idea, and I agree we can discuss the API and integration into casa-formats-io once we have a rudimentary implementation. Having some CI and doing the work using pull requests sounds good.
Can you please add the following people as maintainers:
Done: https://github.com/radio-astro-tools/casa-formats-specification - I will set up a bare bones CI job that doesn't do much for now
As mentioned in https://github.com/radio-astro-tools/casa-formats-io/issues/63 here is what the NRAO would be interested in working on (casa-formats-io might already have some of these features):
ct.nrows()->int #Number of rows.
ct.get_key_words()->dict #Get the value of all table keywords. See https://casacore.github.io/python-casacore/casacore_tables.html#casacore.tables.table.getkeywords
ct.get_column_descriptions(col_names:list of str)->dict #Dictionary where the keys are the column names and the values are column descriptions. See https://casacore.github.io/python-casacore/casacore_tables.html#casacore.tables.table.getcoldesc.
ct.get_columns(col_names: list of str)->dict #Dictionary where the keys are the column names and values are n-dimensional arrays. Should throw an exception when the selected rows don’t have consistent dimensions. For example, selecting the DATA column in the main table and the rows span multiple DDIs. Note that selecting rows that span multiple DDIs for columns that have constant shape should be allowed, for example, SCAN_NUMBER, STATE_ID, DDI, etc.