timj / aandc-fits

Paper on the FITS format for Astronomy and Computing journal
4 stars 10 forks

2.4.4. More detail/rewording needed for compression problem #25

Closed brianthomas closed 10 years ago

brianthomas commented 10 years ago

Hi, section 2.4.4. is interesting, but I am worried that we don't have enough support here for why compression in FITS is unsuitable. The FITS community largely feels that this is one of the details that FITS gets 'right'. From my reading of this section, what I think is being advocated here is not that tiled (Rice) compression is bad, but rather that it's only implemented in a convention, and other parsers/readers won't understand it, correct?

If that's the case, then the underlying problem is that our readers lack an API which is 'extensible' via a shared plugin mechanism (to point out one possible solution). Do I have that right? Assigning to you, Slava, since I think you created this section (if not, and you have no opinions, please assign back to me). -brian
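The shared plugin mechanism suggested above could be sketched as follows. This is a hypothetical illustration, not an existing API: readers register decompression codecs keyed by the convention's compression-type identifier (here the value of the tiled-compression convention's ZCMPTYPE keyword), so any parser can handle a compressed HDU as long as a matching plugin is installed.

```python
import zlib
from typing import Callable, Dict

# Hypothetical registry mapping a compression-type identifier
# (e.g. the value of the ZCMPTYPE keyword) to a decompression callable.
_CODECS: Dict[str, Callable[[bytes], bytes]] = {}

def register_codec(name: str, decode: Callable[[bytes], bytes]) -> None:
    """Let a third-party plugin add support for a compression scheme."""
    _CODECS[name] = decode

def decode_tile(cmptype: str, payload: bytes) -> bytes:
    """Any reader can decode a tile if a matching plugin is registered."""
    try:
        return _CODECS[cmptype](payload)
    except KeyError:
        raise ValueError(f"no plugin registered for {cmptype!r}")

# A plugin module would call register_codec at import time, e.g.:
register_codec("GZIP_1", zlib.decompress)
```

With such a registry, support for a new scheme would no longer be tied to one library: `decode_tile("GZIP_1", zlib.compress(b"pixels"))` recovers the original bytes through whatever plugin happens to be installed.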

timj commented 10 years ago

I think you have it right. The issue is not that compression in FITS is bad; it is that compression is not part of the standard and, from what I can tell, support is limited to cfitsio.

skitaeff commented 10 years ago

That's exactly right. FITS also dictates a serialisation that makes compression inefficient in terms of speed. This indeed needs some support, which is provided in our core JPEG paper; we'll need a reference there. cfitsio has done a great job enabling compression and trying to resolve some related issues with tiling, but one can only do so much while constrained by FITS. In the paper we demonstrate the dramatic difference in decoding images JPEG2000 vs FITS. Nevertheless, I'm going to have another look at that section.


embray commented 10 years ago

I'll add that while PyFITS does support the tile compression convention, it only does so by wrapping CFITSIO, which was also a bit more troublesome than it ought to have been, since I only wanted CFITSIO to touch the data, not the headers (which are managed separately by PyFITS, which has a certain amount of internal state to maintain). There's no reason it ought to have been that difficult, though. But this is more a problem of software than a failing of FITS itself.

skitaeff commented 10 years ago

There's no way to support tiles and compression without touching the header. It requires many parameters to be recorded to reconstruct the data. FITS only uses rather simple compression. It is simply impossible for FITS to handle compression as modern image formats do.


embray commented 10 years ago

There's no way to support tiles and compression without touching the header.

That is not true at all. There's little to nothing "FITS specific" about all those parameters. In fact CFITSIO reads them all out of the header and into a C struct that it updates during compression/decompression, and only goes back to update the FITS header at the end of the process.

The problem is that the FITS-specific and the non-FITS-specific bits are not as well separated as they could be. That's all.
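The separation being argued for here can be sketched as follows. This is a toy illustration, not PyFITS or CFITSIO code: the only FITS-specific step is translating header keywords into a plain parameter record (the analogue of the C struct CFITSIO fills), after which the compression machinery never needs to see a FITS header. The keyword names (ZIMAGE, ZCMPTYPE, ZTILEn) come from the tiled-compression convention; the record type itself is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TileParams:
    """Format-neutral record of tile-compression parameters —
    the analogue of the C struct that CFITSIO fills from the header."""
    compression_type: str
    tile_shape: List[int] = field(default_factory=list)

def params_from_header(header: Dict[str, object]) -> TileParams:
    """The only FITS-specific step: convention keywords -> plain record."""
    if not header.get("ZIMAGE"):
        raise ValueError("not a tile-compressed image HDU")
    shape = []
    n = 1
    while f"ZTILE{n}" in header:
        shape.append(int(header[f"ZTILE{n}"]))
        n += 1
    return TileParams(str(header["ZCMPTYPE"]).strip(), shape)

# Everything downstream operates on TileParams and raw bytes only,
# so it could in principle serve any container format, not just FITS.
```

Once the record is filled, e.g. from `{"ZIMAGE": True, "ZCMPTYPE": "RICE_1", "ZTILE1": 100, "ZTILE2": 100}`, the codec layer has no remaining dependency on FITS at all.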

skitaeff commented 10 years ago

I understand this. I'm only referring to the fact that in FITS the technical parameters of tiling/compression can only be recorded in the header, as they are, along with all the other keywords which represent the metadata. Only CFITSIO can interpret such a header.


embray commented 10 years ago

I'm just quibbling over technical details, but PyFITS can also interpret those keywords, and in fact does. It then fills the C struct that CFITSIO manages, all without CFITSIO ever getting its hands on the actual FITS header.

Point being that although the convention was designed for FITS, the software needn't be directly tied to FITS. Unfortunately this ends up being the case for many conventions designed around FITS. Astrodrizzle, for example, has FITS tentacles all over and throughout it, making it very difficult to separate the data models it works on from FITS.

brianthomas commented 10 years ago

Slava, thanks for that text. I saw a few duplications and tried to clean up the argument slightly. I reworded it to the following (below). Please let me know any problems/inaccuracies I created by so doing.

"The large data volumes translate directly into large costs for the archive and network infrastructures, and into high latency of data retrieval. Effective compression is a must for large datasets.

Currently one of the most common means to get compressed data is to utilize the \texttt{cfitsio} library which supports several types of binary and one type of image compression by convention \citep[see e.g.,][\href{http://ascl.net/1010.002}{ascl:1010.002}]{2000ASPC..216..551P, 2007ASPC..376..483S,2009PASP..121..414P}.

The core problem is then that the FITS standard does not provide any mechanism to store compressed data or images. This is problematic in two ways. First, as mentioned, one must rely on a third-party convention and library to get compression. This means that compatibility for the compressed files is not guaranteed, as the details of the compression are not part of the FITS standard. In the case of a standard like JPEG2000\footnote{\href{http://en.wikipedia.org/wiki/JPEG_2000}{http://en.wikipedia.org/wiki/JPEG\_2000}} all the details are specified as part of a coherent framework. This approach enables many libraries to be developed for the same format. Second, because a convention must be used, the serialization of FITS is not optimal in order to support the advanced compression techniques needed for adequate handling of multidimensional multi-terabyte datasets. Lower-level details, when made part of the standard, make a significant difference. For example, {\color{red} \textit{Missing ref} \citealt{2014arXiv1403.2801K}} demonstrate how JPEG2000 outperforms \texttt{cfitsio} in both compression ratio and image retrieval speed, in some cases by more than two orders of magnitude. "
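The appeal of tiling, which both the FITS convention and JPEG2000 exploit, is that each tile compresses and decompresses independently, so a reader can retrieve a small cutout without decoding a multi-terabyte array. A minimal sketch of the idea, using zlib as a stand-in for Rice or wavelet coding (the codec choice is incidental to the point):

```python
import zlib
from typing import List

def compress_tiles(data: bytes, tile_size: int) -> List[bytes]:
    """Split a flat byte buffer into fixed-size tiles and compress
    each tile independently."""
    return [zlib.compress(data[i:i + tile_size])
            for i in range(0, len(data), tile_size)]

def read_tile(tiles: List[bytes], index: int) -> bytes:
    """Random access: decompress a single tile without decoding
    the rest of the dataset."""
    return zlib.decompress(tiles[index])

data = bytes(range(256)) * 40      # stand-in for 10 KiB of pixel data
tiles = compress_tiles(data, 1024)
```

Here `read_tile(tiles, 3)` recovers bytes 3072-4095 of the original buffer while leaving the other nine tiles compressed, which is exactly the access pattern that makes tiled formats attractive for cutout services.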

skitaeff commented 10 years ago

Thanks Brian,

Some comments below.

I've also fixed the references. Not sure what happened there.

-Slava

Brian wrote: "The core problem is then that the FITS standard does not provide any mechanism to store compressed data or images."

"any mechanism" is a bit too strong. cfitsio has demonstrated that it is possible, though the efficiency is questionable due to the FITS serialisation constraints. I'd use "any standard mechanism".

Brian wrote: "Second, because a convention must be used, the serialization of FITS is not optimal ..."

It's not quite right. This problem stands on its own; it is not "second". It is not the convention or the lack of standardisation for compression; it is a problem of FITS serialisation. The way FITS files are structured makes them inefficient for large multi-dimensional data in general, whether the data is compressed or not. On top of that, it is quite cumbersome to implement compression while constrained by such a serialisation. cfitsio got it working, but, as we demonstrated, the efficiency is significantly lower than what one would want for large data.

brianthomas commented 10 years ago

Slava writes: It's not quite right. This problem stands on its own; it is not "second". It is not the convention or the lack of standardisation for compression; it is a problem of FITS serialisation. The way FITS files are structured makes them inefficient for large multi-dimensional data in general, whether the data is compressed or not. On top of that, it is quite cumbersome to implement compression while constrained by such a serialisation. cfitsio got it working, but, as we demonstrated, the efficiency is significantly lower than what one would want for large data.

Hmm. Well, that's the sense I got of the problem too: that the lower-level details of the serialization of FITS are a problem which limits the efficiency of the compression solution. By 'second' I meant it's the second point in the overall argument on how FITS does not treat compression correctly (both in that you must use a convention and that the serialization itself is limiting). Together these appear to argue (to me) that the compression solution needs to be native to the format (e.g. part of the standard). I guess this doesn't come through well in the rewrite. Can you suggest different wording?

brianthomas commented 10 years ago

Ok, closing as this issue is OBE (overtaken by events).