pradosoft / prado

Prado - Component Framework for PHP

TAssetManager ability to publish dynamically generated data/files #835

Open belisoful opened 1 year ago

belisoful commented 1 year ago

While working on TDot, the need to publish dynamically generated data/files/SVG is becoming clear. In Debug mode, the dot SVG can be encoded as data into the image URL, but I'd like the WebControl to publish its dynamically generated SVG in Normal and Performance mode. Currently, the data would need to be saved to a "/tmp" file, then copied to "assets", then deleted. That is not ideal.

I think it would be much better to have a new TAssetManager method and associated interface. Here is what I propose:

TAssetManager has new methods publishAsset and writeAsset that work on an interface IAsset, generating the data on demand, when needed and only when needed.

    public function publishAsset($data, $checkTimestamp = false)
    {
        if (!($data instanceof \Prado\Web\IAsset)) {
            // Report the type rather than casting: a (string) cast fails for most objects.
            throw new TInvalidDataValueException('assetmanager_invalid_asset', is_object($data) ? get_class($data) : gettype($data));
        }
        $path = $data->getAssetFilePath();  // This is a virtual file path with file name
        if (isset($this->_published[$path])) {
            return $this->_published[$path];
        } elseif (empty($path) || ($fullpath = realpath($path)) === false) {
            // Note: realpath() returns false for a path that does not exist on disk,
            // so a purely virtual path would need a different normalization here.
            throw new TInvalidDataValueException('assetmanager_filepath_invalid', $path);
        } else {
            $dir = $this->hash(dirname($fullpath));
            $fileName = basename($fullpath);
            // The destination is a filesystem location, so it is built from the base path, not the base URL.
            $dst = $this->_basePath . DIRECTORY_SEPARATOR . $dir . DIRECTORY_SEPARATOR . $fileName;
            if (!is_file($dst) || $checkTimestamp || $this->getApplication()->getMode() !== TApplicationMode::Performance) {
                $this->writeAsset($dst, $data);
            }
            return $this->_published[$path] = $this->_baseUrl . '/' . $dir . '/' . $fileName;
        }
    }

    protected function writeAsset($dstFile, $data)
    {
        $dst = dirname($dstFile);
        if (!is_dir($dst)) {
            @mkdir($dst);
            @chmod($dst, Prado::getDefaultPermissions());
        }
        $dstMod = @filemtime($dstFile);
        if ($dstMod === false || $dstMod < $data->getAssetModificationDate()) {
            Prado::trace("Publishing Data $dstFile", 'Prado\Web\TAssetManager');
            $data->writeAsset($dstFile);
        }
    }

The interface for IAsset:

interface IAsset
{
    /**
     * The virtual file path. eg. "/myVirtualDir/virtualFile.svg"
     * The file path can be dynamically generated and each difference will
     * publish a new file.  This is for publishing dynamic application assets,
     * like the TDot SVG.
     * @return string the virtual file path.
     */
    public function getAssetFilePath(): string;
    /**
     * The modification date of the asset. Unless the asset changes over time,
     * this can simply return 0.
     * @return int the modification date of the asset.
     */
    public function getAssetModificationDate();
    /**
     * This generates and writes the data of the asset to the $dst.  The path is 
     * typically a web accessible directory, like in the TAssetManager.  This can
     * stream data into the file so as to not take up large amounts of memory.
     * @param string $dst writes the asset to a file.
     */
    public function writeAsset($dst);
}

The way I see this being used is, for example, TDot implementing IAsset. Colors and sizes are encoded into a file name (with sizes being grouped [e.g. 1..50 is 1, 51..100 is 2, etc.]), which is dynamically generated and written to the assets directory, on demand by TAssetManager, where needed.
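As a rough illustration of that idea, here is a minimal sketch of a class implementing the proposed interface. The interface is copied from above so the block runs on its own; the class name `DotAsset`, the `/dots/` virtual directory, the file name layout, and the 50-pixel bucket are all assumptions made for the example, not the actual TDot code.

```php
<?php
// Copy of the proposed interface so this sketch is self-contained.
interface IAsset
{
    public function getAssetFilePath(): string;
    public function getAssetModificationDate();
    public function writeAsset($dst);
}

// Hypothetical stand-in for TDot; name, bucket size, and path layout are assumptions.
class DotAsset implements IAsset
{
    private string $color;
    private int $size;

    public function __construct(string $color, int $size)
    {
        $this->color = strtolower(ltrim($color, '#'));
        $this->size = max(1, $size);
    }

    // Sizes 1..50 map to bucket 1, 51..100 to bucket 2, and so on.
    public function getSizeBucket(): int
    {
        return intdiv($this->size - 1, 50) + 1;
    }

    // The virtual path encodes every parameter that changes the output.
    public function getAssetFilePath(): string
    {
        return '/dots/dot-' . $this->color . '-' . $this->getSizeBucket() . '.svg';
    }

    public function getAssetModificationDate()
    {
        return 0; // the SVG for a given file name never changes
    }

    public function writeAsset($dst)
    {
        $d = $this->getSizeBucket() * 50; // rendered diameter in pixels
        $r = $d / 2;
        $svg = '<svg xmlns="http://www.w3.org/2000/svg" width="' . $d . '" height="' . $d . '">'
            . '<circle cx="' . $r . '" cy="' . $r . '" r="' . $r . '" fill="#' . $this->color . '"/></svg>';
        file_put_contents($dst, $svg);
    }
}
```

Two dots that fall into the same size bucket share one virtual path, so only one file would be published for both.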

Off the cuff, there are about 100 million+ combinations of colors and sizes that the TDot could typically take, but less than a dozen will typically be used by an application. Those < dozen should be published Assets, even small < 1k SVGs.

This could provide custom Asset publishing and custom cache busting logic possibly referred to in #534. Someone could make a TImageFilterAsset, where the Asset Path is specified (not virtual) along with the name of the "filter" (e.g. sharpen, blur, "sparklize", A.I.-googlie-eyes, etc.) and where needed, the writeAsset method would read the JPEG, apply the filter, and publish/write the data to the web accessible assets folder. While this is not "automatic" in the template, it is something that the TPage could do (as in, set up the TImageFilterAsset and publish it, providing the URL to the controls on the page).

Again, this is not ideal for user based assets but for application wide assets. It can be used to publish user assets, but the asset folder could get very big.

belisoful commented 1 year ago

I pumped out a TFileAsset class which mimics publishFilePath but through publishAsset. Subclasses can create unique URLs based upon their parameters, override writeAsset to allow for things like image filters on publish, and add custom cache busting logic.

So a TFileAsset subclass that blurs its image file by 50 px would change the file name of the asset from "/path/myImage.jpg" to "/path/myImage.blur-50px.jpg" (if the original asset also needed to be accessed as well).
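The renaming itself is simple string work. A minimal sketch, assuming a hypothetical helper named `variantAssetPath` (not part of PRADO) that inserts the variant tag before the extension:

```php
<?php
// Hypothetical helper: insert a variant tag before the file extension.
function variantAssetPath(string $path, string $variant): string
{
    $info = pathinfo($path);
    // pathinfo() reports '.' as the directory of a bare file name.
    $dir = ($info['dirname'] === '.') ? '' : rtrim($info['dirname'], '/') . '/';
    $ext = isset($info['extension']) ? '.' . $info['extension'] : '';
    return $dir . $info['filename'] . '.' . $variant . $ext;
}
```

Because the variant tag becomes part of the published file name, the original and the filtered asset can sit side by side in the same assets directory.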

Otherwise it would just write the filtered data with a file name of the original.

I can see the development of a TImageFilterAsset for applying GD filters to images on publish. That is something that would be wonderful for someone to contribute as this gets tested and committed. It could be used to automatically publish thumbnails images along with the original files and blur NSFW images on a platform (as user content or group indicated), for instance, like Steemit and Reddit.

I can also see the development of a TMultiResolutionAsset that takes an image and publishes multiple resolutions, like mini, small, medium, large, XL, and original size Assets. The configuration and interface for such a class would be very important. Each size would be its own TFileAsset/subclass encapsulated by the TMultiResolutionAsset. The other sizes would be published on IAsset::writeAsset.

belisoful commented 1 year ago

If TTemplate publishes assets using publishAsset(new TFileAsset($filePath)) rather than publishFilePath($filePath), behaviors can [conditionally] modify the function of the publish through a class-wide behavior on TFileAsset. TFileAsset being behavior aware, of course.

publishFilePath internals can just be deleted and replaced by return $this->publishAsset(new TFileAsset($filePath)) so all files get the new functionality.

This is very interesting. thoughts?

belisoful commented 1 year ago

I'm making the TFileAsset behavior aware and i think this is going to open up a new dimension for filtering and publishing PRADO assets. eg. a JPG image compressor behavior could be applied to all jpgs so all published jpgs aren't gigantic high res file jpgs but recompressed to only 25% quality on publish.

belisoful commented 1 year ago

When publishing, the new asset writer copies the file to a temporary file in assets, allows behavior filters to work, then renames the file to the destination. With these additions, I have a TJPEGizeAssetBehavior filter, that changes images to jpeg with a configured quality (GD image library required). Obviously, it changes the file extension names too. Interestingly, i put an exception on JPEGizing image files that have names ending in ".full" or ".original" as a bypass.
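The stage-then-rename sequence can be sketched as follows. This is only an illustration of the pattern: the function name `publishViaTempFile` and the filter signature (a callable receiving the staged file path) are assumptions, not the actual asset writer.

```php
<?php
// Hypothetical helper; each filter is assumed to be a callable that rewrites the staged file in place.
function publishViaTempFile(string $src, string $dst, array $filters = []): void
{
    $dir = dirname($dst);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory $dir");
    }
    // Stage next to the destination so rename() stays on one filesystem.
    $tmp = tempnam($dir, 'asset');
    if ($tmp === false || !copy($src, $tmp)) {
        throw new RuntimeException("Cannot stage $src");
    }
    foreach ($filters as $filter) {
        $filter($tmp);
    }
    // The destination only ever appears fully written.
    if (!rename($tmp, $dst)) {
        @unlink($tmp);
        throw new RuntimeException("Cannot move staged file to $dst");
    }
}
```

A reader never sees a half-written asset this way. Note that tempnam() creates the file with restrictive permissions, so a real publisher would likely chmod the result before serving it.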

belisoful commented 1 year ago

BTW, this is now a use case for behaviors within behaviors. There are filter behaviors attached to a publish-image processing behavior.

Very interesting.

Basically, FYI, virtual image assets are being made available. This way, in a template you can refer to <%~ images/myImage.thumb.jpg %> but where the ".thumb" is a virtual asset that is computed resized thumbnail of the original. It publishes the original file passed through the filter and saved as the virtual asset.

I knew this day would come. lol.

belisoful commented 1 year ago

Here's what is going on. TAssetManager has a new method publishAsset that publishes IAsset classes. Strings are assumed to be paths and converted into TFileAsset.

The TAsset is fully behavior aware and is the Base class implementing IAsset and is the parent to TFileAsset.

There are many behaviors: TAssetBlocker, TAssetDuplicate, TAssetRouter, TAssetVirtualize, and, importantly, TAssetImageFilter for doing image processing. The TAssetJPEGize and TAssetPNGize work as image processors.

The TAssetImagerBase[Behavior] has several filters assigned as behaviors. This is behaviors on behaviors.

The Filters for the TAssetImagerBase follow php GD: TAutoCropImagerFilter, TBlurImagerFilter, TConvolutionFilter, TFilterImagerFilter, TGammaImagerFilter, TResizeImagerFilter, and TWatermarkImagerFilter.

The TAssetImagerBase allows for configuration of filters in the application configuration or in a specified file. The modification date is max() with the file modification date. If the filter file gets updated, so do all the assets.

filter files are parsed and cached for speed.

Now for the behaviors that can be added at a class level:

~22 classes. 3 core files: IAsset, TAsset, and TFileAsset. The rest are behaviors changing the behavior of assets and asset publishing. Including an image processor and compressor

belisoful commented 1 year ago

This basically abstracts asset publishing and breaks down each function of publishing into its component parts, ready for manipulation by attached behaviors.

I'm proofing, writing the phpdoc, and the unit tests now.

belisoful commented 1 year ago

Lots of testing. Lots and lots of testing.

belisoful commented 1 year ago

Unit Tests are written and all passing. I did the first proof pass on the core Asset PHPDoc. All is looking great.

The only thing left is the Publishing Image Processor behavior wrap-up and an example virtual Prado asset. The TDot will be included as an example of how to use the new virtual asset publishing functionality. The image Processor, filters, and TDot need unit tests and PHP Doc Proofing.

Getting there.

belisoful commented 1 year ago

My humble apologies for the delay on this. It is more work than I initially thought. The core is mainly complete and documented but is getting a few minor tweaks while working on the Asset Publish Image Processor (resizing, watermarks, format changes, meta data, blur, grayscale, convolution, etc.) and it is taking a lot of extra time that I had not thought was needed. I'd like it to be more complete before posting for comment and review. It is an amazing piece of technology. It may not have been on your xmas/holiday gift list, but it is on mine.

It breaks down each component of publishing an asset and parameterizes the process.

Here is the computing of the asset file path:

1) Set the asset file path in TFileAsset (subclass of TAsset, implements IAsset [mod date, file name, and publish file to path]).
2) The Asset File Path is Pre-Validation filtered and possibly modified by behaviors.
3) The File Path is validated.
4) The Asset File Path is Post-Validation filtered and possibly modified by behaviors to the stored saved original FilePath.
5) If there is a virtual file path, that is used; otherwise the Original File Path is used, and it is possibly rewritten/filtered to the final AssetFilePath.

The Final AssetPath is used as the path to publish in TAssetManager. It can be an original, modified original, or a virtual file path, and those can be rewritten by behaviors.

Finally, in the asset publishing there is a final step allowing behaviors to change the final destination file path from the computed AssetFilePath. This final dynamic event is used to convert file types, like jpg/png => webp, and to convert the tar file to its destination published path. [there is a dynamic event for the modification file to allow the TTarAsset to look at the destination md5 for modification time]

I could talk all day about it. It'll be up as the unit tests are finished and documentation wrapped up. Half the imager needs unit tests and 2/3 of the filters need unit tests. Also, the meta data needs unit tests and etc etc as well. MetaData is huge in publishing images. When matching which image files to process, the default is all image file formats it can process. But it's possible to image process just the jpg with meta data that matches specific values. So only post process the publishing of files with Copyright, or if the copyright is a specific year

There is a MetaData image filter for clearing and adding meta data. there is a specific option for clearing GPS image metadata from EXIF (well... it'll be in there shortly. the idea is very solid). So far there are 8 different metadata classes: Common, JFIF, EXIF, IPTC, JPEG, Photoshop, TIFF, and XMP Meta Helpers. XMP is an order of magnitude more complex than EXIF. EXIF is an order of magnitude more complex than IPTC. IPTC is an order more complex than JFIF. XMP will only have minimal support due to it needing a whole sub-system itself. If we want this out in reasonable time, full XMP support will be another time-place. I'm thinking of just wiping the XMP data and going with EXIF and IPTC... worry about XMP synchronization another time.

The IPTC reader and writer is already better than the PHP iptcparse function, and the writer does a better job encoding the values. It will render Unix times to the respective IPTC dates and times, for instance.

Each day has progress. The unit tests are coming along slowly. I'm finding some odd edge cases that are being addressed.

The core is: IAsset, IAssetFinalizer, IAssetPublishedCapture, TAsset, TAssetEventParameter, TFileAsset, TTarAsset. Also changes to TAssetManager.

The Behaviors so far are: TAssetBlocker, TAssetDiscovery (for changing IAsset Class based on file path), TAssetDuplicate, TAssetImagerBase (core for opening and publishing image processing), TAssetImageFilter (saving TAssetImagerBase), TAssetJPEGize, TAssetPNGize, TAssetVirtualize (for creating virtual files linked to real files, that can then be processed separately, eg by TAssetImagerBase), and TAssetWebPize.

WebP (pronounced "Weppy") is an image file format created by Google in 2010. It supports alpha channel, true color, palette color, animation, and both lossy and lossless compression. It is supported by all browsers except IE11 at this time (2023). It typically gives smaller files than JPEG and PNG at comparable quality. There is a behavior for automatically converting your published image assets into this format (with as many options as possible) to save bandwidth.

The support for WebP in PHP-GD is weak at best, but it is there. Better WebP support can be baked in later if/when that becomes now.

The image filter behaviors:

belisoful commented 1 year ago

I was hoping to start tackling some of the other issues by this point. This needs proper attention to get it out the door. Publishing virtual assets like the TDot is a very big advancement in PRADO's publishing abilities.

I spent the last day tackling downconversion of transparency from true color to palette. What a mind bender regarding all the various options of inputs, outputs, and configuration options. There are 8 options for making a true color image into a palette. What a rabbit hole. unit tests are on nightmare mode.

belisoful commented 1 year ago

This is a DEEP DEEP dive. OK, it turns out that getimagesize returns some of the JPEG metadata but not all of it. This is important because there is an Image Filter publishing behavior that reads an image, applies filters (blur, resize, watermarks, etc.) and writes it back out (in place, at the destination). The problem shows up with XMP data: duplicate JPEG APPn markers are not reported by getimagesize. EXIF and XMP are both located in APP1 markers in JPEGs, so there are two APP1 markers, but only the first gets reported by getimagesize.

This is the issue taking so long: Asset Publishing filtered images should keep the meta data. and requires meta data processing be written from scratch.

The JPEG metadata reader/writer needs custom code. So does PNG and WebP. I haven't even started on PNG and WebP reading and writing meta data (though that is easy, PNG has "chunks" and WebP is RIFF format).

Along with retaining Image MetaData, there is a filter for adding, removing, and appending meta data and a part of the filter file matching is matching metadata. Very cool.

Everything about metadata needs to be done from scratch because the built in php methods are not good enough. IPTCparse and EXIF parsing are weak and EXIF cannot be written.

Just so this doesn't take forever, XMP is NOT going to be supported despite getting the data. XMP is going to be removed for EXIF and IPTC data until proper XMP support can be coded later.

Right now, I have JFIF, JFXX extensions (for thumbnails), and IPTC working. EXIF is being decoded.

I have a few unit tests left to write for the Image Reader, finalize the JFXX unit tests, IPTC Unit tests, EXIF writer and unit tests, PNG/WebP metadata, and the MetaData unit tests are still pending. I'm making great progress.

belisoful commented 1 year ago

There should be an option to remove GPS data from EXIF. This is a driving factor in metadata processing: Security.

belisoful commented 1 year ago

I wanted to say: Integrating Image MetaData into Prado is an engineering challenge.

belisoful commented 1 year ago

This is looking to be the most complete JFIF, JFXX, IPTC, JPEG, PNG, WEBP, EXIF metadata handler. Fitting for PRADO. I am most excited for how this is turning out. Delays, but WOW. As big as this release is, 60+ new classes, everything here is being unit tested. It is industrial quality and proofed and proofed and proofed again.

I am hoping this becomes a new key selling feature for PRADO... Configured Image manipulation on publishing is SO SO SO powerful.

belisoful commented 1 year ago

The IPTC is complete and looking great. Very developer friendly and accessible. Meta Searching has one parameter search. It doesn't set metadata yet but that's relatively easy comparably. [That happens with an image filter with configuration.]

I'm looking at a great implementation of EXIF for reading, searching, modifying, and writing. Here is what i'm coming across. EXIF has two data structures that link. One is the container. Most instances it is an array. this can be made a concrete class for reading and writing. The second data structure is the Image File Directory itself, with endian based upon the first structure.

The great news is that knowing this allows for proper differentiation and class structure. here is my thinking:

The EXIF container should provide general accessibility to all fields and as well as specific IFD accessibility. eg "GPS:GPSLatitude"

Many implementations read the data, but few write the data. None of the field meta is held, like field type and field keys. type is usually scrapped and field keys are translated to the names. The spec says that if a field cannot be identified it is still retained. Fields that branch to their own unidentified IFD are an issue. They are encoded as a long with the offset to the data. The data is never read or retained on unknown IFD fields.

My sense is that that is what the "free" fields do: they identify private data chunks? However, they can? move around by reader/writers that have no knowledge of private IFDs, (maybe, still studying the standard). So. I am still at a loss of understanding in how to preserve private IFD and length/count data when an IFD changes. Say a field is added and pushes all the data back by 50 bytes, then the Private IFD is no longer pointing to the proper data location. "Tracking all the data access in the binary to identify what is NOT accessed" is not going to happen. Rewriting a binary around the unaccessed data to preserve location is possible but just sound like a long-shot task with little utility at this point. While the standard discusses private IFDs for ones own application and format, it is not required to be supported in EXIF, at least i don't think so. I'm noodling on this one for now. Any input or experience would be useful.

Interestingly, there are a few things about the spec not found in most implementations. For writing integers, the official spec defines a char or a short, but if the value can only be represented properly in a higher byte count int, then that is used. Also, when an EXIF block is larger than 64k, it will not fit in the 64k limit of a JPEG marker, and so they allow it to be extended from APP1 into as many APP2 markers as are needed. I doubt many implementations do this. This is already a long road, so that'll have to be for another time.

XMP metadata is also written to the APP1 marker and is dropped by PHP's implementation of getimagesize. This is not tested, but I think getimagesize only returns the first marker it finds for each APPn and skips the rest rather than returning them in an array under the JPEG APPn marker. So custom JPEG marker parsing/writing utility functions are needed. This is done, but I wanted to document this quirk in PHP regarding XMP data.
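To make the marker layout concrete, here is a minimal sketch of a segment scanner that keeps duplicate APPn segments. The function name `listJpegAppSegments` is hypothetical and this is not the PRADO utility; it ignores fill bytes and stops at the start-of-scan marker, which is enough to show EXIF and XMP living in two separate APP1 segments.

```php
<?php
// Returns every APPn segment found before the start-of-scan marker, duplicates included.
function listJpegAppSegments(string $jpeg): array
{
    $segments = [];
    if (substr($jpeg, 0, 2) !== "\xFF\xD8") {
        return $segments; // not a JPEG: missing SOI marker
    }
    $pos = 2;
    $len = strlen($jpeg);
    while ($pos + 4 <= $len && $jpeg[$pos] === "\xFF") {
        $marker = ord($jpeg[$pos + 1]);
        if ($marker === 0xDA || $marker === 0xD9) {
            break; // SOS or EOI: no more metadata segments
        }
        // The segment length is big endian and counts its own two bytes.
        $size = unpack('n', substr($jpeg, $pos + 2, 2))[1];
        if ($size < 2 || $pos + 2 + $size > $len) {
            break; // truncated or corrupt segment
        }
        if ($marker >= 0xE0 && $marker <= 0xEF) {
            $segments[] = [
                'name' => 'APP' . ($marker - 0xE0),
                'data' => substr($jpeg, $pos + 4, $size - 2),
            ];
        }
        $pos += 2 + $size;
    }
    return $segments;
}
```

An EXIF payload starts with "Exif" followed by two NUL bytes, and an XMP payload starts with the "http://ns.adobe.com/xap/1.0/" namespace string followed by a NUL byte, so the two APP1 segments can be told apart by their leading bytes.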

PHP's pack/unpack have no format code for a big endian signed short and, on 32 bit systems, cannot properly represent a ulong32. To account for this, signed shorts are post-processed with two's complement. I'm not sure what to do with unsigned long. As of this moment, PHP floats are CPU doubles with 8 bytes. When a 32 bit long is less than 0 but is supposed to be unsigned, it can be converted into a float (8 bytes) of the proper value with the two's complement undone, outside the bounds of a PHP signed 32 bit long.
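A minimal sketch of both workarounds, using only pack/unpack with the unsigned big endian codes 'n' and 'N'. The function names are made up for the example and are not the TDataReader API.

```php
<?php
// Read a big endian signed 16 bit integer; unpack only offers 'n' (unsigned) for this byte order.
function readInt16BE(string $bytes): int
{
    $v = unpack('n', $bytes)[1];
    return $v >= 0x8000 ? $v - 0x10000 : $v; // undo two's complement
}

// Write a big endian signed 16 bit integer by masking to the unsigned bit pattern.
function writeInt16BE(int $value): string
{
    return pack('n', $value & 0xFFFF);
}

// Read a big endian unsigned 32 bit integer.
// On a 32 bit build the result of 'N' can come back negative, so it is widened to a float.
function readUInt32BE(string $bytes)
{
    $v = unpack('N', $bytes)[1];
    if ($v < 0) {
        $v = (float) $v + 4294967296.0;
    }
    return $v;
}
```

On 64 bit builds the last branch never triggers and the function simply returns an int.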

Lastly, IFDs have a data type called Rational (unsigned) and its signed counterpart, SRational. A float is converted into two integers, numerator and denominator, so "71.4" is converted into "714" divided by "10". The algorithm is the continued fraction. https://en.wikipedia.org/wiki/Continued_fraction There are basic PHP implementations of it all over and it is easy to implement. Unit testing IFD-TIFF-EXIF is going to be important.
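For reference, a minimal sketch of the continued fraction approach. The function name, the denominator limit, and the tolerance are assumptions chosen for the example. Note that the convergents come out in lowest terms, so 71.4 yields 357/5 rather than 714/10; both encode the same value.

```php
<?php
// Approximate a float as [numerator, denominator] using continued fraction convergents.
function floatToRational(float $value, int $maxDenominator = 1000000, float $tolerance = 1.0e-9): array
{
    // Convergent recurrence: h(n) = a(n) * h(n-1) + h(n-2), and the same for k.
    $h0 = 0; $h1 = 1;
    $k0 = 1; $k1 = 0;
    $x = $value;
    for ($i = 0; $i < 64; $i++) {
        $a = (int) floor($x);
        $h2 = $a * $h1 + $h0;
        $k2 = $a * $k1 + $k0;
        if ($k2 > $maxDenominator) {
            break; // keep the previous convergent
        }
        $h0 = $h1; $h1 = $h2;
        $k0 = $k1; $k1 = $k2;
        $frac = $x - $a;
        if ($frac < $tolerance || abs($value - $h1 / $k1) < $tolerance) {
            break; // exact, or close enough
        }
        $x = 1.0 / $frac;
    }
    return [$h1, $k1];
}
```

The denominator cap matters for EXIF because both parts of a Rational have to fit in 32 bits.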

Once EXIF is complete, then to get wrapped up: setting meta data (easy) via filter, #852, searching multiple metadata fields, and the remaining image [unified] metadata unit tests. Oh and the final revision of the error handling and php docs.

Should PNG and WebP support for meta data be implemented in this first iteration? PNG has chunks of data, and many examples of reading EXIF and XMP on the web, but there are some subtle nuances regarding single field values and metadata compression. WebP uses RIFF format (used commonly by audio files). I don't see PNG or WebP metadata adding too much time to development. I may pause development there without PNG/WebP support and submit the code for review and merge.

There is little that hasn't been long thought about.

belisoful commented 1 year ago

Brief update. The Free space doesn't move, and data can be written around it [on writing] to preserve private data (IFD) spaces. If the data being written collides with a free space, it is moved to after the free space. Then there are two locations to try to write to: the first space before the free space is used until there is no room left in it, then writing goes to the second space.
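The collision rule can be sketched as a small placement function. This is an assumption-heavy illustration: the name `allocAround`, the `[start, length]` range format, and the requirement that ranges are sorted are all made up here, and unlike the behavior described above it does not go back to fill earlier gaps.

```php
<?php
// Find the first offset >= $cursor where $size bytes fit without touching a reserved range.
// $reserved is a list of [start, length] pairs, assumed sorted by start and non-overlapping.
function allocAround(int $cursor, int $size, array $reserved): int
{
    foreach ($reserved as [$start, $length]) {
        $end = $start + $length;
        if ($cursor + $size <= $start) {
            break; // the data fits before this reserved range
        }
        if ($cursor < $end) {
            $cursor = $end; // collision: move to just after the reserved range
        }
    }
    return $cursor;
}
```

With reserved ranges at 10..14 and 20..23, an 8 byte write starting at offset 6 collides with both and lands at offset 24, while a 4 byte write starting at offset 16 fits in the gap and stays where it is.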

Also, EXIF is a rich source of enhancement for PRADO. I think it will require an IO TDataReader/Writer class(es) that can handle both big and little endian. This will make it easier for any app to read and write binary files. A lot of data and metadata are turning into XML these days. Binary files are still critical.

regardless, PRADO has no binary reader or writer class. It would be beneficial to have such a thing. This is why it is important to do proper implementations in PRADO. We can find the places lacking and fill the holes.

belisoful commented 1 year ago

Metadata is very, very important for some people and Business [and thus all people; "one-ness"]. Having a proper implementation-tech is key for further adoption of the platform. This will be another selling point for PRADO (like Cron and Permissions). At some point XMP is going to need to be implemented for this reason.

there is a lot of opacity behind IPTC, exif and their various implementations. as in, even PHP's iptcparse and exif_read_data isn't clear.

So, here is some clarity in my studies:

Making it easy to get and edit image metadata on a website, both in custom code and asset publishing, is the requirement for PRADO MetaData. use case: 1) inject copyright, Image Comment, or tracker code on file upload or publish. 2) capture [or remove] the image thumbnails (JFXX and EXIF) on upload (eg. for display to the user).

The Imager filter makes use of these metadata facilities to read, write, search, inject and clear metadata. there is a lot of configuration possibilities for ImageFilter Metadata. Very exciting.

The TDataReader and TDataWriter are looking awesome. Easy to use. It meets the stiff requirements I've been able to identify. The unit tests are written. It accounts for a few nuances of PHP, Like PHP pack doesn't have signed 16 bit short (in specific endian) so needs further process for twos complement and how 32 bit systems can't handle int between 2147483648...4294967295. If bits are not preserved, those numbers are converted to 64 bit floats, and 64 bit floats in that range are adjusted to signed int for proper writing. I'm finding it very useful. There may be some regression needed to update the alpha code, like updating the JPEG meta reader/writer to use the new TDataReader/Writer. ho hum.

The EXIF abstraction arrangement is being finalized (TIFF and EXIF specific IFD subclasses) and official spec configuration completed. Utility functions are next; like getting and setting the thumbnail to and from a GD image.

The Meta Data classes should act just like the array output of iptcParse and exif_read_data but have tons of added utility (maintaining proper ordering, getting/setting/deleting image thumbnails, and writing itself). The tagids are embedded as "public const" to avoid the name => tag id conversion in most instances. Values are accessible by tag name as well. Making metadata access consistent, open, and more easily understood.

belisoful commented 1 year ago

Patience.... (a reminder for myself.)

belisoful commented 1 year ago

Quick update. The new code is remarkably small for EXIF given its complexity. It's mainly the organization and configuration that needs time to proper input and abstraction... which is coming together now. Things have been moved around and sorted.

It should be 100% compliant to EXIF (Main Image, read/write JPEG thumb, EXIF, GPS). 95% compliant (% tags; not reading/writing images) to TIFF but 80% in the totality. We could encode various EXIF values as Enum Classes, specific IFD/folder tags for specific camera makers, etc. These will likely end up being an exercise for someone else if they want to make it so complete. I like the idea of it, but this needs to be completed in P time, not NP time. The infrastructure is there for such custom IFDs.

This EXIF is written in C and has the best compliance I've seen. https://github.com/libexif/libexif/

belisoful commented 1 year ago

I finally have proper form in the EXIF-TIFF abstraction.

The TDataReader and TDataWriter encapsulates an "fopen" (or other streams) for reading/writing binary data. They are in the "Prado\IO" folder. These classes have some great features that are great to have for PRADO in general. I am very glad to be giving and having this time for such utility classes. Only recently are these classes being finalized.... There are a lot of basic requirements for EXIF.

TDataReader can read signed and unsigned char, short, & int, float, double, string, C-String (NUL terminated string), and can read a specified length or the whole file.

TDataReader has 4 main features:

0) Big and Little Endian support. TDataReader implements both big (Motorola) or little (Intel & ARM) endian, by setting setByteOrderMotorola. The term comes from the output of exif_read_data, for the sake of consistency.
1) pushOffset($nextOffset = null) and popOffset($skip = null) are methods that save/replace and restore the file read/write pointer. This is most utilitarian when jumping around a file in reading/writing data with data pointers. The TIFF standard sets this up.
2) StartOffset and FixedLength properties. A data stream needs to be able to be "narrowed" down on a specific data run with specific qualities. The StartOffset is used to adjust ftell so that the Start of the file at 0 file pointer is at the StartOffset. Use Case, reading JPEG markers is Big Endian, but then the TDataReader needs to narrow the specified data to one JPEG marker (IPTC or EXIF [possibly with little endian]). EXIF measures its data location from the start of its Data Marker in JPEG; as if the EXIF section was isolated by itself. The StartOffset and FixedLength allow sections of JPEG to be read independently and without the need to copy the data to memory to work with it.
3) pushSection($startOffset, $length) and popSection. This saves the Endian, Offsets, StartOffset, and FixedLength for its own section of the file with a new start offset and length. This is to easily isolate/push the EXIF section from the JPEG marker and then pop the EXIF section out and return to the JPEG when done processing the EXIF.

The TDataReader push/popSection allow sub-classes to save and then restore data as well. This is needed for the TDataWriter. (see # 6, below)
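To illustrate the endian switch, the StartOffset adjustment, and the offset stack, here is a heavily reduced sketch. `SketchReader` and its method names are hypothetical stand-ins, not the actual TDataReader; FixedLength, sections, and the `$skip` argument of popOffset are left out.

```php
<?php
// Hypothetical, much-reduced stand-in for the reader described above.
class SketchReader
{
    private $stream;
    private bool $bigEndian = true;
    private int $startOffset = 0;
    private array $offsets = [];

    public function __construct($stream)
    {
        $this->stream = $stream;
    }

    public function setBigEndian(bool $value): void
    {
        $this->bigEndian = $value;
    }

    // All later positions are measured from this absolute offset.
    public function setStartOffset(int $offset): void
    {
        $this->startOffset = $offset;
    }

    public function tell(): int
    {
        return ftell($this->stream) - $this->startOffset;
    }

    public function seek(int $offset): void
    {
        fseek($this->stream, $this->startOffset + $offset);
    }

    public function readUInt16(): int
    {
        return unpack($this->bigEndian ? 'n' : 'v', fread($this->stream, 2))[1];
    }

    // Remember the current position and optionally jump elsewhere.
    public function pushOffset(?int $next = null): void
    {
        $this->offsets[] = $this->tell();
        if ($next !== null) {
            $this->seek($next);
        }
    }

    // Return to the most recently remembered position.
    public function popOffset(): void
    {
        $this->seek(array_pop($this->offsets));
    }
}
```

This is the access pattern a TIFF style directory needs: follow a pointer to read a value stored elsewhere, then come back and continue with the next field.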

The TDataWriter does some very nifty things:

0) It writes according to the endian format just like reading does.
1) It has public const MEMORY for php://memory and public const TEMP_FILE for php://temp. This allows developers to more easily work with temporary files and memory.
2) Special private data locations can be designated as "no write" zones. Once all the private spaces are registered, the write spaces are initialized with initWriteSpaces (for the TDataWriter alloc method to function properly).
3) The alloc($size) method is for allocating write space to write within the TDataWriter.
4) Each data type can be written with or without first being allocated; the position of writing a data type is returned. EXIF pre-allocs the data it writes, but other systems may not.
5) Writable spaces can have a minimum size (4 is default, based upon the TIFF standard).
6) Private Spaces, Write Spaces, minimum write space size, and "allocate each write" are stored into the LIFO stack on push/popSection. So each Section can have its own write variables.

To the EXIF standard. The way I am setting this up is with the folder Prado\IO\TIFF. A TImageFileField holds the data for one field/tag entry. It has a value, a data (value), and an offset. The data (value) is for pointers like StripOffsets, where the values are the pointers to the file locations but the data itself needs to be associated. A TImageFileField reads and writes its own binary data from a TDataReader/Writer.

TImageFileDirectory implements the basic reading and writing of a directory and field/tags. It is a TMap of TImageFileFields. It looks like a map of values because the TImageFileFields are transparent. If the negative tag is given to TImageFileDirectory, it returns the actual TImageFileField rather than the value. This makes it very easy to access the values but also easy to access the TImageFileField class container itself.
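The negative tag convention could look roughly like this. `SketchField` and `SketchDirectory` are hypothetical stand-ins built on ArrayAccess rather than TMap, purely to show the lookup rule; tag 0x010F (Make) with type 2 (ASCII) is used as sample data.

```php
<?php
// Hypothetical stand-ins for TImageFileField and TImageFileDirectory.
class SketchField
{
    public int $tag;
    public int $type;
    public $value;

    public function __construct(int $tag, int $type, $value)
    {
        $this->tag = $tag;
        $this->type = $type;
        $this->value = $value;
    }
}

class SketchDirectory implements ArrayAccess
{
    private array $fields = [];

    public function addField(SketchField $field): void
    {
        $this->fields[$field->tag] = $field;
    }

    public function offsetExists($tag): bool
    {
        return isset($this->fields[abs($tag)]);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($tag)
    {
        $field = $this->fields[abs($tag)] ?? null;
        if ($field === null) {
            return null;
        }
        // A negative tag asks for the field object; a positive tag asks for its value.
        return $tag < 0 ? $field : $field->value;
    }

    public function offsetSet($tag, $value): void
    {
        if (!isset($this->fields[abs($tag)])) {
            throw new OutOfBoundsException('Unknown tag ' . $tag);
        }
        $this->fields[abs($tag)]->value = $value;
    }

    public function offsetUnset($tag): void
    {
        unset($this->fields[abs($tag)]);
    }
}
```

One limitation of this scheme in the sketch is that tag 0 cannot be negated, so its field object would need another accessor.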

The TImageFile is the container for TImageFileDirectory (IFD), which implements the basics for holding and mapping TImageFileDirectories. This is needed so that we can access the IFD by index (0, 1), by name ('IFD0', 'IFD1'), and by IFD name ('THUMBNAIL', 'EXIF', 'GPS', etc).

With TImageFileField, TImageFileDirectory, and TImageFile defined, we can move to classes for the TIFF standard with the TTagImageFile subclass of TImageFile, and a TTagImageFileDirectory subclass of TImageFileDirectory. The TTagImageFileDirectory implements the actual tags required for TIFF, and the TTagImageFile has logic for parsing and writing TIFF files; specifically endian and endian validation of the TIFF. BTW, TIFF can be designated as either big or little endian, so it must support both formats.

TEXIF is a subclass of TTagImageFile; this adds reading/writing EXIF format then handing things off to the TIFF reader/writer. TEXIFMainDirectory is a subclass of TTagImageFileDirectory; it limits the fields of TIFF and adds three internal IFDs: Interop-IFD, EXIF-IFD, and GPS-IFD.

With three EXIF specific directories & associated tags defined, there are three classes: TEXIFDirectory contains the EXIF specific tags and names of the "ExifOffset" IFD, TEXIFGPSDirectory contains the tags and names for the GPS IFD, and TEXIFInteropDirectory contains the tags and names for the Interop IFD. Getting to this point should be relatively easy given what is in place. I have a few more tweaks to do on this to get it up to PRADO standards. It still needs php doc and unit tests. Those are on the way now that the abstraction is nearly settled.

The last bit is the most difficult. There are EXIF application extensions all over the place. There is a SubIFD tag/field for Canon cameras, there are about 4-7 other custom IFDs designated as such without much info, there are custom fields specific to the Main IFD, specific custom fields in the EXIF, specific custom fields in the SubIFD, and specifics and differentiation based upon the index of the IFD, for instance fields in SubIFD2 vs SubIFD1. This is the nightmare.

If you want to see how jumbled it is, check this out: https://exiftool.org/TagNames/EXIF.html

I am on the fence about implementing the mass of custom fields in some way. It would be great to have. But, Does this need that much more time? I'm already skimping on PNG and WebP support to complete EXIF in Reasonable Time and only one search parameter for MetaData. (I think it would be great to have, eg, "Copyright == 'myname' && (Year=='2022' || Year =='2021')" syntax for searching and triggering image-manipulation-at-publish at some point. but that may be for another iteration.)

So that is the latest update on what is going on here.

Lastly, There is a TPackBitsCompressor utility in the Prado\IO folder because the TIFF standard does call for it. I doubt it will ultimately be implemented in this iteration, but it does work and I'll probably include it because the unit tests are already written. (they are very easy). It uses preg_match to search for runs of the same character. Someone could take this and do an implementation of TIFF to read and write TIFF. At least that is the goal.
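To give a sense of how simple PackBits is, here is a rough sketch of the scheme from the TIFF 6.0 spec. It is not the TPackBitsCompressor code; the function names are made up for illustration, and it uses preg_match_all to find the runs of identical bytes.

```php
<?php
// Sketch only: hypothetical helpers, not the TPackBitsCompressor implementation.
function packBitsLiteral(string $literal): string
{
    $out = '';
    if ($literal === '') {
        return $out;
    }
    foreach (str_split($literal, 128) as $chunk) {
        $out .= chr(strlen($chunk) - 1) . $chunk; // header 0..127 = copy n+1 bytes
    }
    return $out;
}

function packBitsEncode(string $data): string
{
    $out = '';
    $literal = '';
    preg_match_all('/(.)\1*/s', $data, $matches); // maximal runs of one byte
    foreach ($matches[0] as $run) {
        $length = strlen($run);
        if ($length < 2) {
            $literal .= $run;
            continue;
        }
        $out .= packBitsLiteral($literal);
        $literal = '';
        while ($length > 0) {
            $n = min($length, 128);
            if ($n === 1) {
                $literal .= $run[0];               // leftover single byte
            } else {
                $out .= chr(257 - $n) . $run[0];   // header -(n-1) as unsigned byte
            }
            $length -= $n;
        }
    }
    return $out . packBitsLiteral($literal);
}

function packBitsDecode(string $data): string
{
    $out = '';
    $i = 0;
    $size = strlen($data);
    while ($i < $size) {
        $n = ord($data[$i++]);
        if ($n < 128) {
            $out .= substr($data, $i, $n + 1);
            $i += $n + 1;
        } elseif ($n > 128) {
            $out .= str_repeat($data[$i++], 257 - $n);
        } // 128 is a no-op
    }
    return $out;
}
```

The regex approach keeps the encoder short, but very long runs may be better handled with a plain loop if PCRE limits become a concern.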

I'd like to wrap this up, merged, and various issues closed. I have a new implementation of TCronModule to finalize and upload as well. and I'm in the middle of working on the new TEmail implementation for Prado. A minor update to the behaviors is needed as well. so yeah. I'd like to get this wrapped up.

belisoful commented 1 year ago

I'm going through ensuring the official specs are in place and I'm finding that there is more than just the GPS IFD that should be removed by default for user security. The EXIF IFD has ImageUniqueID, CameraOwnerName, BodySerialNumber, and LensSerialNumber that should be optionally (default?) scrubbed as well. Due to the use of information as data points to spy on us (Edward Snowden says EXIF is used by the NSA's XKeyscore and other spy programs), other information that should be wiped is the MakerNote, LensSpecification, LensMake, LensModel, and DeviceSettingDescription. Some lenses can get very expensive and my sense is that the default on these fields should be privacy.

belisoful commented 1 year ago

A great way of describing EXIF (and how to fit it best into PRADO) is as a 21 dimensional Rubik's cube. I want this to be solid and not need much further work. A solid final form. The TIPTC class works like a charm, solid and complete; TEXIF should follow similar core usage concepts. An implementation worthy of PRADO, of course.

More of the EXIF Tags require custom parsing than I gathered from my initial readings of the documentation. This isn't an issue. Just additional testing. The tag/Fields require a customData property for things like the character encoding of the UserComment.

There is a TEXIFCustomTagsBehavior for adding the additional known application specific tags and directories. This is a data heavy class with lots of configuration. The TEXIF class has a property for if it should be strict or not. Strict Mode doesn't add the "known Application Specific Tag" TEXIFCustomTagsBehavior.

Thus, the EXIF implementation needs to be behavior-aware, which, again, is one of the dimensions to the total of ~21. This complexity is less than if it were implemented in other ways. Also, then developers/users can add their own EXIF custom tags at run-time to the system without sub classing.

With a greater understanding of the implementation of EXIF, the requirements for the TRational are becoming clearer. The TRational is looking nice and has a subclass TURational for unsigned. TURational accounts for 32 bit systems unable to handle high integers in PHP. The continuing fraction handles NAN as [(any), 0] and a special value for INF as [-1, 0] for signed and [4294967295, 0] for unsigned (both being [0xFFFFFFFF, 0]). TRational and TURational were originally unified, with a bit to decide which one it was, but splitting them up makes more sense given the slight functionality differences and dependencies (on PHP_INT_SIZE in TURational).

And for as much code as is written (but no test units) it does seem to actually be coming together, finally.

belisoful commented 1 year ago

The memory buffering and allocation for writing TIFF is offloaded into a TDataAllocationBehavior class. It attaches to the TDataWriter. The push/popSection is behavior aware to account for this. TDataReader::fclose also signals to the behaviors to close.

As I am testing out writing data to various offsets, I just ran into this PHP bug: https://bugs.php.net/bug.php?id=52335&thanks=3

Apparently writing to a Memory Stream doesn't work the same as writing to a file stream. With a file, an fseek past the end of the file succeeds and NUL bytes fill the gap once data is written, but the same fseek fails (returns -1) for Memory and Temp File streams.

So. More updates to the TDataReader class to standardize this fseek functionality. This has been an ongoing battle of little gremlins like this. If the TDataWriter is memory or a temp file, an fseek past the end will use ftruncate to increase the size of the stream, to the same effect as with a file.

For documentation purposes: when the stream is memory or a temp file, it can fwrite at the end of the stream and the stream will grow, but it cannot fseek past the end of the stream without first using ftruncate to grow the stream to a larger size.
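A small sketch of that workaround, for anyone hitting the same bug. The helper name seekWithGrow is hypothetical and this is not the TDataWriter code; it only shows the ftruncate-then-fseek idea using standard PHP stream functions.

```php
<?php
// Sketch only: seekWithGrow() is a hypothetical helper, not the TDataWriter method.
function seekWithGrow($stream, int $offset): bool
{
    $size = fstat($stream)['size'];
    // Grow the stream first so the target offset is inside it (gap is NUL filled).
    if ($offset > $size && !ftruncate($stream, $offset)) {
        return false;
    }
    return fseek($stream, $offset, SEEK_SET) === 0;
}

$stream = fopen('php://memory', 'w+');
fwrite($stream, 'abc');
seekWithGrow($stream, 10);   // a plain fseek($stream, 10) is what fails here
fwrite($stream, 'Z');
rewind($stream);
$contents = stream_get_contents($stream);
fclose($stream);
```

After this runs, $contents holds the three original bytes, seven NUL bytes, and the trailing 'Z', which matches what a regular file would contain after the same seek and write.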

Good news, I'm about ready to start working on the unit tests for EXIF. The new utility classes (like TDataReader) are going to be very well fleshed out as well.

belisoful commented 1 year ago

I have the fseek bug fix for memory and temp streams working in TDataWriter. The hooks are in TDataReader (but do nothing there by themselves). TDataWriter then checks the stream_get_meta_data()['stream_type'] for MEMORY or TEMP, and manually increases the size of the stream when fseek-ing past the end. Refactoring TDataReader was good though. It's in a better place for it.

I'm looking at the official tiff and exif spec and their related algorithms and I think there may need to be a TBitReader and TBitWriter class for reading and writing a few bits at a time. Compression algorithms (LZW, CCITT 1D, etc) and < 8 Bits per sample per Pixel are going to need such classes. I can see other utility in having such a bit reader/writer class in PRADO for other developers to use.

We want to have such utility compared to other platforms as well.

It is not a behavior for TDataReader/Writer, so that custom readers/writers can use the class with a stream of their own choice. The Bit Reader/Writer constructs a TDataReader/Writer wrapper around the stream if it is not already one; if the stream is already a TDataReader/Writer, it reads and writes bytes in the context of that TDataReader Section.

These Bit Reader/Writer classes are very simple; taking in a stream or TDataReader/Writer, the initial location of initial byte 0 (which can be sectioned to stop reading past the specified FixedLength), and a current bit position. The Reader takes the number of bits to read (max: PHP_INT_SIZE * 8, 32 or 64 bits on their respective system) and returns an int with the bits. The Writer takes an int and the number of bits to write. I asked ChatGPT to assist in writing the main core of the bit reader and writer. It needed some nudging but it got there. It was way easier to ask ChatGPT to write the code than to do it myself.

There is only one core function in each of the Bit Reader and Writer: read bits and write bits, respectively. The Writer also has a flush to stream out any remaining buffered bits.
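To illustrate the idea (not the actual TBitReader/TBitWriter code), here is a minimal MSB-first bit writer and reader over a plain string buffer. The class names are hypothetical, the loop works one bit at a time for clarity rather than speed, and it assumes PHP 8.0+ for constructor promotion.

```php
<?php
// Sketch only: hypothetical classes, not the PRADO TBitWriter/TBitReader.
final class SketchBitWriter
{
    private string $bytes = '';
    private int $pending = 0;      // buffered bits, right aligned
    private int $pendingBits = 0;  // number of buffered bits (0..7)

    public function writeBits(int $value, int $count): void
    {
        for ($i = $count - 1; $i >= 0; $i--) {
            $this->pending = ($this->pending << 1) | (($value >> $i) & 1);
            if (++$this->pendingBits === 8) {
                $this->bytes .= chr($this->pending);
                $this->pending = 0;
                $this->pendingBits = 0;
            }
        }
    }

    // Pads the last partial byte with zero bits and returns everything written.
    public function flush(): string
    {
        if ($this->pendingBits > 0) {
            $this->bytes .= chr($this->pending << (8 - $this->pendingBits));
            $this->pending = 0;
            $this->pendingBits = 0;
        }
        return $this->bytes;
    }
}

final class SketchBitReader
{
    private int $position = 0; // current bit position from the start

    public function __construct(private string $bytes)
    {
    }

    public function readBits(int $count): int
    {
        $value = 0;
        for ($i = 0; $i < $count; $i++) {
            $byte = ord($this->bytes[$this->position >> 3]);
            $value = ($value << 1) | (($byte >> (7 - ($this->position & 7))) & 1);
            $this->position++;
        }
        return $value;
    }
}
```

Writing a 5 bit value followed by a 3 bit value fills exactly one byte, and a trailing 4 bit value is padded out to a full byte on flush.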

The unit Tests for the Bit Reader/writer are next, then the unit tests for EXIF. Then finalizing the Meta processor on the Image, then LOTS of proofing, documentation, error checking, and checkins. over 60 new classes?!? Most are behaviors, filters, and utility classes for additional functionality out of the new TAssetManager and TAsset class. This is getting there, one step at a time.

(There is an XML configuration for image manipulation on publish. This should work very well for things like generating user icons. The file virtualization feature is banger. And the ability to publish files stored in a database through template file reference is possible too. and, of course, publishing dynamically generated virtual files like TDot.)

I've been noodling on this. I think an uncompressed RGB, palette, gz/packbits (because it's so simple) TIFF reader and writer would be easy and actually helpful in proofing the TIFF-EXIF classes, as there would be something operational to work with in reading and writing. The EXIF JPEG thumbnail IFD is a simplification of an image normally encoded into an IFD within a TIFF. This is just a base level implementation, basically ripped from the already working JFIF/JFXX implementation, to have data for TIFF and EXIF testing.

Reading and writing uncompressed RGB in TIFF is easy, and so is the palette color map and pixel index data once the TBitReader/Writer is working (palettes of less than 8 bits are nice, eg 5 bit palettes to save space). The gz and packbits compress/decompress is in PHP and already complete. The packbits is super simple and implemented with preg_match_all. TIFFs usually use LZW for image compression, so the TIFF reader/writer won't be compatible with most standard compressed TIFFs. That'll be an academic exercise for someone else: implementing the LZW [de]compressor on the already organized uncompressed data.

My sense is that a Bit Reader/Writer should be included in a 2023 Asset Manager overhaul such as this. And if it is going to be included, it should be used at least once, like for the palette pixel data for 128 colors or less in a tiff.

belisoful commented 1 year ago

Oi. FYI, the fseek in TDataWriter will automatically increase the size of the file to contain the fseek regardless of being a file or memory/temp. This standardizes the behavior of fseek($offset, SEEK_END) between those stream types. Without this, the file sizes differing when fseek SEEK_END past the EOF between files and memory/temp produces different behaviors. By increasing the file size to contain any fseek in TDataWriter of any [seekable] stream, we get consistency.

This was discovered and corrected during Unit tests. yay unit tests.

If the PHP#52335 bug were fixed, no file resizing would be needed (to contain the fseek) and seeking from EOF would be standardized. alas, it's not fixed. This is just documentation for anyone else out there. This seems to be a reasonable and workable solution to the bug if/when/until the bug is fixed.

To summarize: in file streams, an fseek past the end of the file will work, but the file size/eof doesn't change until new bytes are written past the end. The skipped bytes are initialized as NUL. In memory/temp streams, an fseek past the end of the stream will fail. By calling ftruncate with the new larger size (to contain the pointer past eof), the NUL bytes are written out on seek, memory/temp streams do not fail, and the sizes of both types of streams (the eof) stay the same.

Replicating the "enlarge the file with NUL on data write when past the EOF" behavior of files for memory streams would require a ton of bending over backwards.

Any disagreements? suggestions? comments? experiences? I think this is a relatively optimal solution. For public discussion if there needs to be.

aaaaand into the Bit reader/writer unit tests....

belisoful commented 1 year ago

Brief Documentation update.

The TImageFileDirectory returns Values of its field except in the instance of a value being a TXRational.... The TXRational returns its float value rather than returning the TXRational class. The TRational and TURational are then transparent utility classes in regards to access from the Directory (IFD) TMap.

The way to access the TXRational is to get the tag/field from the Directory (IFD), then get the TImageFileField value, which will be a TXRational. Accessing a TXRational field value directly from the Directory (IFD) will return the float result of the TXRational. This avoids using functions having to deal directly with the TXRational class.

The read integer numerator and denominator are retained for writing, and it looks and acts like a float as designed.

This maintains consistency with the TIPTC class and itself to use native types. Ultimately, this will make search easier (not have to code conditions for reading T*Rational values).

Also, I'm doing what I can so that EXIF-TIFF code is easy to maintain.

belisoful commented 1 year ago

Here is the first pass with encoding an image (RGB and Palette) into a TIFF IFD in both a normal format and a JPEG format. You can get a sense of usage and background complexity.

Class TTagImageFileDirectory extends TImageFileDirectory
{
    protected function getImageColorMapData($image, $bitsPerSample)
    {
        if (!$image || imageistruecolor($image) || $bitsPerSample < 1 || $bitsPerSample > 8) {
            return [];
        }
        $color = [];
        $colorMap_r = [];
        $colorMap_g = [];
        $colorMap_b = [];
        $total = imagecolorstotal($image);
        for ($i = 0; $i < 1 << $bitsPerSample; $i++) {
            if ($i < $total) {
                $color = imagecolorsforindex($image, $i);
            } elseif ($i === $total) {
                $color = ['red' => 0, 'green' => 0, 'blue' => 0];
            }
            array_push($colorMap_r, $color['red'] << 8);
            array_push($colorMap_g, $color['green'] << 8);
            array_push($colorMap_b, $color['blue'] << 8);
        }
        return array_merge($colorMap_r, $colorMap_g, $colorMap_b);
    }

    protected function getImageStripData($image, $rowsPerStrip, $rowIndex, $palette, $bitsPerSample)
    {
        if ($palette && imageIsTrueColor($image)) {
            return false;
        }
        $writer = new TBitWriter();
        $sx = imagesx($image);
        $sy = imagesy($image);
        $hasAlpha = $palette ? count($bitsPerSample) == 2 : count($bitsPerSample) == 4;
        $alphaBits = $bitsPerSample[$palette ? 1 : 3] ?? null;
        for ($y = $rowsPerStrip * $rowIndex; $y < min($rowsPerStrip * ($rowIndex + 1), $sy); $y++) {
            for ($x = 0; $x < $sx; $x++) {
                $color = imagecolorat($image, $x, $y);
                $alpha = 0;
                if ($palette) {
                    $writer->writeBits($color, $bitsPerSample[0]);
                    if ($hasAlpha) {
                        $alpha = 127 - imagecolorsforindex($image, $color)['alpha'];
                    }
                } else {
                    $c = imagecolorsforindex($image, $color);
                    $writer->writeBits($this->bitShift($c['red'], $bitsPerSample[0] - 8), $bitsPerSample[0]);
                    $writer->writeBits($this->bitShift($c['green'], $bitsPerSample[1] - 8), $bitsPerSample[1]);
                    $writer->writeBits($this->bitShift($c['blue'], $bitsPerSample[2] - 8), $bitsPerSample[2]);
                    if ($hasAlpha) {
                        $alpha = 127 - $c['alpha'];
                    }
                }
                if ($hasAlpha) {
                    $writer->writeBits($this->bitShift($alpha, $alphaBits - 7), $alphaBits);
                }
            }
        }
        return $writer->freadall();
    }

    public function setImage($image, $options)
    {
        $dataSize = $this->_imageFile->getPixelChunkSize(); // recommended 4k-16k, less than 64k.

        $isTiles = $options['tiles'] ?? false; // unsupported
        $isAlpha = $options['alpha'] ?? false;
        $redBits = $options['redBits'] ?? 8;
        $greenBits = $options['greenBits'] ?? 8;
        $blueBits = $options['blueBits'] ?? 8;
        $alphaBits = $options['alphaBits'] ?? 8;
        $compression = $options['compression'] ?? self::CompressionNone;
        $palette = $options['palette'] ?? !imageistruecolor($image);

        $paletteCount = imagecolorstotal($image);
        $paletteColorBits = self::bitCount($paletteCount);
        if ($palette && imageistruecolor($image)) {
            return false;
        }
        if ($palette) {
            $this[self::ColorMap] = $this->getImageColorMapData($image, $paletteColorBits);
        }
        $this[self::ImageWidth] = $imageWidth = imagesx($image);
        $this[self::ImageLength] = $imageLength = imagesy($image);
        $bitsPerSample = $palette ? [$paletteColorBits] : [$redBits, $greenBits, $blueBits];
        if ($isAlpha) {
            $this[self::ExtraSamples] = [self::ExtraSamplesUnassociated];
            array_push($bitsPerSample, $alphaBits);
        }
        $this[self::BitsPerSample] = $bitsPerSample;
        $bitsPerPixel = array_sum($bitsPerSample);
        $rowsPerStrip = max(1, min($imageLength, floor($dataSize * 8 / ($imageWidth * $bitsPerPixel))));
        $this[self::Compression] = $compression;
        $this[self::PhotometricInterpretation] = $palette ? self::PhotometricInterpretationPalette : self::PhotometricInterpretationRGB;
        // Parenthesized so the alpha sample is also counted for palette images.
        $this[self::SamplesPerPixel] = ($palette ? 1 : 3) + ($isAlpha ? 1 : 0);
        $resolution = imageresolution($image);
        $this[self::ResolutionUnit] = self::ResolutionUnitInch; // centimeter, millimeter, micrometer.
        $this[self::XResolution] = $resolution[0]; // URational
        $this[self::YResolution] = $resolution[1]; // URational
        $this[self::RowsPerStrip] = $rowsPerStrip;

        $values = [];
        $counts = [];
        $data = [];
        $stripsPerImage = floor(($imageLength + $rowsPerStrip - 1) / $rowsPerStrip);
        for ($i = 0; $i < $stripsPerImage; $i++) {
            $values[$i] = 0x10000;
            $data[$i] = TTagImageFile::compressTifData($compression, $this->getImageStripData($image, $rowsPerStrip, $i, $palette, $bitsPerSample));
            $counts[$i] = strlen($data[$i]);
        }
        $this[self::StripOffsets] = $values;
        $this[-self::StripOffsets]->setData($data);
        $this[self::StripByteCounts] = $counts;
    }

    public function setThumbnailImage($image, int $quality = -1): bool
    {
        if (!$image) {
            return false;
        }
        ob_start();
        imagejpeg($image, null, $quality);
        $jpeg = ob_get_clean();
        $jpeg = TJPEGMetaHelper::toSimplifiedJPEG($jpeg);
        $this[self::JPEGInterchangeFormat] = 0; //File Location Pointer Filled on write
        $this[-self::JPEGInterchangeFormat]->setData($jpeg);
        $this[self::JPEGInterchangeFormatLngth] = strlen($jpeg);
        $this[self::NewSubfileType] = self::NewSubfileTypeThumbnailBit;
        $this[self::ImageWidth] = imagesx($image);
        $this[self::ImageLength] = imagesy($image);
        $this[self::BitsPerSample] = [8, 8, 8];
        $this[self::Compression] = self::CompressionJPEGOld;
        $this[self::PhotometricInterpretation] = self::PhotometricInterpretationRGB;
        $this[self::SamplesPerPixel] = 3;
        $this[self::XResolution] = self::BasePixelsPerInch;
        $this[self::YResolution] = self::BasePixelsPerInch;
        $this[self::ResolutionUnit] = self::ResolutionUnitInch;
        return true;
    }
}

This is my vision of how the [EXIF] TIFF IFD [TMap] should work and is now working. Where the Map key is negative, eg $this[-self::JPEGInterchangeFormat], it returns the TImageFileField class rather than the contained value. In effect, everything about how TIFF/EXIF works is internal and encapsulated by TEXIF, EXIF Directories, and the TIFF classes.
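For readers who want to see the negative-key idea in isolation, here is a stripped-down sketch using ArrayAccess. SketchField and SketchDirectory are hypothetical stand-ins for TImageFileField and TImageFileDirectory (which is a TMap, not this class); 0x0201 is the TIFF tag number for JPEGInterchangeFormat. Assumes PHP 8.0+.

```php
<?php
// Sketch only: hypothetical stand-ins, not the PRADO classes.
final class SketchField
{
    public function __construct(
        public int $tag,
        public mixed $value,
        public ?string $data = null
    ) {
    }
}

final class SketchDirectory implements ArrayAccess
{
    /** @var array<int, SketchField> fields keyed by positive tag id */
    private array $fields = [];

    public function offsetExists(mixed $tag): bool
    {
        return isset($this->fields[abs((int) $tag)]);
    }

    // Positive tag: the contained value. Negative tag: the field object itself.
    public function offsetGet(mixed $tag): mixed
    {
        $field = $this->fields[abs((int) $tag)] ?? null;
        if ($field === null) {
            return null;
        }
        return $tag < 0 ? $field : $field->value;
    }

    // Setting a value creates the field on demand.
    public function offsetSet(mixed $tag, mixed $value): void
    {
        $id = abs((int) $tag);
        if (isset($this->fields[$id])) {
            $this->fields[$id]->value = $value;
        } else {
            $this->fields[$id] = new SketchField($id, $value);
        }
    }

    public function offsetUnset(mixed $tag): void
    {
        unset($this->fields[abs((int) $tag)]);
    }
}

$ifd = new SketchDirectory();
$ifd[0x0201] = 0;                      // pointer value, filled in on write
$ifd[-0x0201]->data = 'jpeg bytes';    // attach the data to the field itself
```

The map reads like plain values for everyday use, while the negated key gives access to the container when the associated data or offset is needed.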

By simply setting the TIFD-TMap key value, it creates a TImageFileField and sets the ID/tag #, the type (from the preset), and the array count/#components. A custom TImageFileField for a non-preset tag ID can be set as well.

Some interesting [easy] built in features: RGBA color bit size supports 1-64 bits, not just 8 bits. It is bit shifted to 8 bits RGB for PHP GD, but read image data is retained at full color resolution. Alpha is also supported for RGB and Palette images. Palette images can be 1 bit to 8 bit in size.

Only basic compression is supported: uncompressed, gzcompress, and Apple's PackBits. This is an area for others to upgrade if they want/need it after this is released. LZW, CCITT 1D, and a few other compression algorithms would be easy to implement. Tiles aren't implemented. Planar Data isn't supported here (it triples the StripOffset count: one set each for Red, Green, and then Blue data). Only unsigned int is supported (signed and float data types are unsupported initially).

All the Tag IDs (uInt16) exist as public const in the TTagImageFileDirectory and TEXIFDirectory, TEXIFMainDirectory ,TEXIFGPSDirectory, etc. Their associated values are also public const for easy access. This could be made into an Enum class on its own, but I like how compact this is.

EXIF usually uses the Thumbnail method rather than the set/get Image method. The TBitReader and TBitWriter are getting their unit tests now, so that the EXIF unit tests have a solid base to build on.

$this['ResolutionUnit'] also works besides $this[self::ResolutionUnit]. The string is converted into the integer tag id.

You can get a sense for how TEXIFX and TTagImageFileX is working.

belisoful commented 1 year ago

The TBitReader and TBitWriter unit tests are passing. They are uniform to each other, work as expected, and do not add too much overhead to the special case of writing aligned bytes. The overhead to read and write is a min(), 3 bit shifts, a bitwise &, a bitwise |, and a few integer math operations (+/-). The initial readBits and writeBits algorithm was generated by ChatGPT. It provided a reasonable base-line function. It had a few logic errors (not subtracting a variable here or there) but it almost got it.

TBitReader can take string data, a php Stream (wrapped in a TDataReader inside the reader), or a TDataReader. TBitWriter can take a null, phpStream (wrapped in a TDataWriter inside the writer), or a TDataWriter. Null generates an internal MEMORY php stream wrapped in TDataWriter.

2023, PRADO gets bit reader/writer IO. LZW and CCITT 1d need this for TIFF, but those two algorithms are not in scope for this iteration. an IO class for TLZWCompressor and TCCITT1DCompressor can be implemented from the TIFF documentation when the time comes. https://web.archive.org/web/20210108174645/https://www.adobe.io/content/dam/udp/en/open/standards/tiff/TIFF6.pdf

ahhh... finally moving to the Tiff and EXIF unit tests. The Custom App Tag behavior is the only remaining "complex" aspect.

belisoful commented 1 year ago

I'm getting some great configuration options for TIFF and making sure it is conformant for EXIF.

Interestingly, if EXIF-TIFF is reading only 16 bit, 32 bit, or 64 bit values from RGB[A], then it reads with endian specified format.

For simplicity, the TBitReader/Writer is going to need a special mode for these special cases. This means some regression and regression unit tests.

Tiff can read/write to unsigned int, signed int, and even float formats when the bits are 8/16/32/64 aligned. That is a possibility in TIFF: to read alternate data formats besides just unsigned int.

So. Of course. I'm looking at the TDataReader and TDataWriter. There is no built in FP16 or FP8 support in PHP, eg pack and unpack do not have FP16 or FP8 options. These formats are particularly useful for graphics and A.I. computation.

Given the year 2023, PRADO should have an FP16 and FP8 option for reading and writing binary data. So, more regression. This should be simple enough though.

belisoful commented 1 year ago

The code for half and mini floats is done but not the unit tests [yet]. The half and mini float read/write is a functional point for compatibility of PRADO with new AI systems. There are two interesting functions. One takes a PHP float and encodes it into a lower bit ranged float storage format, the other is the reverse. The storage format is an int containing the bits of the float, represented by a sign bit, arbitrary exponent bits, and arbitrary mantissa bits. So we can encode and decode 24 bit floats, 20 bit floats, 16 bit floats, 10 bit floats, etc. FP16 is the default format, and Binary16, FP8-E5M2 and FP8-E4M3 have functions as well.
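As an illustration of the decode direction only, here is a sketch that assumes an IEEE-754 style layout (sign bit, then exponent bits with the usual bias, then mantissa bits, with an all-ones exponent meaning INF/NAN). The function name is hypothetical and this is not the TDataReader method; formats that deviate from IEEE conventions for INF/NAN would need their own special cases.

```php
<?php
// Sketch only: hypothetical helper assuming IEEE-754 style conventions.
function decodeSmallFloat(int $bits, int $expBits = 5, int $manBits = 10): float
{
    $sign = ($bits >> ($expBits + $manBits)) & 1;
    $exponent = ($bits >> $manBits) & ((1 << $expBits) - 1);
    $mantissa = $bits & ((1 << $manBits) - 1);
    $bias = (1 << ($expBits - 1)) - 1;

    if ($exponent === (1 << $expBits) - 1) {
        $value = $mantissa === 0 ? INF : NAN;               // all-ones exponent
    } elseif ($exponent === 0) {
        $value = $mantissa * 2.0 ** (1 - $bias - $manBits); // zero and subnormals
    } else {
        $value = (1 + $mantissa / (1 << $manBits)) * 2.0 ** ($exponent - $bias);
    }
    return $sign ? -$value : $value;
}
```

With the defaults (5 exponent bits, 10 mantissa bits) this reads FP16; passing 5 and 2 would read an E5M2 style 8 bit float under the same assumptions.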

If the class utility for reading and writing data with a stream isn't needed, these floating encoder/decoder functions could be useful for AI or other graphics code.

I'm proofing the Tag Configurations and there are many interesting specifications and extensions to TIFF. EXIF is but one; the important one for PRADO publishing that I'm looking to ensure implementation.

Some other TIFF specs do affect the basic implementation and should be taken into account [only once]. For instance, there is a StripRowCounts in TIFF-FX that allows a differing number of rows in each strip rather than a defined RowsPerStrip. This is part of the Internet Fax extension for TIFF. StripRowCounts is very easy to implement at this stage and is part of the tags and custom tags. The Reader will utilize this; the TIFF image writer will not (it could easily be added, though such specialization is not needed at this point).

Here are the TIFF resources in one place: TIFF 6 Official Final Spec 1992: https://web.archive.org/web/20210108174645/https://www.adobe.io/content/dam/udp/en/open/standards/tiff/TIFF6.pdf

Adobe PageMaker Tech Note No.1 (1995) https://awaresystems.be/imaging/tiff/specification/TIFFPM6.pdf Defines the SubIFD Tag and IFD as data type = 13. and is recommended for IFDs. I would have made IFD a bit outside the lower nibble, so not 13, but 20, which is a "LONG" with [16] 0x10 OR-ed.

Adobe Photoshop Tech Note No.2 (2002) https://awaresystems.be/imaging/tiff/specification/TIFFphotoshop.pdf

TIFF-FX File Format for Internet Fax https://www.ietf.org/rfc/rfc2301.txt

EXIF Official Specification https://web.archive.org/web/20190624045241if_/http://www.cipa.jp:80/std/documents/e/DC-008-Translation-2019-E.pdf

TIFF Annotation Specification: https://web.archive.org/web/20050309141348/https://kofile.com/support%20pro/faqs/annospec.htm

Adobe Digital Negative Specifications [DNG] (2021) https://helpx.adobe.com/camera-raw/digital-negative.html

GeoTIFF https://web.archive.org/web/20160814180021/http://www.remotesensing.org/geotiff/spec/geotiff2.4.html

Aware Systems Tag Reference https://awaresystems.be/imaging/tiff/tifftags.html

EXIFTool Tag Reference https://exiftool.org/TagNames/EXIF.html

I'll add others if/when I find them, and please comment if there are other EXIF-TIFF extensions you think should be included in the baseline Exif-TIFF implementation. Your feedback is what these comments are for. Also documenting progress.

I suspect that HALFFLOAT (16 bit float) and MINIFLOATRANGE/MINIFLOATPRECISION (8 bit float) data types are on the way.

The coming PRADO implementation of TIFF and EXIF is very reasonable, straight forward, self-contained, outwardly transparent, well abstracted, and heavily dependent upon configuration. One of my goals is to have the low maintenance cost. The code is very straight forward, reasonable, and tight, but it's taken a LOT to get it to this point. It's a very fun and interesting challenge and I hope this is one of the better EXIF implementations. I wouldn't expect anything less of PRADO.

There is going to be a "CustomAppTag" class for all the private tags that will be accounted for (but not "final spec"ed).

Ironically, All this effort is going into making EXIF work as simply and seamlessly as possible.

belisoful commented 1 year ago

Once the Tiff IFD and Field reader is functional (as it is now -[ahem, mostly]), reading TIFF is actually quite simple. [with the right tools, of course]. Making a GD image from a Tiff structure is minor compared to making IFD/Fields work properly [deconstructing the official specs].

The EXIF & TIFF implementation is becoming functional very nicely. I am very happy with how simple it is designed to be used yet internally complex enough to CRUD such an (open) expansive standard.

Those who want to filter EXIF on upload (or within their website) should be very happy with this implementation. It's very PRADO like and uses PRADO core functionality (Behaviors) that other implementations couldn't bring. Using Behaviors for all the additional Custom App Tags allows the actual implementation to stay very clean and as close as possible to the official spec.

eg the EXIF-TIFF fields are not "polluted" with random custom app tags.

This is a very clean implementation of EXIF-TIFF.

belisoful commented 1 year ago

In working with EXIF-TIFF and the BitReader/Writer, there are a few bit-wise utility functions that should be abstracted.

So, There is a new class TBitUtility for the following:

static function colorBitShift(int $value, int $inBits, int $outBits)

public static function reverseBits(int $n, int $nbit): int

reverseByte, reverseShort, reverseLong, and reverseLongLong uses bit operations to reverse the bits in O(1).

public static function bitCount(int $n): int

public static function isNegativeFloat(float $value): bool Looks at the signed bit of a float to check for being negative. This returns true on negative 0, where there is no other method in typical PHP to check for negative 0. various components of the Image filtering on publish use -0 to mean the right or bottom of the image.

Putting these in a utility class makes sense. Are there any other bit functions to include?

These would need unit tests anyway, so putting them into their own utility class opens them up for the rest of the platform.
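As a rough idea of what two of these could look like, here is a sketch using the signatures listed above as free functions. The bodies are my illustration only (a simple loop rather than the O(1) bit tricks), not the TBitUtility code.

```php
<?php
// Sketch only: illustrative bodies for the signatures above, not TBitUtility itself.
function reverseBits(int $n, int $nbit): int
{
    $out = 0;
    for ($i = 0; $i < $nbit; $i++) {
        $out = ($out << 1) | (($n >> $i) & 1); // lowest input bit becomes highest
    }
    return $out;
}

function isNegativeFloat(float $value): bool
{
    // 'E' packs a double big endian, so the sign bit is the top bit of byte 0.
    return (ord(pack('E', $value)[0]) & 0x80) !== 0;
}
```

Checking the sign bit directly is what distinguishes -0.0 from 0.0, since the two compare as equal with the normal comparison operators.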

belisoful commented 1 year ago

The bit reader and writer have their unit tests. This took a bit longer because I was making sure it was conformant to the TIFF standard. I was working on the TIFF-EXIF code concurrently as well.

The Bit reader/writer has a few very interesting options. Bit Readers/writers typically only reads and writes unsigned int bits, but TIFF needs to be able to interpret those bits as signed and possibly as floats. So there is a parameter for how to format the bits, default Unsigned Int. Signed bit format extends the highest bit for being negative (twos complement). Float only works with 8/16/32 bit. These are the only bit sizes with defined Exponent bits and Mantissa bits (the fraction).

If someone knows how to read a 12 bit float and/or a 27 bit float (that is, how many bits go to the exponent and mantissa given the number of "float" bits to interpret), I'll gladly update the code right now. There are 2 fun methods in TDataReader/Writer (one in each) for encoding and decoding a float into/from a lower bit format by specifying the value, the number of exponent bits, and the number of mantissa bits. Any arbitrary number of bits will work, so it can easily process a 12 bit, 27 bit, or N-bit encoded "float" when the format (exponent/mantissa) is defined.

Another feature is that float bit values can optionally be converted into integers and back. If a reader wants the raw floats for special computations/etc, they can have them. For the typical implementation, though, the float pixel data RGB[A] (between [0..1]) is converted into an int [0..(1 << $bits)-1]. Fun times.
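The signed format and the float-to-int conversion described above can be sketched like this. The function names are hypothetical, $raw is assumed to be already masked to $bits bits, and $bits is assumed to be smaller than the native int width.

```php
<?php
// Sketch only: hypothetical helpers, not the TBitReader options themselves.

// Interprets the low $bits bits of $raw as a two's complement signed value.
function signExtend(int $raw, int $bits): int
{
    $signBit = 1 << ($bits - 1);
    return ($raw ^ $signBit) - $signBit;
}

// Maps a float sample in [0..1] onto the integer range [0..(1 << $bits) - 1].
function unitFloatToInt(float $value, int $bits): int
{
    $max = (1 << $bits) - 1;
    $clamped = max(0.0, min(1.0, $value));
    return (int) round($clamped * $max);
}

// The reverse mapping, back to a float in [0..1].
function intToUnitFloat(int $value, int $bits): float
{
    return $value / ((1 << $bits) - 1);
}
```

Clamping before scaling is a design choice in this sketch so that out-of-range float samples do not overflow the target bit width.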

Furthermore, an option exists for taking 16 bit and 32 bit data and reversing the byte order for little-endian (Intel) data streams.
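
For reference, reversing the byte order of 16 bit and 32 bit values can be sketched with plain shifts or with `pack`/`unpack`. The function names are hypothetical, not the PRADO option itself, and a 64-bit PHP build is assumed.

```php
<?php
// Hypothetical sketch: swap byte order of 16 bit and 32 bit unsigned values.
function swapBytes16Sketch(int $value): int
{
    return (($value & 0xFF) << 8) | (($value >> 8) & 0xFF);
}

function swapBytes32Sketch(int $value): int
{
    // Pack as big-endian ('N'), then read the same bytes as little-endian ('V').
    return unpack('V', pack('N', $value))[1];
}
```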

There is also an option for reversing the individual bits of each byte, so it reads from the Least Significant Bit first rather than from the Most Significant Bit first. So, the Bit IO can read and write bits in either direction (right handed for MSB First, and left handed for LSB First). The TIFF documentation says that reading from LSB first is optional. However, the TIFF Spec is a great proving ground for making the highest functioning utility classes possible.

The bit and byte reversals and bit format are basically "trivial" once they are completely understood from the TIFF Spec.
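
Reversing the bits within a byte, which turns an LSB-first stream into an MSB-first one, can be sketched like this. The name `reverseBits8Sketch` is hypothetical; a real reader would more likely use a 256-entry lookup table.

```php
<?php
// Hypothetical sketch: mirror the 8 bits of a byte (bit 0 becomes bit 7, etc.).
function reverseBits8Sketch(int $byte): int
{
    $out = 0;
    for ($i = 0; $i < 8; $i++) {
        // Take bits from the low end of $byte and push them onto $out,
        // so the first bit taken ends up as the highest bit.
        $out = ($out << 1) | (($byte >> $i) & 1);
    }
    return $out;
}
```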

The TBitReader, TBitWriter, TDataReader, and TDataWriter are "must have" classes for PRADO 2023. Developing these with TIFF-EXIF Spec in mind makes these classes much easier to implement, as I have a goal, orientation, and requirements. Very glad to be of service here, even if it's taking a while to get it "right"/"bestest".

Lastly, I wanted to post the new TAssetManager publishAsset code so people can get a sense of what it is, how it works, and why. Keep in mind that "publishAsset" is already a defined method in TComponent.

/**
 * Publishes an asset or file path. When the asset is a file path string,
 * {@link onDiscoverClass} is raised to find the right IAsset class for the
 * file.
 * This method will write the IAsset to a web accessible directory
 * and returns the URL for the published asset.
 * If the application is not in performance mode, the file modification
 * time will be used to check whether the published file is up to date.
 * If changed or not published, an asset will be written for publishing.
 *
 * @param \Prado\Web\Asset\IAsset|string $asset the asset to be published, or filepath
 * @param bool $checkTimestamp If true, file modification time will be checked even if the application
 *   is in performance mode, default false.
 * @param ?string $predst the predetermined destination path, used internally
 *   for publishing directories.
 * @throws TInvalidDataValueException if the file path to be published is
 *   invalid
 * @throws TInvalidDataValueException when the $predst is not in the BasePath or
 *   when the AssetPath is empty (but not null, as null just returns a blank string).
 * @return string an absolute URL to the published asset
 * @since 4.2.2
 */
public function publish($asset, bool $checkTimestamp = false, ?string $predst = null): string
{
    $assetString = null;
    if (is_string($asset)) {
        if (isset($this->_published[$asset])) {
            [$url, $dst] = $this->_published[$asset];
            return $url;
        }
        $assetString = $asset;
        $asset = $this->ensureAsset($asset);
    } elseif (!($asset instanceof IAsset)) {
        $asset = $this->ensureAsset($asset);
    }
    $path = $asset->getAssetFilePath();
    if ($path === null) { //publish was cancelled.
        if ($assetString) {
            $this->_published[$assetString] = ['', ''];
            $this->_publishedAssets[$assetString] = $asset;
        }
        return '';
    }
    $assetString = rtrim($assetString ?? $path, DIRECTORY_SEPARATOR); // $path may end in DIRECTORY_SEPARATOR if it's a directory.
    $path = $asset->dyAlterAssetFilePath($path); // TTarAsset, TAssetJPEGize, TAssetPNGize
    if ($isDir = ((substr($path, -1) === DIRECTORY_SEPARATOR))) {
        $path = rtrim($path, DIRECTORY_SEPARATOR);
    }
    if (isset($this->_published[$assetString])) {
        [$url, $dst] = $this->_published[$assetString];
        if ($asset instanceof IAssetPublishedCapture) {
            $asset->setPublishedPath($dst);
            $asset->setPublishedUrl($url);
        }
        return $url;
    } elseif (empty($path) || ($fileName = basename($path)) == '') {
        throw new TInvalidDataValueException('assetmanager_filepath_invalid', $path);
    } else {
        if ($predst && strncmp($this->_basePath, $predst, strlen($this->_basePath)) !== 0) {
            throw new TInvalidDataValueException('assetmanager_dst_not_in_basepath', $predst); //security check
        }
        $dir = $this->hash($isDir ? $path : dirname($path));
        $url = $this->_baseUrl . ($predst ? str_replace(DIRECTORY_SEPARATOR, '/', substr($predst, strlen($this->_basePath))) :
            '/' . $dir . ($isDir ? '' : '/' . $fileName));
        $dst = $predst ? $predst : $this->_basePath . DIRECTORY_SEPARATOR . $dir . ($isDir ? '' : DIRECTORY_SEPARATOR . $fileName);
        if ($asset instanceof IAssetPublishedCapture) {
            $asset->setPublishedPath($dst);
            $asset->setPublishedUrl($url);
        }
        $this->_published[$assetString] = [$url, $dst];
        $this->_publishedAssets[$assetString] = $asset;
        $moddst = $asset->dyAlterModifationFilePath($dst . ($isDir ? DIRECTORY_SEPARATOR : '')); //For TTarAsset, map to the dst md5 file
        if ($isModDir = ((substr($moddst, -1) == DIRECTORY_SEPARATOR))) {
            $moddst = rtrim($moddst, DIRECTORY_SEPARATOR);
        }
        if (($isModDir && !is_dir($moddst) || (!$isModDir && !is_file($moddst))) || $checkTimestamp || $this->getApplication()->getMode() !== TApplicationMode::Performance) {
            $this->produceAsset($asset, $dst, $path);
        }
        return $url;
    }
}

public function ensureAsset($asset): IAsset
{
    if (is_string($asset)) {
        $realpath = TAsset::virtualpath($asset); //cleans up '', relative paths, etc
        if (empty($asset) || ($realpath !== false && strncmp($this->getBasePath(), $realpath, strlen($realpath)) === 0)) {
            throw new TInvalidDataValueException('assetmanager_filepath_invalid', $asset); //assets cannot copy parent, recursive.
        }
        $class = $this->onDiscoverClass($this->getDefaultAssetClass(), $asset);
        if (!class_exists($class) || !is_a($class, '\Prado\Web\Asset\IAsset', true)) {
            throw new TInvalidDataTypeException('assetmanager_invalid_class', $class);
        }
        $asset = new $class($asset);
    } elseif (!($asset instanceof IAsset)) {
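        // get_class() only accepts objects, so fall back to gettype() for scalars and arrays.
        throw new TInvalidDataTypeException('assetmanager_invalid_asset', is_object($asset) ? get_class($asset) : gettype($asset));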
    }
    return $asset;
}

protected function produceAsset($asset, $dstFilePath, $vsrc = null)
{
    $dst = dirname($dstFilePath);
    if (!is_dir($dst)) {
        @mkdir($dst);
        @chmod($dst, Prado::getDefaultPermissions());
    }
    $dstMod = @filemtime($asset->dyAlterModifationFilePath($dstFilePath));
    if ($dstMod === false || $dstMod < $asset->getAssetModificationDate()) {
        if (!$vsrc) {
            $vsrc = $asset->getAssetFilePath();
        }
        Prado::trace("Publishing asset $vsrc to $dstFilePath", 'Prado\Web\TAssetManager');
        if ($asset->publish($dstFilePath) === null && $asset->hasEvent('onProcessAsset')) {
            $asset->onProcessAsset($dstFilePath);
        }
    }
}

//   In TAsset
public function publish(string $dst)
{
    if (($src = $this->getAssetFilePath()) === false) {
        throw new TInvalidDataValueException('asset_assetfilepath_invalid', $src);
    }
    if ($src === null) {
        return false;
    }
    if (empty($dst)) {
        throw new TInvalidDataValueException('asset_dst_filepath_invalid', '');
    }
    if (($return = $this->dyWriteAsset(0, $dst)) !== 0) {
        return $return;
    }
    $tmpFile = $dst;

    // Directories are copied
    if (substr($src, -1) === DIRECTORY_SEPARATOR) {
        $this->copyDirectory(rtrim($this->getAssetOriginalFilePath(), DIRECTORY_SEPARATOR), $dst);
        $this->onProcessAsset($dst);
        return true;
    } else {
        $uid = str_pad(sprintf('%x', crc32((string) microtime(true))), 8, '0', STR_PAD_LEFT);
        $tmpFile = dirname($dst) . DIRECTORY_SEPARATOR . 'tmp-' . $uid . '.' . basename($dst);
    }
    if ($this->writeAsset($tmpFile)) {
        $this->onProcessAsset($tmpFile);
        if (@chmod($tmpFile, Prado::getDefaultPermissions()) && $tmpFile !== $dst) {
            @rename($tmpFile, $dst);
        }
        return true;// onProcessAsset already raised, return true
    }
    return false;
}

It is important to note that publishing an Asset produces the Asset into a temp file and raises an OnProcessAsset event to process the temp file with any filters, like image filters [resizing, masking, orientation, etc.], scrubbing EXIF data (like GPS data and photographer identifying info [like camera and lens serial numbers]), or other asset publishing event handlers. The PRADO Asset Publishing mechanism has needed a per-Asset publishing event like this since the beginning, and it wouldn't be possible without putting the file into temp first, editing the temp, then renaming it to the final output name. Many clients could be "working" on publishing the same asset at the same time, and collisions are handled gracefully.
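
The write-to-temp, process, then rename pattern can be sketched in isolation as follows. The function `publishViaTempFileSketch` and its `$process` callback are hypothetical stand-ins for `writeAsset` and the OnProcessAsset handlers, and the temp name uses `random_bytes` rather than the `crc32(microtime())` of the posted code.

```php
<?php
// Hypothetical sketch: produce into a temp file, let a callback filter it,
// then rename it into place so readers never see a half-written file.
function publishViaTempFileSketch(string $dst, string $data, ?callable $process = null): bool
{
    // Keep the temp file in the destination directory so rename() stays on one filesystem.
    $tmp = dirname($dst) . DIRECTORY_SEPARATOR . 'tmp-' . bin2hex(random_bytes(4)) . '.' . basename($dst);
    if (file_put_contents($tmp, $data) === false) {
        return false;
    }
    if ($process !== null) {
        $process($tmp); // stand-in for the per-asset processing event
    }
    if (!rename($tmp, $dst)) {
        @unlink($tmp);
        return false;
    }
    return true;
}
```

Because each caller writes to its own uniquely named temp file, two concurrent publishers do not overwrite each other's partial output; the last rename simply wins.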

ensureAsset raises the event OnDiscoverClass so the proper TAsset subclass (as a [namespace with class] string) can be identified for publishing a specific asset. E.g., if a specific path is used to identify Database Assets, this event can read the File Path and give back (e.g.) a TAppDBFileAsset rather than the default TFileAsset. This allows templates to publish alternate files with (e.g.) <%~ /DBConnection/myDBpath/dbFolder/FileStoredInDb.jpg %> by configuring a TAssetDiscovery Behavior. The TAssetDiscovery behavior keys from a preg_match and returns its configured class.
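
The pattern-to-class lookup can be sketched as a plain function. This is not the TAssetDiscovery behavior itself; `discoverAssetClassSketch`, the pattern map, and the regular expression are hypothetical, and only the class names TAppDBFileAsset and TFileAsset come from the description above.

```php
<?php
// Hypothetical sketch: return the first configured class whose pattern matches the path.
function discoverAssetClassSketch(string $path, array $patternToClass, string $defaultClass): string
{
    foreach ($patternToClass as $pattern => $class) {
        if (preg_match($pattern, $path) === 1) {
            return $class;
        }
    }
    return $defaultClass; // no pattern matched: fall back to the default asset class
}

// Assumed configuration: paths under /DBConnection/ are database assets.
$sketchMap = ['#^/DBConnection/#' => 'TAppDBFileAsset'];
```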

TAssetDiscovery and attachment (to TAssetManager) can be implemented entirely in the app configuration.

The code works the way it does because not all publishing files exist in the file system prior to being produced. Virtual assets (like TDot #831) and Database assets will not exist in the file system until they are produced.

Another important detail (not exactly specified in the code above) is that when publishing directories, the internal format is to use a trailing DIRECTORY_SEPARATOR to specify that it is a directory to be published rather than a file. Publishing directories works slightly differently than publishing files. Calling the publish method to publish a directory doesn't need the trailing slash, but the internals of asset publishing do put it there for directories.

This is a huge change for publishing assets that should be 100% transparent if none of the extra features are used/needed. This adds a minor amount of overhead to publishing files compared to before; however, the set of features it enables is massive. PRADO Asset Publishing has needed a per asset publishing event for a long, long time. The publish asset event necessitates putting the file into a temp file.

Regarding premade asset classes: TFileAsset, TGDFAsset, and TTarAsset are complete. TFileAsset allows for publishing of system files and virtualization. TGDFAsset allows for GD Font files to be rewritten for the proper system byte endian format. TTarAsset is a bit obvious, but it does do some interesting manipulations because it is a [TAR] file, publishes as a directory, and the time stamp is checked against a second MD5 file rather than the TAR or directory.

The "JPG-izer"/XXXizer also uses some of these (TAR publishing) functions to make the end result file look like a JPEG/XXX when it's publishing a PNG/WebP/etc. but before actual conversion into the final "XXX-ized" publishing image format. There are classes for JPGize, PNGize, WebPize, GIFize, BMPize, XBMize, and WBMPize. These converter filters can do all files or a specific set of files (depending on file path or even meta data). (insert "WOW" meme sound).

Making [image] filters dependent upon [image] meta data is one of the more interesting features here and one reason why a proper PRADO EXIF implementation is important. TIPTC is simple enough to be implemented in just one class. EXIF will require 11 classes: 5 base classes (Base Field, Base Directory, Base File, Tag Directory [the main implementation of TIFF], and Tag File) and 6 EXIF classes (EXIF [the EXIF file, which extends the TIFF file], EXIF Main Directory [adds only two fields to the Tag Directory], EXIF Sub Directory [where the primary EXIF tags exist], EXIF GPS Directory, EXIF Interop Directory, and EXIF Custom App Tag Behavior for the plethora of 3rd party tags).

EXIF in fewer than 11 classes could be done but would not be properly abstracted.

I'm proofing and writing the documentation for the base and EXIF classes now. Yay!!! It's taken a while to get to this point. It's looking awesome.

Again, my humble apologies for not posting this sooner, for not getting feedback on this sooner, and for this taking as long as it is. What was a 3 week excursion into Asset Publishing turned into many months. Lots of good utility classes are coming from it though.

This is very very exciting.

We've needed a per "asset publish event" in PRADO since the beginning. It was only a matter of how to implement it.

belisoful commented 1 year ago

There is a lot of code for all the new functionality of the TAssetManager behaviors. Everything is being unit tested. Just like Cron, which added a large number of new unit tests (particularly for TTimeScheduler), this adds a lot of tests too. So far, the total is 1744, excluding EXIF-TIFF. The graphical tests use low pixel counts to reduce testing time. In some instances they validate each pixel against a generated reference image.

Documenting EXIF-TIFF is, comically, an enterprise by itself. As the methods are proofed, they get documented. There is a lot to document beyond everything else already complete.

belisoful commented 1 year ago

Yes. The TIFF-EXIF code is taking a while. The reading and writing is actually fairly simple. The utilities have stiff requirements and are complete, e.g. TBitReader/Writer having both MSB and LSB first capabilities. Many Internet Faxes via TIFF use LSB First data because it's standard in physical transmission (e.g. modems/faxes). I'll be checking in the new utility classes very soon to upgrade things and prepare for the updated Asset Manager.

There are some nuances to the fields that other libraries may not account for as they are deep in the TIFF specification, extensions, and Tech Notes. The nuances are not difficult, just complex to understand the spec and translate into code.

The code for EXIF-TIFF is actually very straightforward and simplified/abstracted from the spec. As such, it is easy to maintain and debug. It is also as complete as possible, as it should be if it's going to be part of a platform like PRADO. That's a reason why this is taking a while.

I see the new PRADO TIFF-EXIF capabilities as another selling, usage, and integration point.

belisoful commented 1 year ago

One thing is sometimes not like the other. IPTC is not like TIFF-EXIF.

I am happy with how the EXIF Directory and Field work (reading, writing, setting, interoperating), but the management of IFDs within the EXIF-TIFF, especially adding new IFDs while simultaneously keeping track of all their naming, is the current engineering issue. Making the (child IFD) Fields work with the TImageFile to track names is one of the last pieces.

belisoful commented 1 year ago

The EXIF, TIFF, and 3rd party app tag data configuration and directory management is working nicely. One of the nicely synchronistic things is that the 3rd party app tag data is a perfect behavior to open up and implement as Class-behavior attachments initialized from the PRADO configuration. This will make customizing EXIF or TIFF for an application as simple as defining the new TIFF Extension tag in the PRADO application configuration.

Use case: An application embeds its own defined (class) IFD of field data. The PRADO application.xml would define the TTiffExtension behavior to embed the application defined IFD class to register the parent directory, tag id, tag name, class, and optional name. TTiffExtension is designed as a Class Behavior so it only needs one instance and has no per instance data needed. It would attach to the TImageFileDirectory and configure new tags with associated meta data.

Each TTiffExtension would define multiple tags. Each tag has a Tag ID, parent IFD Tag-ID (* for all, 0 for primary IFDn, or a specific parent ifd tag), Tag Name, Tag Type, Tag Component Count (-1 for 'n'-size, undef and ascii usually are -1), Tag Properties as bits (17 so far), and lastly property data.

Some fields in TIFF can be pointers to data in the file and they have secondary tags for the byte count. Some fields in TIFF depend on other fields, either existence or specific values. Byte count depends upon their Pointer tag. Some fields in TIFF are IFDs and consist of a tree of Image File Directories. These automatically instance as TImageFileDirectory, but can be configured to instance with a specific subclass that has logic for parsing and writing fields in specific ways.

The TTiffExtension can be subclassed to implement custom field parsing logic as a dynamic event on dyParseField.

17 properties for each individual tag:
1 - Writable - Has a data format.
2 - Mandatory - Is required to exist (maybe conditionally, but not encoded).
3 - System - Is written by the System rather than the user.
4 - Compress16 - The data type is mutable between 2 byte and 4 byte unsigned int.
5 - Compress8 - The data type is mutable between 1 byte and 2 byte; can be combined with Compress16.
6 - Bits - Is the UInt32 a bit format or a number format? Bit formats retain their bits, but on 32 bit systems, numbers more than ~2.1 billion cannot be represented as ints, so they are output as floats that can represent the value.
7 - Pointer - Is the tag a pointer to data in the file?
8 - No Allocation - Is the data pre-allocated? FreeSpace is preserved and not pre-allocated.
9 - Depends - The tag depends on the existence of another tag or a value within the tag.
10 - Not EXIF - Some TIFF tags are not found in the EXIF standard while others are. Exclusionary to enable easier extensions to EXIF.
11 - Date - The field follows the TIFF Date Format; converts PHP date strings or unix time to a date.
12 - Time - The field follows the TIFF Time Format; converts PHP date-time strings or unix time to a time. 11 and 12 can be combined into a TIFF Date-Time Format, which converts PHP date-time strings or unix time to a date-time.
13 - Time Offset - The field follows the TIFF time zone offset format.
14 - Time Subsecond - The field follows the TIFF sub second format.
15 - Parse - Applies only to UNDEF fields and allows the parent IFD to parse the field with its own logic.
16 - Final Pass - The tag values are allocated and written in a second pass when writing in two passes. Primary image pixel data in StripOffsets and TileOffsets (pointers) is written in a second pass so the data comes after all the meta data about the image.
17 - Security - The field is flagged as requiring deletion for user security. GPS, camera serial numbers, camera owner's name, lens info, computer system (e.g. "originalfilePath"), etc. are flagged.
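
Storing these properties "as bits" can be sketched with class constants. The class name, constant names, and the choice of bit position n-1 for property n are all assumptions for illustration; the thread lists the properties but not how PRADO encodes them. Only a few of the 17 are shown.

```php
<?php
// Hypothetical sketch: tag properties as bit flags (property n uses bit n - 1).
final class TagPropertySketch
{
    public const WRITABLE = 1 << 0;   // 1 - Writable
    public const MANDATORY = 1 << 1;  // 2 - Mandatory
    public const SYSTEM = 1 << 2;     // 3 - System
    public const POINTER = 1 << 6;    // 7 - Pointer
    public const DATE = 1 << 10;      // 11 - Date
    public const TIME = 1 << 11;      // 12 - Time
    public const SECURITY = 1 << 16;  // 17 - Security

    // 11 and 12 combined describe a Date-Time field.
    public const DATE_TIME = self::DATE | self::TIME;

    public static function has(int $props, int $flag): bool
    {
        return ($props & $flag) === $flag;
    }
}

// A writable date-time field that should be scrubbed for user security.
$sketchProps = TagPropertySketch::WRITABLE | TagPropertySketch::DATE_TIME | TagPropertySketch::SECURITY;
```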

belisoful commented 1 year ago

OK. In reviewing the Yii Asset Manager, the new PRADO Asset Manager is much more advanced and makes use of behaviors and dynamic events in ways that Yii simply can't.

belisoful commented 1 year ago

I'm looking at Yii2's AssetManager. Are there any features you believe should be part of a new Prado Asset Manager?

Maybe an event OnPublishAsset?

Maybe different hash algorithms for the asset directory?

Maybe the ability to publish SymLinks instead of the files themselves?

I think I can make a highly configurable Cache Busting behavior that reconfigures the asset manager better than Yii2's solution. Also, blocking files in the new asset manager is better and more configurable as well.

belisoful commented 1 year ago

Many of the utility classes are being proofed and submitted from this excursion.