heitzmann / gdstk

Gdstk (GDSII Tool Kit) is a C++/Python library for creation and manipulation of GDSII and OASIS files.
https://heitzmann.github.io/gdstk/
Boost Software License 1.0
344 stars 86 forks

Add write_gds_to_buffer function to export GDS to BytesIO #265

Closed tcosz closed 2 months ago

tcosz commented 3 months ago

Hello,

First of all, thank you very much for providing such a powerful and easy-to-use package. We need to keep the GDS data in memory to pass it as input to another Python package, but the code currently only supports exporting to a file. Adding the ability to buffer the output in memory would remove this limitation and let us use the gdstk package for preprocessing complex layout data.

Could you please consider adding this functionality to the core package? Here is a suggested implementation. Unfortunately, despite our best attempts we were not able to correctly install an editable package version with the newly compiled code, but we believe the implementation effort should be limited.

Typically, this is what we would like to be able to do in Python:

import gdstk
from io import BytesIO

# Create a new library
lib = gdstk.Library()

# Geometry must be placed in cells.
cell = lib.new_cell("TOP")

# Create the geometry (a single rectangle) and add it to the cell.
rect = gdstk.rectangle((0, 0), (2, 1))
cell.add(rect)

# Use the write_gds_to_buffer function to get the GDSII data as bytes
gds_data = lib.write_gds_to_buffer()

# Store the bytes data in a BytesIO object
gds_buffer = BytesIO(gds_data)

The main idea is, in a new write_gds_to_buffer function, to replace the first argument of Library::write_gds, const char* filename, with std::vector<char>& buffer, and to replace the file output with the following (the complete write_gds_to_buffer code is pasted further below):

    buffer.clear();
    char* buffer_data = nullptr;
    size_t buffer_size = 0;

    FILE* out = open_memstream(&buffer_data, &buffer_size);

    if (out == NULL) {
        if (error_logger) fputs("[GDSTK] Unable to open memory buffer for output.\n", error_logger);
        return ErrorCode::OutputFileOpenError;
    }

**include/gdstk/library.hpp** (inside struct Library)

    ErrorCode write_gds_to_buffer(std::vector<char>& buffer, uint64_t max_points, tm* timestamp) const;

**src/library.cpp** (inside namespace gdstk)

ErrorCode Library::write_gds_to_buffer(std::vector<char>& buffer, uint64_t max_points, tm* timestamp) const {
    ErrorCode error_code = ErrorCode::NoError;

    buffer.clear();
    char* buffer_data = nullptr;
    size_t buffer_size = 0;

    FILE* out = open_memstream(&buffer_data, &buffer_size);

    if (out == NULL) {
        if (error_logger) fputs("[GDSTK] Unable to open memory buffer for output.\n", error_logger);
        return ErrorCode::OutputFileOpenError;
    }

    tm now = {};
    if (!timestamp) timestamp = get_now(now);

    uint64_t len = strlen(name);
    if (len % 2) len++;

    uint16_t buffer_start[] = {6,
                               0x0002,
                               0x0258,
                               28,
                               0x0102,
                               (uint16_t)(timestamp->tm_year + 1900),
                               (uint16_t)(timestamp->tm_mon + 1),
                               (uint16_t)timestamp->tm_mday,
                               (uint16_t)timestamp->tm_hour,
                               (uint16_t)timestamp->tm_min,
                               (uint16_t)timestamp->tm_sec,
                               (uint16_t)(timestamp->tm_year + 1900),
                               (uint16_t)(timestamp->tm_mon + 1),
                               (uint16_t)timestamp->tm_mday,
                               (uint16_t)timestamp->tm_hour,
                               (uint16_t)timestamp->tm_min,
                               (uint16_t)timestamp->tm_sec,
                               (uint16_t)(4 + len),
                               0x0206};
    big_endian_swap16(buffer_start, COUNT(buffer_start));
    fwrite(buffer_start, sizeof(uint16_t), COUNT(buffer_start), out);
    fwrite(name, 1, len, out);

    uint16_t buffer_units[] = {20, 0x0305};
    big_endian_swap16(buffer_units, COUNT(buffer_units));
    fwrite(buffer_units, sizeof(uint16_t), COUNT(buffer_units), out);
    uint64_t units[] = {gdsii_real_from_double(precision / unit),
                        gdsii_real_from_double(precision)};
    big_endian_swap64(units, COUNT(units));
    fwrite(units, sizeof(uint64_t), COUNT(units), out);

    double scaling = unit / precision;
    Cell** cell = cell_array.items;
    for (uint64_t i = 0; i < cell_array.count; i++, cell++) {
        ErrorCode err = (*cell)->to_gds(out, scaling, max_points, precision, timestamp);
        if (err != ErrorCode::NoError) error_code = err;
    }

    RawCell** rawcell = rawcell_array.items;
    for (uint64_t i = 0; i < rawcell_array.count; i++, rawcell++) {
        ErrorCode err = (*rawcell)->to_gds(out);
        if (err != ErrorCode::NoError) error_code = err;
    }

    uint16_t buffer_end[] = {4, 0x0400};
    big_endian_swap16(buffer_end, COUNT(buffer_end));
    fwrite(buffer_end, sizeof(uint16_t), COUNT(buffer_end), out);

    fclose(out);

    // Copy the data written through the memstream into the caller's vector,
    // then release the buffer allocated by open_memstream. Without this step
    // the output vector would be returned empty.
    buffer.assign(buffer_data, buffer_data + buffer_size);
    free(buffer_data);

    return error_code;
}

**python/library_object.cpp**

static PyObject* library_object_write_gds_to_buffer(LibraryObject* self, PyObject* args, PyObject* kwds) {
    const char* keywords[] = {"max_points", "timestamp", NULL};
    PyObject* pytimestamp = Py_None;
    tm* timestamp = NULL;
    tm _timestamp = {};
    uint64_t max_points = 199;
    if (!PyArg_ParseTupleAndKeywords(args, kwds, "|KO:write_gds_to_buffer", (char**)keywords,
                                     &max_points, &pytimestamp))
        return NULL;

    if (pytimestamp != Py_None) {
        if (!PyDateTime_Check(pytimestamp)) {
            PyErr_SetString(PyExc_TypeError, "Timestamp must be a datetime object.");
            return NULL;
        }
        _timestamp.tm_year = PyDateTime_GET_YEAR(pytimestamp) - 1900;
        _timestamp.tm_mon = PyDateTime_GET_MONTH(pytimestamp) - 1;
        _timestamp.tm_mday = PyDateTime_GET_DAY(pytimestamp);
        _timestamp.tm_hour = PyDateTime_DATE_GET_HOUR(pytimestamp);
        _timestamp.tm_min = PyDateTime_DATE_GET_MINUTE(pytimestamp);
        _timestamp.tm_sec = PyDateTime_DATE_GET_SECOND(pytimestamp);
        timestamp = &_timestamp;
    }

    std::vector<char> buffer;
    ErrorCode error_code = self->library->write_gds_to_buffer(buffer, max_points, timestamp);
    if (return_error(error_code)) return NULL;

    PyObject* pybuffer = PyBytes_FromStringAndSize(buffer.data(), buffer.size());
    if (!pybuffer) {
        PyErr_SetString(PyExc_RuntimeError, "Failed to create Python bytes object from buffer.");
        return NULL;
    }

    return pybuffer;
}

Of course, having a similar feature for import_gds would be good for consistency, but at the moment the enhancement to write_gds would already help greatly.

Thank you.

dtzikas commented 3 months ago

A while ago I created a pull request to import GDS from an istream (in order to support reading gzipped files). If you think it could be related to the "missing part" here, you can have a look at the recently rebased branch.

tcosz commented 3 months ago

Thank you for sharing your rebased branch. Based on your feedback, a solution for this thread could be to use sstream (for std::ostringstream), which does not seem to add any new external dependencies.

Could the default implementation also benefit from using the stream from the start when building up the GDS data before writing it to a file? In this approach, all the procedures that export to GDS (cells, polygons, paths, etc.) would need to replace the FILE* argument with a stream argument. I am adding some documentation below for the discussion; it could be interesting to run a new benchmark with a stream-based version of the codebase.


Using std::ostream instead of FILE* for writing data can have several implications on performance and usability. Here are some points to consider:

Compatibility with Python's BytesIO

Using std::ostream is better suited to exporting data for later use with Python's BytesIO because it provides a more flexible and seamless way to handle in-memory data streams. Here's why:

  1. Seamless Data Handling:

    • std::ostream allows you to directly write data to an in-memory stream (like std::ostringstream). This data can then be easily converted to a standard container like std::vector<char>, which can be directly passed to Python as a bytes object.
  2. Direct Integration:

    • When using std::ostream, the written data can be captured directly into a string or byte array, facilitating straightforward integration with Python's BytesIO. This eliminates the need for intermediate file operations and reduces I/O overhead.
  3. Unified Interface:

    • std::ostream provides a unified interface for writing data, making the code cleaner and more modular. You can use the same function to write to files, network streams, or in-memory buffers, enhancing code reuse and maintainability.
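As a sketch of point 1, assuming a hypothetical stream_to_bytes helper, the conversion from an in-memory stream to a byte container could look like:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical glue code: capture an ostringstream's contents as a byte
// vector. In the Python wrapper this vector would feed
// PyBytes_FromStringAndSize, exactly as the write_gds_to_buffer proposal
// above does with its buffer argument.
std::vector<char> stream_to_bytes(const std::ostringstream& oss) {
    const std::string s = oss.str();
    return std::vector<char>(s.begin(), s.end());
}
```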

Performance Implications

  1. Buffered I/O:

    • Both std::ostream and FILE* typically use buffered I/O. The buffering mechanism helps in reducing the number of system calls, which improves performance when writing large amounts of data.
    • std::ostream uses the C++ standard library's buffering, while FILE* uses the C standard library's buffering. The performance difference between these two might be negligible for most applications, but it could vary based on the implementation and the specific usage scenario.
  2. Flexibility and Abstraction:

    • std::ostream provides a higher level of abstraction, allowing you to write to various types of streams (e.g., file streams, string streams, custom streams) with the same interface.
    • This flexibility can lead to more maintainable and reusable code, potentially reducing the need for performance-tuning specific to I/O operations.
  3. Inline Function Calls:

    • With std::ostream, there is often less need for explicit error checking at every call site, since errors accumulate in the stream state. This can lead to cleaner code with less error-checking boilerplate.
  4. Stream Operators:

    • For formatted text output, the stream insertion operator (<<) can be more convenient than fprintf; for binary data, std::ostream::write is the direct analogue of fwrite, with comparable performance.

Usability and Maintainability

  1. Error Handling:

    • std::ostream provides better error handling through its state flags (e.g., badbit, failbit). This can simplify error detection and handling in your code.
    • With FILE*, you often need to check return values and handle errors explicitly.
  2. Code Clarity:

    • Using std::ostream can make the code more readable and expressive. The stream interface is often more intuitive and easier to use, especially for C++ developers familiar with the standard library.
  3. Integration with C++ Features:

    • std::ostream integrates seamlessly with other C++ features like RAII (Resource Acquisition Is Initialization), smart pointers, and the standard library's algorithms and containers.
    • This integration can lead to safer and more robust code, as resource management and error handling are more straightforward.

Example: Performance Comparison

For a concrete performance comparison, you would typically need to measure the actual performance of both implementations in your specific use case. Here's a simplified example to illustrate potential differences:

Using FILE*

#include <cstdio>
#include <chrono>
#include <algorithm>  // for std::fill_n

void write_using_file(const char* filename, const char* data, size_t size) {
    FILE* file = fopen(filename, "wb");
    if (file) {
        fwrite(data, 1, size, file);
        fclose(file);
    }
}

int main() {
    const size_t size = 1024 * 1024 * 100; // 100 MB
    char* data = new char[size];
    std::fill_n(data, size, 'A');

    auto start = std::chrono::high_resolution_clock::now();
    write_using_file("output_file.bin", data, size);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    printf("FILE* write time: %.6f seconds\n", elapsed.count());

    delete[] data;
    return 0;
}

Using std::ostream

#include <fstream>
#include <chrono>
#include <algorithm>  // for std::fill_n
#include <cstdio>     // for printf

void write_using_stream(const char* filename, const char* data, size_t size) {
    std::ofstream file(filename, std::ios::binary);
    if (file) {
        file.write(data, size);
    }
}

int main() {
    const size_t size = 1024 * 1024 * 100; // 100 MB
    char* data = new char[size];
    std::fill_n(data, size, 'A');

    auto start = std::chrono::high_resolution_clock::now();
    write_using_stream("output_stream.bin", data, size);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    printf("std::ostream write time: %.6f seconds\n", elapsed.count());

    delete[] data;
    return 0;
}

Conclusion

For modern C++ development, std::ostream is generally preferred due to its integration with the standard library and better abstraction.

heitzmann commented 3 months ago

Unfortunately I can't take the time required to implement this modernization, but I'm now open to merging PRs with it (seeing as it will facilitate future contributions too).

If anyone is willing to take ownership of the project, I'm open to it as well, because I haven't been able to keep up with feature requests for the past year. @nmz787-intel @tvt173 @wshanks @jatoben @dtzikas @tcosz @joamatab

tvt173 commented 2 months ago

Thanks @heitzmann. gdspy and gdstk have been invaluable to me over the past decade or so in various projects and as the original backends for gdsfactory. However, we recently switched the backend of gdsfactory to klayout, and I have no time to maintain another open source package either. I invite anyone interested in an alternative to give gdsfactory a try though!

https://gdsfactory.github.io/gdsfactory/index.html

nmz787-intel commented 2 months ago

FYI, if you're on Linux you can get away with just using a tmpfs mount point and no code updates to GDSTK:

import subprocess
import tempfile
import os

def check_ram_disk(mount_point='/dev/shm'):
    try:
        # Execute the mount command and decode the output
        result = subprocess.run(['mount'], stdout=subprocess.PIPE)
        output = result.stdout.decode('utf-8')

        # Check if the specified mount point is mounted as tmpfs
        return f'{mount_point} type tmpfs' in output
    except Exception as e:
        print(f"Error checking RAM disk: {e}")
        return False

def create_temp_file_in_ram(mount_point='/dev/shm'):
    if check_ram_disk(mount_point):
        # Create a temporary file in the specified RAM disk
        with tempfile.NamedTemporaryFile(dir=mount_point, delete=False) as tmp_file:
            print(f'Temporary file created at: {tmp_file.name}')

            # Write some data to the temporary file
            tmp_file.write(b'Hello, this is a test data written to RAM disk!')
            tmp_file.flush()  # Ensure all data is written to the file

        # Read and print the content from the file
        with open(tmp_file.name, 'rb') as file:
            content = file.read()
            print(f'Content of the file: {content}')

        # Clean up the file after use
        os.unlink(tmp_file.name)
        print(f'File {tmp_file.name} has been deleted.')
    else:
        print(f"{mount_point} is not mounted as tmpfs. RAM disk is not available.")

tcosz commented 2 months ago

Thank you very much for the tip on writing to RAM directly from Python. Closing the thread, since the C++ code update to implement I/O streaming would require a lot of effort.