macxred / pyledger

Python package to streamline the implementation and management of accounting systems.
MIT License

Extend LedgerEngine to dump and restore accounting data #7

Closed lasuk closed 1 month ago

lasuk commented 3 months ago

Tasks:

  1. Add method(s) to dump the entire accounting data to CSV files to the abstract LedgerEngine class (accounts, tax codes, ledger).
  2. (Optional) also dump into a single HDF5 file, the format typically used for persistent storage of pandas DataFrames.
  3. Add method to restore dumped data.
  4. Test this functionality by using the MemoryLedger or TestLedger class.
  5. (Maybe later) add methods to dump and restore fx_adjustments and price data to StandaloneLedger.
AlexTheWizardL commented 2 months ago

Proposed solution and questions

  1. Add method(s) to dump the entire accounting data to CSV files to the abstract LedgerEngine class (accounts, tax codes, ledger).

    • Since our data is stored in DataFrames, we can create a single generic method for all our entities.
    • The interface could look like: to_csv(df: pd.DataFrame, filename: str = "file", path: str = ".")

    Example:

    import os

    import pandas as pd

    def to_csv(df: pd.DataFrame, filename: str = "file", path: str = "."):
        """Save a DataFrame to a CSV file.

        Args:
            df (pd.DataFrame): The DataFrame to save.
            filename (str): The name of the file to save, without extension. Default is 'file'.
            path (str): The directory path where the file will be saved. Default is the current directory.

        Raises:
            ValueError: If the DataFrame is empty.
            FileNotFoundError: If the specified path does not exist.
        """
        if df.empty:
            raise ValueError("The DataFrame is empty and cannot be saved.")

        if not os.path.exists(path):
            raise FileNotFoundError(f"The specified path does not exist: {path}")

        file_path = os.path.join(path, f"{filename}.csv")

        try:
            df.to_csv(file_path, index=False)
            print(f"DataFrame successfully saved to {file_path}")
        except Exception as e:
            # Re-raise with context while preserving the original traceback
            raise RuntimeError(f"An error occurred while saving the DataFrame: {e}") from e

    Questions:

    • Personally I prefer to have only one generic method instead of a method per entity. What are your thoughts?
    • What are your thoughts on error handling?
    • Should we pass index=False as a parameter to this method?
  2. (Optional) also dump into a single HDF5 file, the format typically used for persistent storage of pandas DataFrames. Questions:

    • Do we need to place all the entities that we have (accounts, ledger, vat...) in just one file?
      • If yes
        • we should choose an interface for this: a dict like {'df_name': df}, or a nested DataFrame with name and df columns
        • we should also store the DataFrame names as constants, so the names stay consistent between writing and reading: store.put('df1', df1); retrieved_df1 = store['df1']
      • If not
        • we can also create a generic method with an interface like: to_hdf5(df: pd.DataFrame, filename: str = "file", path: str = ".")
  3. Add method to restore dumped data. Questions:

    • Should this be only one method, or one per entity?
    • Should the method(s) write directly to the class variables, or return the restored data through some interface?
    • Should we restore the data from the CSV file, from the HDF5 file, or just pass the method(s) a file path with an extension and let the method deal with both?
  4. It would be better to test this with the MemoryLedger class, since it is the freshest one.
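
On the last question in point 3, one option is a single restore() entry point that infers the loader from the file extension. A minimal sketch (the helper and loader names are hypothetical, not part of pyledger):

```python
from pathlib import Path

# Hypothetical mapping from dump-file extension to the loader restore()
# would delegate to; '.zip' covers the CSV-in-archive variant.
_LOADERS = {".csv": "csv", ".zip": "zip", ".h5": "hdf5", ".hdf5": "hdf5"}

def pick_loader(path: str) -> str:
    """Return which loader a restore() implementation should use for `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in _LOADERS:
        raise ValueError(f"Unsupported dump format: {suffix}")
    return _LOADERS[suffix]
```

With this shape, restore() stays a single public method and the format question is decided per call rather than per entity.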

AlexTheWizardL commented 2 months ago

Data types to dump and restore:

LedgerEngine scope:

StandaloneLedger scope:

We'll need two public methods: dump() and restore(). No need to create a method for each data type, because serialization is already covered by the DataFrame.to_csv() method.

.zip file vs .HDF5 file?

The restore() method should be abstract in the LedgerEngine class, because the base class is agnostic of the storage format.
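
A minimal sketch of that shape, assuming Python's abc module (method signatures are illustrative, not the final interface):

```python
from abc import ABC, abstractmethod

class LedgerEngine(ABC):
    """Base engine: declares dump()/restore() without fixing a storage format."""

    @abstractmethod
    def dump(self, path: str) -> None:
        """Write ledger, accounts, and VAT codes to a single archive."""

    @abstractmethod
    def restore(self, path: str) -> None:
        """Rebuild all accounting entities from a previously dumped archive."""
```

Concrete engines such as MemoryLedger would then pick the actual format (.zip, HDF5, ...) in their overrides.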

AlexTheWizardL commented 2 months ago

Implementing dump() and restore() Methods

We need to manage three key entities: ledger, accounts, and VAT codes, and implement two methods:

  1. dump(): This method should store all the accounting data from the DataFrames into a single file.
  2. restore(): This method should read the data back from the file and restore the system to its original state.

The goal is to efficiently store and retrieve this data while ensuring data integrity and proper handling of data types.

Problem Definition

In our accounting system, we need to efficiently store and retrieve multiple data entities, including ledger, accounts, and VAT codes. Initially, I explored using advanced storage formats like HDF5 and Parquet. However, these formats introduced significant challenges, particularly with handling various data types (dtypes). These complications added unnecessary complexity to what should be a straightforward task.

Given these challenges, the need for a simpler and more transparent solution became clear—one that avoids the pitfalls of specialized libraries and complex formats while still providing a clear and manageable way to store and retrieve our data.

Solution: Using a .zip Archive with CSV Files

To address these challenges, I decided to store each DataFrame as a separate CSV file within a .zip archive. This approach offers several key advantages:

  1. No Need for External Libraries: By using standard Python libraries (pandas and zipfile), we avoid the overhead and potential compatibility issues associated with specialized formats like HDF5 or Parquet.

  2. Clear Understanding and Representation: CSV files are simple text files that are easy to read, universally supported, and straightforward to manage. This ensures transparency in how our data is stored and handled.

  3. Avoiding dtypes Issues: HDF5 and Parquet formats posed challenges with certain data types (Int64, for instance). CSV files store data in a plain-text format, bypassing these issues and simplifying the data handling process.

  4. Familiarity and Positive Feedback: CSV is a widely used format that our team is already familiar with, and we’ve received good feedback on its reliability and simplicity. It’s a format we trust for storing our data.

  5. Code Clarity: The implementation using CSV and a .zip archive is clean and easy to understand. This approach results in a maintainable and straightforward codebase.

Code Implementation

Here’s the updated code using CSV files within a .zip archive:

dump() Method

def dump(self, archive_path):
    with zipfile.ZipFile(archive_path, 'w') as archive:
        # Serialize each DataFrame straight into the archive; no temporary
        # CSV files are written to the working directory, so no cleanup is needed
        archive.writestr('ledger.csv', self._ledger.to_csv(index=False))
        archive.writestr('accounts.csv', self._accounts.to_csv(index=False))
        archive.writestr('vat_codes.csv', self._vat_codes.to_csv(index=False))

    print(f"Data dumped to {archive_path} successfully.")

restore() Method

def restore(self, archive_path):
    with zipfile.ZipFile(archive_path, 'r') as archive:
        # Read each CSV directly from the archive; nothing is extracted
        # to disk, so no cleanup is needed
        with archive.open('ledger.csv') as file:
            self._ledger = self.standardize_ledger(pd.read_csv(file))
        with archive.open('accounts.csv') as file:
            self._accounts = self.standardize_account_chart(pd.read_csv(file))
        with archive.open('vat_codes.csv') as file:
            self._vat_codes = self.standardize_vat_codes(pd.read_csv(file))

    print(f"Data restored from {archive_path} successfully.")
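
As a sanity check of the round trip, here is a self-contained sketch with a stub class standing in for MemoryLedger (the standardize_* normalization steps are omitted and the sample data is made up):

```python
import tempfile
import zipfile

import pandas as pd

class StubLedger:
    """Minimal stand-in for MemoryLedger: one DataFrame per entity."""

    def __init__(self):
        self._ledger = pd.DataFrame({"account": [1000], "amount": [42.0]})
        self._accounts = pd.DataFrame({"account": [1000], "name": ["Cash"]})
        self._vat_codes = pd.DataFrame({"code": ["V8"], "rate": [0.08]})

    def dump(self, archive_path):
        # Write each DataFrame straight into the archive as CSV text
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("ledger.csv", self._ledger.to_csv(index=False))
            archive.writestr("accounts.csv", self._accounts.to_csv(index=False))
            archive.writestr("vat_codes.csv", self._vat_codes.to_csv(index=False))

    def restore(self, archive_path):
        # Read the CSVs back directly from the archive
        with zipfile.ZipFile(archive_path, "r") as archive:
            self._ledger = pd.read_csv(archive.open("ledger.csv"))
            self._accounts = pd.read_csv(archive.open("accounts.csv"))
            self._vat_codes = pd.read_csv(archive.open("vat_codes.csv"))

with tempfile.TemporaryDirectory() as tmp:
    path = f"{tmp}/dump.zip"
    original = StubLedger()
    original.dump(path)

    fresh = StubLedger()
    fresh._ledger = pd.DataFrame()  # wipe to prove restore repopulates it
    fresh.restore(path)
    assert fresh._ledger.equals(original._ledger)
```

The assert at the end verifies that a wiped instance is fully repopulated from the archive, which is essentially the test planned for MemoryLedger in task 4.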

Conclusion

By choosing to store our data in a .zip archive containing individual CSV files, we simplify the storage and retrieval process. This method avoids the complications associated with more complex formats like HDF5 and Parquet, while leveraging familiar, straightforward tools and formats that ensure transparency and ease of use. This solution is not only practical but aligns with our team's expertise, providing a robust and maintainable approach to data management.