Closed lasuk closed 1 month ago
Add method(s) to the abstract LedgerEngine class that dump the entire accounting data (accounts, tax codes, ledger) to CSV files.
All our entities are DataFrames, so we can create a single generic method for all of them: to_csv(df: pd.DataFrame, filename: str = "file", path: str = ".")
Example:
import os

import pandas as pd


def to_csv(df: pd.DataFrame, filename: str = "file", path: str = "."):
    """Save a DataFrame to a CSV file.

    Args:
        df (pd.DataFrame): The DataFrame to save.
        filename (str): The name of the file to save without extension. Default is 'file'.
        path (str): The directory path where the file will be saved. Default is the current directory.

    Raises:
        ValueError: If the DataFrame is empty.
        FileNotFoundError: If the specified path does not exist.
    """
    if df.empty:
        raise ValueError("The DataFrame is empty and cannot be saved.")
    if not os.path.exists(path):
        raise FileNotFoundError(f"The specified path does not exist: {path}")
    file_path = os.path.join(path, f"{filename}.csv")
    try:
        df.to_csv(file_path, index=False)
        print(f"DataFrame successfully saved to {file_path}")
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise RuntimeError(f"An error occurred while saving the DataFrame: {e}") from e
Questions:
Should index=False be exposed as a parameter to this method?
(Optional) Also dump into a single HDF5 file, the data format typically used for persistent storage of pandas DataFrames. Questions:
Should the input be a dict {'df_name': df} or a nested DataFrame with name, df columns?
store.put('df1', df1)
retrieved_df1 = store['df1']
to_hdf5(df: pd.DataFrame, filename: str = "file", path: str = ".")
Add a method to restore dumped data. Questions:
It would be better to test it with the MemoryLedger class, since this would be the freshest* one.
LedgerEngine scope:
StandaloneLedger scope:
We'll need two public methods: dump() and restore(). There is no need to create methods for each data type, because DataFrame.to_csv() already covers that.
.zip file vs .HDF5 file?
The restore() method should be abstract in the LedgerEngine class because it is agnostic of the storage format/method.
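One possible reading of this split is sketched below: dump() is concrete in the base class because it only needs a read accessor plus DataFrame.to_csv(), while restore() stays abstract because only a subclass knows where the restored data should live. The class names LedgerEngineSketch and MemoryLedgerSketch, the accounts() accessor, and the use of an in-memory text buffer instead of a file are all assumptions made for illustration, not the actual LedgerEngine code.

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class LedgerEngineSketch(ABC):
    """Hypothetical, heavily reduced stand-in for the abstract LedgerEngine."""

    @abstractmethod
    def accounts(self) -> pd.DataFrame:
        """Return the account chart."""

    def dump(self, target) -> None:
        # Concrete in the base class: reading goes through the accessor,
        # and DataFrame.to_csv() does the actual writing.
        self.accounts().to_csv(target, index=False)

    @abstractmethod
    def restore(self, source) -> None:
        """Abstract: only a subclass knows where restored data is kept."""


class MemoryLedgerSketch(LedgerEngineSketch):
    """Hypothetical in-memory implementation used for illustration."""

    def __init__(self, accounts: pd.DataFrame):
        self._accounts = accounts

    def accounts(self) -> pd.DataFrame:
        return self._accounts

    def restore(self, source) -> None:
        self._accounts = pd.read_csv(source)


# Dump from one instance and restore into another via a text buffer.
engine = MemoryLedgerSketch(
    pd.DataFrame({"account": [1000, 2000], "text": ["Cash", "Bank"]})
)
buffer = io.StringIO()
engine.dump(buffer)

buffer.seek(0)
other = MemoryLedgerSketch(pd.DataFrame())
other.restore(buffer)
```

With this layout, instantiating the base class directly fails with a TypeError, so every concrete ledger is forced to say how it restores its own state.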
dump() and restore() Methods
We need to manage three key entities: ledger, accounts, and VAT codes. Two methods need to be implemented:
dump(): This method should store all the accounting data from the DataFrames into a single file.
restore(): This method should read the data back from the file and restore the system to its original state.
The goal is to store and retrieve this data efficiently while ensuring data integrity and proper handling of data types.
In our accounting system, we need to efficiently store and retrieve multiple data entities, including ledger, accounts, and VAT codes. Initially, I explored using advanced storage formats like HDF5 and Parquet. However, these formats introduced significant challenges, particularly with handling various data types (dtypes). These complications added unnecessary complexity to what should be a straightforward task.
Given these challenges, the need for a simpler and more transparent solution became clear—one that avoids the pitfalls of specialized libraries and complex formats while still providing a clear and manageable way to store and retrieve our data.
.zip Archive with CSV Files
To address these challenges, I decided to store each DataFrame as a separate CSV file within a .zip archive. This approach offers several key advantages:
No Need for Additional Libraries: By using pandas, which we already depend on, and the standard-library zipfile module, we avoid the overhead and potential compatibility issues associated with specialized formats like HDF5 or Parquet.
Clear Understanding and Representation: CSV files are simple text files that are easy to read, universally supported, and straightforward to manage. This ensures transparency in how our data is stored and handled.
Avoiding dtypes Issues: HDF5 and Parquet formats posed challenges with certain data types (Int64, for instance). CSV files store data as plain text, which sidesteps these issues and simplifies data handling.
Familiarity and Positive Feedback: CSV is a widely used format that our team is already familiar with, and we’ve received good feedback on its reliability and simplicity. It’s a format we trust for storing our data.
Code Clarity: The implementation using CSV and a .zip archive is clean and easy to understand. This approach results in a maintainable and straightforward codebase.
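One caveat on the dtypes point is worth making concrete: plain text carries no type information, so a nullable Int64 column with a missing value comes back from CSV as float64 unless the dtype is re-applied after loading. The sketch below shows this with made-up sample data; the astype() call is only a stand-in for whatever the standardize_* methods do in the real code, which is an assumption here.

```python
import io

import pandas as pd

# Illustrative data only: a nullable integer column with one missing value.
df = pd.DataFrame({
    "account": pd.array([1000, pd.NA, 2000], dtype="Int64"),
    "text": ["Cash", "Unassigned", "Bank"],
})

csv_text = df.to_csv(index=False)

# Reading back without hints: the missing value forces a float column.
raw = pd.read_csv(io.StringIO(csv_text))

# Re-apply the nullable integer dtype after loading.
fixed = raw.astype({"account": "Int64"})

print(raw["account"].dtype, fixed["account"].dtype)
```

So CSV avoids the write-time dtype problems, but the restore path still has to put the intended dtypes back.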
Here’s the updated code using CSV files within a .zip archive:
dump() Method
import zipfile


def dump(self, archive_path):
    with zipfile.ZipFile(archive_path, 'w') as archive:
        # Serialize each DataFrame to CSV text and write it straight into
        # the archive, so no temporary files are left in the working directory
        archive.writestr('ledger.csv', self._ledger.to_csv(index=False))
        archive.writestr('accounts.csv', self._accounts.to_csv(index=False))
        archive.writestr('vat_codes.csv', self._vat_codes.to_csv(index=False))
    print(f"Data dumped to {archive_path} successfully.")
restore() Method
import zipfile

import pandas as pd


def restore(self, archive_path):
    with zipfile.ZipFile(archive_path, 'r') as archive:
        # Read each CSV directly from the archive instead of extracting
        # files into the working directory and deleting them afterwards
        with archive.open('ledger.csv') as f:
            self._ledger = self.standardize_ledger(pd.read_csv(f))
        with archive.open('accounts.csv') as f:
            self._accounts = self.standardize_account_chart(pd.read_csv(f))
        with archive.open('vat_codes.csv') as f:
            self._vat_codes = self.standardize_vat_codes(pd.read_csv(f))
    print(f"Data restored from {archive_path} successfully.")
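A quick way to check that the archive format round-trips is a standalone sketch like the one below. It uses free functions rather than the class methods, takes a dict of the form {'df_name': df} as raised in the earlier question, and works on made-up sample data in a temporary directory. The names dump_frames and restore_frames are hypothetical and not part of the code base.

```python
import os
import tempfile
import zipfile

import pandas as pd


def dump_frames(frames: dict, archive_path: str) -> None:
    """Write each DataFrame as '<name>.csv' inside one zip archive."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        for name, df in frames.items():
            archive.writestr(f"{name}.csv", df.to_csv(index=False))


def restore_frames(archive_path: str) -> dict:
    """Read every '<name>.csv' member back into a DataFrame."""
    frames = {}
    with zipfile.ZipFile(archive_path, "r") as archive:
        for member in archive.namelist():
            name, ext = os.path.splitext(member)
            if ext == ".csv":
                with archive.open(member) as f:
                    frames[name] = pd.read_csv(f)
    return frames


# Illustrative sample data, not taken from a real ledger.
original = {
    "ledger": pd.DataFrame(
        {"id": [1, 2], "account": [1000, 2000], "amount": [100.0, -100.0]}
    ),
    "accounts": pd.DataFrame({"account": [1000, 2000], "text": ["Cash", "Bank"]}),
    "vat_codes": pd.DataFrame({"id": ["OutStd"], "rate": [0.077]}),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ledger.zip")
    dump_frames(original, path)
    restored = restore_frames(path)

# Each restored frame should match the one that was dumped.
for name, df in original.items():
    pd.testing.assert_frame_equal(restored[name], df)
```

The sample frames contain only plain integers, floats, and strings, so they survive the trip unchanged; columns with nullable or datetime dtypes would still need a standardization step after loading.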
By choosing to store our data in a .zip archive containing individual CSV files, we simplify the storage and retrieval process. This method avoids the complications associated with more complex formats like HDF5 and Parquet, while leveraging familiar, straightforward tools and formats that ensure transparency and ease of use. This solution is not only practical but also aligns with our team's expertise, providing a robust and maintainable approach to data management.
Tasks:
MemoryLedger or TestLedger class.
fx_adustments and price date to StandaloneLedger.