merrecdarkin / Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-

WIL
MIT License

GUI Overhaul and Performance Improvement #12

Closed larryhuangdev closed 2 years ago

larryhuangdev commented 2 years ago

1. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/bb5004b001e355460a84b60420de65b7bcb55d31 Recursive CSV File Match in Sub-directories

The app can now scan for all CSV files inside subdirectories.

For example, if the root dir is set to Bidmove_Complete, the app scans not only the CSV files in that dir but also all CSV files matching Bidmove_Complete/2021/*.csv and Bidmove_Complete/2022/July/*.csv.

Implemented variables relativeCSVFilePath and absoluteCSVFilePath for this functionality
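A minimal sketch of how a recursive scan like this could look, using pathlib. The function name find_csv_files and the demo file names are made up for illustration; the actual commit may build relativeCSVFilePath and absoluteCSVFilePath differently.

```python
import tempfile
from pathlib import Path


def find_csv_files(root_dir):
    """Return (relative_paths, absolute_paths) for every CSV under root_dir.

    Hypothetical helper, not the function used in the commit.
    """
    root = Path(root_dir).resolve()
    # rglob walks the root dir and every subdirectory below it
    absolute_paths = sorted(p for p in root.rglob("*.csv") if p.is_file())
    relative_paths = [p.relative_to(root) for p in absolute_paths]
    return relative_paths, absolute_paths


# Demo on a throwaway directory tree shaped like the example above
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "Bidmove_Complete"
    (demo_root / "2021").mkdir(parents=True)
    (demo_root / "2022" / "July").mkdir(parents=True)
    for name in ("a.csv", "2021/b.csv", "2022/July/c.csv", "notes.txt"):
        (demo_root / name).write_text("x\n")

    relative_paths, absolute_paths = find_csv_files(demo_root)
    found = sorted(p.as_posix() for p in relative_paths)
    all_absolute = all(p.is_absolute() for p in absolute_paths)
```

Keeping both lists side by side lets the GUI show short relative paths while the loader reads from the absolute ones.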

2. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/d0b7b95288bc63e9efdfa906ee70c7de19e81b8b Use C Engine to Improve CSV Read Speed

From pandas.read_csv docstring

The C and pyarrow engines are faster, while the python engine is currently more feature-complete.

Well, in some of my tests, the C engine read speed is over 5x faster than the python engine. So yeah, python engine sucks.

To migrate to the C engine, I had to sacrifice the skipfooter option that was previously implemented. I also fixed the index used in skiprows for the BIDPEROFFER_D table.
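For context, skipfooter is only supported by the python engine in pandas.read_csv, so it has to go when engine="c" is set explicitly. The sketch below shows one possible workaround, which is to read everything and drop the trailing row afterwards. This is an assumption for illustration, not necessarily what the commit does, and the sample data and column names are invented.

```python
import io

import pandas as pd

# Invented sample: one leading line to skip, a header, two data rows, one footer row
sample = (
    "C,HEADER LINE,ignored\n"
    "I,BID,UNIT,PRICE\n"
    "D,BID,A1,10.5\n"
    "D,BID,B2,20.0\n"
    "C,END OF REPORT,4\n"
)

# engine="c" is the fast parser; skiprows still works with it, skipfooter does not
df = pd.read_csv(io.StringIO(sample), engine="c", skiprows=1)

# Assumed replacement for skipfooter=1: drop the last row after loading
df = df.iloc[:-1]
```

Dropping the footer after the read costs one cheap slice, which is small compared with the parsing time saved by the C engine.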

3. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/1caed30418f4627d905f683b5e42d28408516c7a Drop duplicates in each file load

Another performance improvement update.

During my runtime tests, I noticed that the merging process took a lot of time for large data sets (~30s for 62 files). Because drop_duplicates() was called on the merged dataframe, which already had over 20m rows at that point, the operation took quite long to complete.

By calling drop_duplicates() on each CSV file as it loads, the final merged dataframe is much smaller and the merging process speeds up significantly. Each read_csv() takes ~0.05s longer to load, but the merging process drops from ~30s to under a second.
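A rough sketch of the per-file deduplication idea. The helper name load_and_merge and the sample columns are hypothetical, and the final drop_duplicates() pass is my own assumption to catch rows repeated across files; the commit may or may not keep it.

```python
import io

import pandas as pd


def load_and_merge(csv_sources):
    """Deduplicate each file right after loading, then concatenate.

    Hypothetical helper, not the function used in the commit.
    """
    frames = [pd.read_csv(src, engine="c").drop_duplicates() for src in csv_sources]
    merged = pd.concat(frames, ignore_index=True)
    # Assumed final pass: cheap now, because each frame is already small
    return merged.drop_duplicates(ignore_index=True)


# Invented sample data with duplicates inside and across files
file_a = io.StringIO("DUID,PRICE\nA1,10\nA1,10\nB2,20\n")
file_b = io.StringIO("DUID,PRICE\nB2,20\nC3,30\nC3,30\n")

merged = load_and_merge([file_a, file_b])
```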

4. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/4f7bf49af0a31527fb06da8044843be731e1036d Added Dedicated Function to Filter CSV Date Range

Thanks @merrecdarkin! The new function filterCSVDate takes the user's date range filter as input, processes relativeCSVFilePath, and returns validCSVFilePath containing the paths that match the range.

The idea is to use the date range filter (set by the user) to reduce the number of CSV files loaded, instead of loading all of them and running a date query afterwards.

This is a significant improvement not only to the process runtime but also to the maintainability of the code base in general. I can now remove all the datetime conversion in the GUI. Moreover, in the back-end, I completely got rid of all SETTLEMENTDATE queries, along with the parse_dates arg in read_csv. The result is much cleaner code.
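To illustrate the filter-before-load idea, here is a minimal sketch. It assumes each file name carries a standalone 8-digit YYYYMMDD date, which the issue does not confirm. The name filter_csv_by_date, the regex, and the sample paths are all hypothetical, and the real filterCSVDate may work differently.

```python
import re
from datetime import date, datetime

# Assumption: a standalone run of exactly 8 digits in the path is the file date
DATE_PATTERN = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def filter_csv_by_date(relative_paths, start, end):
    """Keep only the paths whose embedded date falls within [start, end]."""
    valid_paths = []
    for path in relative_paths:
        match = DATE_PATTERN.search(str(path))
        if match is None:
            continue  # no date in the name, so skip the file
        try:
            file_date = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # 8 digits that do not form a real date
        if start <= file_date <= end:
            valid_paths.append(path)
    return valid_paths


# Invented sample paths
paths = [
    "2021/PUBLIC_BIDMOVE_COMPLETE_20211231_0000000001.csv",
    "2022/July/PUBLIC_BIDMOVE_COMPLETE_20220701_0000000002.csv",
    "2022/July/PUBLIC_BIDMOVE_COMPLETE_20220715_0000000003.csv",
    "2022/readme.csv",
]
valid = filter_csv_by_date(paths, date(2022, 7, 1), date(2022, 7, 10))
```

Because the filter only looks at path strings, no CSV is opened until it has already passed the date check.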

5. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/54a9bf9164d89fa3f1f678d86fa65211478949a7 GUI Overhaul

The GUI received a bit of rework to better suit the new on-file date filter workflow.

Now the expected workflow is: