1. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/bb5004b001e355460a84b60420de65b7bcb55d31 Recursive CSV File Match in Sub-directories
The app can now scan for CSV files inside subdirectories.
For example, if the root dir is set to `Bidmove_Complete`, the app scans not only for CSV files in that dir but also for all CSV files matching `Bidmove_Complete/2021/*.csv` and `Bidmove_Complete/2022/July/*.csv`.
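A minimal sketch of this kind of recursive scan, using only the standard library. The helper name `scan_csv` is hypothetical and the commit's actual implementation may differ; only the two list names are taken from this item.

```python
from pathlib import Path


def scan_csv(root_dir):
    """Return (relative, absolute) path lists for every CSV file under root_dir."""
    root = Path(root_dir).resolve()
    # rglob("*") walks root and every nested subdirectory; the suffix is
    # compared case-insensitively so both .csv and .CSV are picked up.
    matches = sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".csv"
    )
    # Names mirror the variables mentioned in this item; contents are a guess.
    relativeCSVFilePath = [p.relative_to(root).as_posix() for p in matches]
    absoluteCSVFilePath = [str(p) for p in matches]
    return relativeCSVFilePath, absoluteCSVFilePath
```

Called as `scan_csv("Bidmove_Complete")`, this would return relative entries such as `2021/<name>.csv` alongside their absolute counterparts.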
Implemented the variables `relativeCSVFilePath` and `absoluteCSVFilePath` for this functionality.

2. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/d0b7b95288bc63e9efdfa906ee70c7de19e81b8b Use C Engine to Improve CSV Read Speed
The `pandas.read_csv` docstring notes that the C engine is faster, while the Python engine is more feature-complete.
Well, in some of my tests, the C engine read over 5x faster than the Python engine. So yeah, the Python engine sucks.
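For anyone who wants to repeat this kind of comparison, here is a rough sketch that parses the same in-memory CSV with both engines and times each run. The data is synthetic and the column names are placeholders, so the ratio will not necessarily match the figure above.

```python
import io
import time

import pandas as pd


def timed_read(csv_text, engine):
    """Parse csv_text with the given pandas engine; return (dataframe, seconds)."""
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(csv_text), engine=engine)
    return df, time.perf_counter() - start


# Synthetic data for illustration only, not real AEMO content.
rows = "\n".join(f"{i},UNIT{i % 5},{i * 0.5}" for i in range(50_000))
csv_text = "ID,DUID,VALUE\n" + rows + "\n"

df_c, seconds_c = timed_read(csv_text, "c")
df_py, seconds_py = timed_read(csv_text, "python")
print(f"c engine: {seconds_c:.3f}s, python engine: {seconds_py:.3f}s")
```

Timings depend on the machine and the file contents, so treat any single run as indicative only.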
To migrate to the C engine, I had to give up the `skipfooter` option that was previously implemented. I also fixed the index used in `skiprows` for the `BIDPEROFFER_D` table.

3. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/1caed30418f4627d905f683b5e42d28408516c7a Drop duplicates in each file load
Another performance improvement update.
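As a minimal sketch of what the title of this item describes, assuming the files are simply read one by one and concatenated: each dataframe is deduplicated right after it is loaded, so the merge works on far fewer rows. The helper name `load_and_merge` is hypothetical and not taken from the commit.

```python
import io

import pandas as pd


def load_and_merge(csv_sources):
    """Read each CSV source, drop its duplicate rows, then concatenate."""
    frames = []
    for source in csv_sources:
        df = pd.read_csv(source, engine="c")
        # Deduplicate per file so the merged dataframe stays small.
        frames.append(df.drop_duplicates())
    return pd.concat(frames, ignore_index=True)


# Tiny in-memory stand-ins for two CSV files.
first = io.StringIO("DUID,VALUE\nA,1\nA,1\nB,2\n")
second = io.StringIO("DUID,VALUE\nC,3\nC,3\n")
merged = load_and_merge([first, second])
print(len(merged))
```

This sketch does not remove rows that repeat across different files; whether a final pass is still needed depends on the data.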
During my runtime tests, I noticed that the merging process took a long time for large data sets (~30s for 62 files). Because `drop_duplicates()` was called on the merged dataframe, which already had over 20 million rows at that point, the operation took quite long to complete.
By calling `drop_duplicates()` on each CSV file as it is loaded, the final merged dataframe is much smaller, so the merging process speeds up significantly. Each `read_csv()` takes ~0.05s longer to load, but the merging process drops from ~30s to under a second.

4. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/4f7bf49af0a31527fb06da8044843be731e1036d Added Dedicated Function to Filter CSV Date Range
Thanks @merrecdarkin! The new function `filterCSVDate` takes the user's date range filter, processes `relativeCSVFilePath`, and returns `validCSVFilePath`, the paths that match the range.
The idea is to use the date range filter (set by the user) to reduce the number of CSV files loaded, instead of loading all of them and running a date query afterwards.
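One way such a filter could work is sketched below. It assumes every file name carries a single `YYYYMMDD` date stamp and that paths use `/` separators; both are assumptions, as are the sample file names. `filter_csv_date` is a hypothetical stand-in, not the `filterCSVDate` from the commit.

```python
import re
from datetime import date, datetime

# Assumption: the file name contains exactly one 8-digit YYYYMMDD run.
DATE_STAMP = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def filter_csv_date(relativeCSVFilePath, start, end):
    """Keep the paths whose file-name date falls within [start, end]."""
    validCSVFilePath = []
    for path in relativeCSVFilePath:
        file_name = path.rsplit("/", 1)[-1]
        match = DATE_STAMP.search(file_name)
        if match is None:
            continue  # no date stamp in the name, so skip the file
        try:
            file_date = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a real date
        if start <= file_date <= end:
            validCSVFilePath.append(path)
    return validCSVFilePath


# Hypothetical file names, for illustration only.
sample_paths = [
    "BIDMOVE_20210115.csv",
    "2022/July/BIDMOVE_20220701.csv",
    "2022/July/BIDMOVE_20220715.csv",
    "notes.csv",
]
print(filter_csv_date(sample_paths, date(2022, 7, 1), date(2022, 7, 10)))
```

Because only the names are inspected, no file has to be opened before the filter decides whether to keep it.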
This is also a significant improvement not only to the process runtime but also to the maintainability of the code base in general. I can now remove all the `datetime` conversion in the GUI. Moreover, in the back end, I got rid of all `SETTLEMENTDATE` queries, along with the `parse_dates` arg in `read_csv`. The result is much cleaner code.

5. https://github.com/merrecdarkin/Wil-Project-AEMO-CSV-Efficient-Reading-Sorting-and-Export-/commit/54a9bf9164d89fa3f1f678d86fa65211478949a7 GUI Overhaul
The GUI received a bit of rework to better suit the new on-file date filter workflow.
   - `SET DATE` button that updates the CSV list every time the folder changes or a new date filter is set
Now the expected workflow is:
   1. `BROWSE`
   2. `SET DATE` button to filter for the CSV files that match the date range
   3. `DUID` and/or `BIDTYPE` filter, then click `EXPORT`