hkpeaks / peaks-consolidation

Peaks Consolidation is equipped with state-of-the-art algorithms and data structures that support high-performance databending exercises. It specializes in management accounting and consolidation, with some special topics in machine learning and bioinformatics.
https://www.linkedin.com/in/max01/recent-activity/all/
MIT License

Add more examples, from simple to more detailed #5

Open rferraton opened 1 year ago

rferraton commented 1 year ago

I suggest adding more examples in the readme and/or in a dedicated folder.

Read/select/write
Read/select/orderby/write
Read/select/filter/write
Read/select/filter/groupby/write
Read/select/join/write
Read/select/join/filter/groupby/orderby/write
Read/select/split/write
...

hkpeaks commented 1 year ago

I am planning to create a tutorial.io site. I have noticed that new visitors do not accept a long readme. Currently, the combination of commands affects whether a larger-than-memory dataset can be handled.

So I am on the way to redesigning the streaming engine to separate the T phase from the E-L phases. The current engine will focus on the Transformation phase, while the new streaming engine will be responsible for the E and L phases. The disadvantage is that an app will need to configure three script files, one for each of E, T and L. The E script is also used to define the names of the T and L scripts. The E phase is responsible for distributing table partitions to T; T returns datasets to E for the L phase.

After this change, commands used in the transformation phase will not be affected by file format or by whether the dataset exceeds memory. I will also add support for more file formats in the E and L phases. The E-L phases can run inside Peaks.exe or in a separate runtime, connected via gRPC. I believe this new streaming model can help solve the large-dataset issues I have found in DuckDB and Polars. My plan is for the new streaming engine to support trillions of rows, so the distribution of partitions for the transformation phase can be processed by more than one computing device. I hope this move can drive more open source software toward the age of distributed computing.
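
To make the phase split easier to picture, below is a minimal, illustrative Go sketch of the streaming model described above: an E stage distributes table partitions to one or more T workers, and the T results are drained by an L stage. The Partition type, channel wiring and function names are assumptions for illustration only, not the actual Peaks engine; the gRPC boundary between the E-L runtime and T is only noted in a comment.

package main

import (
	"fmt"
	"sync"
)

// Partition is a batch of rows the E phase hands to the T phase.
// In the planned design this hand-off could also cross a gRPC boundary.
type Partition struct {
	ID   int
	Rows []string
}

// extract plays the E phase: split the source into partitions and stream them out.
func extract(source []string, size int, out chan<- Partition) {
	id := 0
	for start := 0; start < len(source); start += size {
		end := start + size
		if end > len(source) {
			end = len(source)
		}
		out <- Partition{ID: id, Rows: source[start:end]}
		id++
	}
	close(out)
}

// transform plays the T phase: each worker consumes partitions and emits result batches.
func transform(in <-chan Partition, out chan<- Partition, wg *sync.WaitGroup) {
	defer wg.Done()
	for p := range in {
		result := make([]string, 0, len(p.Rows))
		for _, row := range p.Rows {
			// Placeholder for real work such as Filter, JoinTable or GroupBy.
			result = append(result, row+" (transformed)")
		}
		out <- Partition{ID: p.ID, Rows: result}
	}
}

// load plays the L phase: drain result batches and write them to the target.
func load(in <-chan Partition) {
	for batch := range in {
		fmt.Printf("loaded partition %d: %d rows\n", batch.ID, len(batch.Rows))
	}
}

func main() {
	source := []string{"row1", "row2", "row3", "row4", "row5"}

	partitions := make(chan Partition)
	results := make(chan Partition)

	// E phase distributes partitions; several T workers stand in for multiple devices.
	go extract(source, 2, partitions)

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go transform(partitions, results, &wg)
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	load(results)
}

With more than one transform worker, the same wiring extends naturally to partitions being processed by more than one computing device.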

hkpeaks commented 1 year ago

Data Transformation: Pass Table By Memory

Demo Video: https://youtu.be/5Jhd1WwgfYg

Read{Master.csv ~ Master} Read{Fact.csv ~ Table} Write{Table ~ %ExpandBy10Time.csv}

Test 1: JoinTable to Add 2 Columns and Select Columns

JoinTable{Outbox/%ExpandBy10Time.csv | Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) ~ Result-JoinTable} Select{Result-JoinTable | Date,Shop,Product,Quantity,Amount ~ Result-SelectColumn}

Test 2: BuildKeyValue + JoinKeyValue + AddColumn = JoinTable of Test 1

BuildKeyValue{Master | Product, Style ~ MasterTableKeyValue} JoinKeyValue{Outbox/%ExpandBy10Time.csv | Product, Style => AllMatch(MasterTableKeyValue) ~ Result-BuildKeyValue} AddColumn{Result-BuildKeyValue | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn}

Test 3: Filter and FilterUnmatch

Filter{Result-AddColumn | Amount(Float > 50000) ~ Result-Filter} FilterUnmatch{Result-AddColumn | Amount(Float > 50000) ~ Result-FilterUnmatch}

Test 4: Distinct and OrderBy

Distinct{Result-Filter | Date, Shop, Product, Style ~ Result-Distinct-Match} Distinct{Result-FilterUnmatch | Date, Shop, Product, Style ~ Result-Distinct-Unmatch} OrderBy{Result-Distinct-Unmatch | Shop(A)Product(A)Date(D) ~ Result-Distinct-Unmatch-OrderAAD}

Test 5: GroupBy

GroupBy{Result-Filter | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match} GroupBy{Result-FilterUnmatch | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Unmatch}

Test 6: Write to Disk

Write{Result-JoinTable ~ Result-JoinTable.csv} Write{Result-SelectColumn ~ Result-SelectColumn.csv} Write{MasterTableKeyValue ~ MasterTableKeyValue.csv} Write{Result-AddColumn ~ Result-AddColumn.csv} Write{Result-Filter ~ Result-Filter.csv} Write{Result-FilterUnmatch ~ Result-FilterUnmatch.csv} Write{Result-Distinct-Match ~ Result-Distinct-Match.csv} Write{Result-Distinct-Unmatch ~ Result-Distinct-Unmatch.csv} Write{Result-Distinct-Unmatch-OrderAAD ~ Result-Distinct-Unmatch-OrderAAD.csv} Write{Result-GroupBy-Match ~ Result-GroupBy-Match.csv} Write{Result-GroupBy-Unmatch ~ Result-GroupBy-Unmatch.csv}

Data Transformation: Pass Table By Disk

Read{Master.csv ~ Master} Read{Fact.csv ~ Table} Write{Table ~ %ExpandBy123Time.csv}

Test 1: JoinTable to Add 2 Columns

JoinTable{Outbox/%ExpandBy123Time.csv | Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) ~ Result-JoinTable.csv} OrderBy{Outbox/Result-JoinTable.csv | Date(D) => CreateFolderLake(Shop) ~ Result-JoinTable-OrderBy.csv} Select{Outbox/Result-JoinTable-OrderBy.csv | Date,Shop,Product,Quantity,Amount ~ Result-SelectColumn.csv}

Test 2: BuildKeyValue + JoinKeyValue + AddColumn = JoinTable of Test 1

BuildKeyValue{Master | Product, Style ~ MasterTableKeyValue} JoinKeyValue{Outbox/%ExpandBy123Time.csv | Product, Style => AllMatch(MasterTableKeyValue) ~ Result-BuildKeyValue.csv} AddColumn{ Outbox/Result-BuildKeyValue.csv | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn.csv}

Test 3: Filter

Filter{Outbox/Result-AddColumn.csv | Amount(Float > 50000) ~ Result-Filter.csv} FilterUnmatch{Outbox/Result-AddColumn.csv | Amount(Float > 50000) ~ Result-FilterUnmatch.csv}

Test 4: Distinct

Distinct{Outbox/Result-Filter.csv | Date, Shop, Product, Style ~ Result-Distinct-Match} Write{Result-Distinct-Match ~ Result-Distinct-Match.csv} Distinct{Outbox/Result-FilterUnmatch.csv | Date, Shop, Product, Style ~ Result-Distinct-Unmatch} Write{Result-Distinct-Unmatch ~ Result-Distinct-Unmatch.csv} OrderBy{Outbox/Result-Distinct-Unmatch.csv | Shop(A)Product(A)Date(D) ~ Result-Distinct-Unmatch-OrderAAD.csv}

Test 5: GroupBy

GroupBy{Outbox/Result-Filter.csv | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match} Write{Result-GroupBy-Match ~ Result-GroupBy-Match.csv} GroupBy{Outbox/Result-FilterUnmatch.csv | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Unmatch} Write{Result-GroupBy-Unmatch ~ Result-GroupBy-Unmatch.csv}

Test 6: CreateFolderLake and Query

CreateFolderLake{Outbox/Result-JoinTable-OrderBy.csv | Shop ~ FolderLake} Filter{Outbox/FolderLake/S15/*.csv | Product(222..888) Style(=F) ~ Result-FilterFolderLake.csv}

Test 7: ReadSample and View

ReadSample{Outbox/Result-FilterFolderLake.csv | StartPosition%(0) ByteLength(100000) ~ SampleTable} View{SampleTable} ReadSample{Outbox/Result-Distinct-Unmatch-OrderAAD.csv ~ SampleTable} View{SampleTable}

rferraton commented 1 year ago

Thanks for the examples, but I am not familiar with Peaks syntax. What do | => % ~ mean? Maybe simpler examples could also help. Maybe you could use colors to explain a little bit more by associating code with plain-English explanations? Maybe you could also use SQL equivalents to explain...

hkpeaks commented 1 year ago

Command{Extraction | Transformation ~ Load}

Write{Table ~ %ExpandBy123Time.csv} - the % is used to avoid collision with user file names.

=> is used to separate the transformation setting into 2 distinct parts, e.g. JoinTable{Outbox/%ExpandBy123Time.csv | Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) ~ Result-JoinTable.csv} GroupBy{Result-Filter | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match} AddColumn{Result-BuildKeyValue | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn}

Before => normally refers to columns of a table; after => usually adds columns and calculates column results.
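
Taking the JoinTable example above apart as a reading guide (my annotation, not official documentation): the part before | is the Extraction setting, i.e. the source file or table (Outbox/%ExpandBy123Time.csv); the part between | and ~ is the Transformation setting, where Quantity, Unit_Price are the referred columns and InnerJoin(Master)Multiply(Amount) after => describes the columns to add and calculate; the part after ~ is the Load setting, i.e. the result table or file (Result-JoinTable.csv).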

Since I find SQL statements very complex and not flexible, I invented this new ETL syntax. Pandas settings are also complex, but more flexible.

A command list has been added to the readme. The current development version has 18 commands. The target date to publish the 2nd release is October 2023.

rferraton commented 1 year ago

Thanks, better, but still difficult to read. I strongly think a Rosetta Stone is needed for Pandas guys and SQL guys (like me) to understand your syntax.

Without comprehension there will be less adoption.

I strongly suggest a 4/5-column table:

List the available commands with Pandas and SQL equivalents.

List the special operators with samples.

Maybe you could also use a PowerShell notebook (using Azure Data Studio or VS Code) to give a more complex scenario with several steps.

hkpeaks commented 1 year ago

Since I started my YouTube channel to share my knowledge of high-performance ETL and innovative syntax, I have gained more subscribers. The results are much better than the cooking channel I hosted before (https://www.youtube.com/@cuisine779/videos - only one subscriber). Therefore, updating the source code and releasing it to GitHub is not my main priority anymore. Software does not have to be actually used by users. It is like cooking videos, where the food does not have to be eaten; some people just enjoy watching. Similarly, some users like to watch software demos without using the software.

This demo, "Open and View 250GB CSV File Instantly" (https://youtu.be/lX2HKLDOfwk), has received the most view hours and subscriptions. Currently I do not use Google Ads to promote the channel. The command used in this video is ReadSample{}; in fact, I cannot find an equivalent command in Pandas or SQL. In CLI interactive mode, just type ReadSample filename.csv and the system will draw a sample of rows starting from a random position expressed as a percentage of the file. So I recorded another video (https://youtu.be/Qo25RWuugjA) to show how to use the CLI interactive mode. The CLI interactive mode supports a simplified version of the script; the system fills in the blanks automatically, so it should be a better training environment for initial learning of the script syntax. I plan to add more enhancements to this CLI interactive mode. The demo you see supports a script-recording function, so after you exit the system you can resume what you were doing previously. The disadvantage of my videos is that they have no voice. I will consider recording voice when the software becomes more mature.

When Peaks integrates with a PostgreSQL server, it may be a good time to compare Peaks syntax with SQL statements. To operate a PostgreSQL server from Peaks, I will use Peaks syntax and then auto-generate the SQL statements for the PostgreSQL server. I did the same development for my previous project WebNameSQL (integrated with MS SQL Server). Comparing with Pandas may be more difficult, as the Peaks script is totally incompatible with Pandas. The data transformation model designed for the Peaks script is mainly based on my past experience in accounting.

I actually spend very little time on coding. I spend more time on social networks and recording demo videos than coding. I love Go. It is my best software tool, better than Excel. I have never encountered any bugs with Go. It is very high quality compared to other data frame software. When I find bugs in my test cases, I can fix my code right away.

The current script parsing system is inherited from my last 2 projects. My research focus is the performance game of the extraction and query engine itself. I am also concerned about turning the project from a hobby into work; if it takes on more of a work nature, that will be the end of the project. So the substance is algorithms and data structures; the form is the script.

Polars has become popular. What do you think about handling the issues below: is it work or a hobby? https://github.com/pola-rs/polars/issues

After completing the extraction and query engine, my next step is to create some web apps with drag-and-drop and other GUI features. I admire Alteryx's GUI and I want to explore new ways to design GUIs in the future. My current syntax design will make it easier to develop a GUI later: user actions on the front end will automatically generate the script. I have wanted to write a single line of script that can do full E-T-L for 13 years. My former colleagues said it was impossible, but I proved them wrong.

As an alternative hobby, I am considering reinventing the wheel with cluster computing: different Peaks.exe instances working together over the network. But cloud services demand a credit card, and I consider it very risky to input credit card info in the browser, so I will use my local LAN to play this game.

rferraton commented 1 year ago

I think you could get a lot more adoption (and views) if you gave users a reference point. I am not saying change your syntax to be like SQL or Pandas; I am saying give some examples to help users use Peaks. That would be far better than 50 more views.

What you want / Peaks / T-SQL

Read data and use the result in a temp variable/table
  Peaks: Read{master.csv ~ master}
  T-SQL: SELECT * INTO #master FROM 'master.csv'

Read sample data
  Peaks: ReadSample{data.csv}
  T-SQL: SELECT * FROM data TABLESAMPLE(1 PERCENT)

Write data
  Peaks: Write{master ~ %mymaster.csv}
  T-SQL: SELECT * INTO #mymaster FROM #master

Join 2 tables using keys
  Peaks: BuildKeyValue{Master | Product, Style ~ MasterTableKeyValue} JoinKeyValue{Outbox/%data.csv | Product, Style => AllMatch(MasterTableKeyValue) ~ Result-BuildKeyValue.csv}
  T-SQL: SELECT * INTO #Result-BuildKeyValue FROM #data d INNER JOIN #master m ON (d.Product = m.Product AND d.Style = m.Style)

hkpeaks commented 1 year ago

Since I need to take care of my wife in the coming few months for her eye surgery, it may not be a good time to find users for Peaks. In fact I have formed an alternative view of Peaks' future: it can be similar to a YouTube cooking channel, which can be watched only, with no need to distribute food for viewers to eat. That way I can save time, as there is no need to support users. This afternoon I published a new demo video showing the power of %ExpandBy1000Time.csv using a 1-million-row file; the processing time is only 35s. After my wife completes her recovery, I will discuss with her whether to update the source code on GitHub and publish the 2nd pre-release. I think the first pre-release is sufficient for any user to test whether the high performance for billion-row data is real. But that version only supports newlines encoded as two bytes (13 & 10, CR LF); Polars outputs only a single byte (10, LF), so that version cannot read Polars output. The source code has since been updated with a new function to support %ExpandBy100Time.csv, but it does not yet cover the code for all query functions, e.g. GroupBy, Select, Distinct, etc.

WOW, it is very kind of you to offer me a sample. SQL Server uses a current database, so there is no need to specify it in the command. Peaks uses a current table (no database), which allows skipping the source and return table names, e.g. Filter{Column(condition)}.

For table joins, I have implemented 2 approaches to the setting.

This one is more similar to a SQL statement: JoinTable{Fact.csv | Quantity, Unit_Price => InnerJoin(Master) Multiply(Amount) ~ Result-JoinTable}. But I think using the 2 commands BuildKeyValue and JoinKeyValue tells people there are in fact 2 independent steps. Without building a key for one of the tables, there is no way to join.

To build a correct comparison, I need to host a PostgreSQL server (so the comparison is based on SQL statements that actually work on that server). I will do that when I implement the Peaks-PostgreSQL integration. I will implement parallel queries for PostgreSQL to achieve another milestone of acceleration. I am always playing the speed game. PostgreSQL is very popular in our country.

From time to time I see comparisons with SQL statements: https://www.linkedin.com/posts/aditi-khinvasara-acca-bb99601aa_sql-pyspark-equivalent-ugcPost-7091703222431588354-qNYO?utm_source=share&utm_medium=member_desktop

hkpeaks commented 1 year ago

I was not aware that SQL has TABLESAMPLE. ReadSample has 3 different implementations: the first and second draw a single sample, while the third draws multiple samples. The first one auto-assigns StartPosition and ByteLength.

ReadSample FileName.csv

ReadSample{StartPosition%(Number) ByteLength(Number)}

ReadSample{Repeat(Number) ByteLength(Number)}
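
For readers coming from SQL or Pandas, below is a rough, hypothetical Go sketch of the StartPosition%/ByteLength idea: seek to a percentage of the file size, read a fixed number of bytes, and drop the partial first line so the sample starts on a row boundary. This is an illustration of the concept only, not the actual Peaks implementation.

package main

import (
	"bytes"
	"fmt"
	"os"
)

// readSample reads byteLength bytes starting at startPercent of the file size,
// then trims the partial first line so the sample begins on a row boundary.
// Illustrative sketch only; names and behaviour are assumptions, not Peaks code.
func readSample(path string, startPercent float64, byteLength int) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return nil, err
	}

	offset := int64(float64(info.Size()) * startPercent / 100.0)
	buf := make([]byte, byteLength)
	n, err := f.ReadAt(buf, offset)
	if err != nil && n == 0 {
		return nil, err
	}
	buf = buf[:n]

	// If we did not start at the beginning of the file, skip the partial first line.
	if offset > 0 {
		if i := bytes.IndexByte(buf, '\n'); i >= 0 {
			buf = buf[i+1:]
		}
	}
	return buf, nil
}

func main() {
	// File name taken from the example above; any CSV path would do.
	sample, err := readSample("Outbox/Result-FilterFolderLake.csv", 0, 100000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("sampled %d bytes\n", len(sample))
}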

hkpeaks commented 1 year ago

Thanks for your recent reactions to my project; they have given me new ideas. I will rewrite the whole system to support cluster computing. Most important is to redesign the script and streaming model from the command level to the user-defined function level. See the examples below.

UserDefineFunctionName: SourceFile/Table ~ ResultFile/Table | Command: Setting

ExpandFile: Fact.csv ~ 1BillionRows.csv | ExpandFactor: 123

JoinScenario1: 1BillionRows.csv ~ Test1Results.csv | JoinTable: Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) | OrderBy: Date(D) => CreateFolderLake(Shop) | Select: Date,Shop,Style,Product,Quantity,Amount

BuildKeyValueTable: Master.csv ~ KeyValueTable | BuildKeyValue: Product, Style

JoinScenario2: 1BillionRows.csv ~ Test2AResults.csv | JoinKeyValue: Product, Style => AllMatch(KeyValueTable) | AddColumn: Quantity, Unit_Price => Multiply(Amount) | Filter: Amount(Float > 50000) | GroupBy: Product, Style => Count() Sum(Quantity) Sum(Amount) | OrderBy: Shop(A)Product(A)Date(D)

SplitFile: Test1Results.csv ~ FolderLake | CreateFolderLake: Shop

FilterFolder: Outbox/FolderLake/S15/*.csv ~ Result-FilterFolderLake.csv | Filter: Product(222..888) Style(=F)

ReadSample2View: Outbox/Result-FilterFolderLake.csv ~ SampleTable | ReadSample: StartPosition%(0) ByteLength(100000) | View

hkpeaks commented 1 year ago

After careful consideration, comparing the new ETL Framework with Polars scripts is easier for me, as the Polars API is more granular than SQL statements. In order to ensure accuracy of the comparison, users can configure the new ETL Framework to run Polars.