rferraton opened this issue 1 year ago
I am planning to create a tutorial.io site. I have noticed that new visitors do not read a long README. Currently, whether the dataset is larger than memory affects which combinations of commands can be used.
So I am redesigning the streaming engine to separate the T phase from the E and L phases. The current system will focus on the Transformation phase, while the new streaming engine will be responsible for the Extraction and Load phases. The disadvantage is that an app will need to configure three script files, one each for E, T, and L. The E script also defines the names of the T and L scripts. The E phase is responsible for distributing table partitions to T, and T returns the dataset to E for the L phase.
After this change, commands for the transformation phase will not be affected by the file format or by whether the dataset is larger than memory, and I will add file-format support in the E and L phases. The E-L phases can run inside Peaks.exe or in a separate runtime, connected by gRPC. I believe this new streaming model can help solve the large-dataset issues I have found in DuckDB and Polars. My plan is for the new streaming engine to support trillions of rows, so that partitions for the transformation phase can be processed by more than one computing device. I hope this move can drive more open-source software into the age of distributed computing.
Demo Video: https://youtu.be/5Jhd1WwgfYg
Read{Master.csv ~ Master} Read{Fact.csv ~ Table} Write{Table ~ %ExpandBy10Time.csv}
JoinTable{Outbox/%ExpandBy10Time.csv | Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) ~ Result-JoinTable} Select{Result-JoinTable | Date,Shop,Product,Quantity,Amount ~ Result-SelectColumn}
BuildKeyValue{Master | Product, Style ~ MasterTableKeyValue} JoinKeyValue{Outbox/%ExpandBy10Time.csv | Product, Style => AllMatch(MasterTableKeyValue) ~ Result-BuildKeyValue} AddColumn{Result-BuildKeyValue | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn}
Filter{Result-AddColumn | Amount(Float > 50000) ~ Result-Filter} FilterUnmatch{Result-AddColumn | Amount(Float > 50000) ~ Result-FilterUnmatch}
Distinct{Result-Filter | Date, Shop, Product, Style ~ Result-Distinct-Match} Distinct{Result-FilterUnmatch | Date, Shop, Product, Style ~ Result-Distinct-Unmatch} OrderBy{Result-Distinct-Unmatch | Shop(A)Product(A)Date(D) ~ Result-Distinct-Unmatch-OrderAAD}
GroupBy{Result-Filter | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match} GroupBy{Result-FilterUnmatch | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Unmatch}
Write{Result-JoinTable ~ Result-JoinTable.csv} Write{Result-SelectColumn ~ Result-SelectColumn.csv} Write{MasterTableKeyValue ~ MasterTableKeyValue.csv} Write{Result-AddColumn ~ Result-AddColumn.csv} Write{Result-Filter ~ Result-Filter.csv} Write{Result-FilterUnmatch ~ Result-FilterUnmatch.csv} Write{Result-Distinct-Match ~ Result-Distinct-Match.csv} Write{Result-Distinct-Unmatch ~ Result-Distinct-Unmatch.csv} Write{Result-Distinct-Unmatch-OrderAAD ~ Result-Distinct-Unmatch-OrderAAD.csv} Write{Result-GroupBy-Match ~ Result-GroupBy-Match.csv} Write{Result-GroupBy-Unmatch ~ Result-GroupBy-Unmatch.csv}
Read{Master.csv ~ Master} Read{Fact.csv ~ Table} Write{Table ~ %ExpandBy123Time.csv}
JoinTable{Outbox/%ExpandBy123Time.csv | Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) ~ Result-JoinTable.csv} OrderBy{Outbox/Result-JoinTable.csv | Date(D) => CreateFolderLake(Shop) ~ Result-JoinTable-OrderBy.csv} Select{Outbox/Result-JoinTable-OrderBy.csv | Date,Shop,Product,Quantity,Amount ~ Result-SelectColumn.csv}
BuildKeyValue{Master | Product, Style ~ MasterTableKeyValue} JoinKeyValue{Outbox/%ExpandBy123Time.csv | Product, Style => AllMatch(MasterTableKeyValue) ~ Result-BuildKeyValue.csv} AddColumn{ Outbox/Result-BuildKeyValue.csv | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn.csv}
Filter{Outbox/Result-AddColumn.csv | Amount(Float > 50000) ~ Result-Filter.csv} FilterUnmatch{Outbox/Result-AddColumn.csv | Amount(Float > 50000) ~ Result-FilterUnmatch.csv}
Distinct{Outbox/Result-Filter.csv | Date, Shop, Product, Style ~ Result-Distinct-Match} Write{Result-Distinct-Match ~ Result-Distinct-Match.csv} Distinct{Outbox/Result-FilterUnmatch.csv | Date, Shop, Product, Style ~ Result-Distinct-Unmatch} Write{Result-Distinct-Unmatch ~ Result-Distinct-Unmatch.csv} OrderBy{Outbox/Result-Distinct-Unmatch.csv | Shop(A)Product(A)Date(D) ~ Result-Distinct-Unmatch-OrderAAD.csv}
GroupBy{Outbox/Result-Filter.csv | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match} Write{Result-GroupBy-Match ~ Result-GroupBy-Match.csv} GroupBy{Outbox/Result-FilterUnmatch.csv | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Unmatch} Write{Result-GroupBy-Unmatch ~ Result-GroupBy-Unmatch.csv}
CreateFolderLake{Outbox/Result-JoinTable-OrderBy.csv | Shop ~ FolderLake} Filter{Outbox/FolderLake/S15/*.csv | Product(222..888) Style(=F) ~ Result-FilterFolderLake.csv}
ReadSample{Outbox/Result-FilterFolderLake.csv | StartPosition%(0) ByteLength(100000) ~ SampleTable} View{SampleTable} ReadSample{Outbox/Result-Distinct-Unmatch-OrderAAD.csv ~ SampleTable} View{SampleTable}
Thanks for the examples, but I am not familiar with the Peaks syntax. What do `|`, `=>`, `%`, and `~` mean? Maybe simpler examples would also help. Maybe you could use colors to explain a little more, associating code with plain-English explanations? Maybe you could also use SQL equivalents to explain...
Command{Extraction | Transformation ~ Load}
In Write{Table ~ %ExpandBy123Time.csv}, the % prefix is used to avoid collisions with user file names.
=> is used to separate the transformation setting into two distinct parts, e.g. JoinTable{Outbox/%ExpandBy123Time.csv | Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) ~ Result-JoinTable.csv} GroupBy{Result-Filter | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match} AddColumn{Result-BuildKeyValue | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn}
Before => you normally refer to columns of a table; after => you usually add columns and calculate column results.
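In pandas terms, the AddColumn example above is roughly the following (just an analogue for explanation, not how Peaks actually runs it):

```python
import pandas as pd

# Rough pandas analogue of
#   AddColumn{Result-BuildKeyValue | Quantity, Unit_Price => Multiply(Amount) ~ Result-AddColumn}
# The columns named before "=>" are the inputs being referred to; the part
# after "=>" adds the calculated result column.
df = pd.read_csv("Outbox/Result-BuildKeyValue.csv")
df["Amount"] = df["Quantity"] * df["Unit_Price"]
df.to_csv("Outbox/Result-AddColumn.csv", index=False)
```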
Since I find SQL statements very complex and not flexible, I invented this new ETL syntax. The Pandas way of setting things up is also complex, but more flexible.
A command list has been added to the README. The current development version has 18 commands. The target date for publishing the 2nd release is October 2023.
Thanks, that is better, but it is still difficult to read. I strongly think a Rosetta Stone is needed for pandas people and SQL people (like me) to understand your syntax.
Without comprehension there will be less adoption.
I strongly suggest a table with 4 or 5 columns:
- List the available commands with their pandas and SQL equivalents.
- List the special operators with samples.

Maybe you could also use a PowerShell notebook (in Azure Data Studio or VS Code) to show a more complex scenario with several steps.
Since I started my YouTube channel to share my knowledge of high-performance ETL and this innovative syntax, I have gained more subscribers. The results are much better than the cooking channel I hosted before (https://www.youtube.com/@cuisine779/videos - only one subscriber). Therefore, updating the source code and releasing it on GitHub is no longer my main priority. Software does not have to be actually used by users. It is like cooking videos: the food does not have to be eaten, and some people just enjoy watching. Similarly, some users like to watch software demos without using the software.
This demo "Open and View 250GB CSV File Instantly" https://youtu.be/lX2HKLDOfwk received most view hours and subscription. Currently I have no use Google Ads to promote the channel. The command use in this video is "ReadSample{}" in fact I cannot find equvalent command in Pandas and SQL statement. In CLI interactive mode, just type ReadSample filename.csv, the system will draw sample of rows starting from particular % of file position randomly. So I recorded another video https://youtu.be/Qo25RWuugjA to show how to use CLI interative mode. The CLI interactive mode support a simplifed version of script, the system will fill in the blank automatically. So it shall be a better training environment for initial learning of the script syntax. I plan to add more enhancement for this CLI interactive mode. The demo you see it supports recording script function. So after you exit the system, you can resume what you are doing previously. The disadvantage of my videos is no voice. I will consider to record voice when the software become more mature.
When Peaks is integrated with a PostgreSQL server, it may be a good time to compare Peaks syntax with SQL statements. To operate a PostgreSQL server from Peaks, I will use Peaks syntax and then auto-generate the SQL statements for the PostgreSQL server. I did the same development for my previous project WebNameSQL (integrated with MS SQL Server). Comparing with Pandas may be more difficult, as the Peaks script is totally incompatible with Pandas script. The data transformation model designed for the Peaks script is mainly based on my past experience in accounting.
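As a toy illustration of the auto-generation idea (the parsed form shown here is hypothetical, not the real parser output):

```python
# Toy illustration of generating a PostgreSQL statement from an already-parsed
# GroupBy command. The function and field names are hypothetical placeholders.
def groupby_to_sql(source, keys, aggregates, result):
    key_list = ", ".join(keys)
    agg_list = ", ".join(aggregates)
    return (f"CREATE TABLE {result} AS "
            f"SELECT {key_list}, {agg_list} "
            f"FROM {source} GROUP BY {key_list};")

# GroupBy{Result-Filter | Product, Style => Count() Sum(Quantity) Sum(Amount) ~ Result-GroupBy-Match}
print(groupby_to_sql("result_filter",
                     ["Product", "Style"],
                     ["COUNT(*)", "SUM(Quantity)", "SUM(Amount)"],
                     "result_groupby_match"))
```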
I actually spend very little time on coding. I spend more time on social networks and recording demo videos than on coding. I love Go. It is my best software tool, better than Excel. I have never encountered any bugs with Go. It is very high quality compared to other data-frame software. When I find bugs in my test cases, I can fix my code right away.
The current script parsing system is inherited from my last two projects. My research focus is the performance game of the extraction and query engine itself. I am also wary of the project turning from a hobby into work; if it takes on too much of the nature of work, that will be the end of the project. So the substance is the algorithms and data structures, and the form is the script.
Polars has become popular. What do you think about handling the issues below: is that work, or a hobby? https://github.com/pola-rs/polars/issues
After completing the extraction and query engine, my next step is to create some web apps with drag-and-drop and other GUI features. I admire Alteryx's GUI and I want to explore new ways to design GUIs in the future. My current syntax design will make it easier to develop a GUI later: the user's actions on the front end will automatically generate the script. Ever since 13 years ago I have wanted to write a single line of script that can do a full E-T-L. My former colleagues said it was impossible, but I proved them wrong.
An alternative hobby I am considering is reinventing the wheel of cluster computing: different Peaks.exe instances working together over the network. But cloud services demand a credit card, and I consider it very risky to enter credit card information in the browser, so I will use my local LAN to play this game.
I think you could get a lot more adoption (and views) if you gave users a reference point. I am not saying you should change your syntax to be like SQL or Pandas; I am saying you should give some examples to help users use Peaks. That would be far better than 50 more views.
| What you want | Peaks | T-SQL |
|---|---|---|
| Read data and use the result in a temp variable/table | Read{master.csv ~master} | SELECT * INTO #master FROM 'master.csv' |
| Read sample data | ReadSample{data.csv} | SELECT * FROM data TABLESAMPLE(1 PERCENT) |
| Write data | Write{master~ %mymaster.csv} | SELECT * INTO #mymaster FROM #master |
| Join 2 tables using keys | BuildKeyValue{Master \| Product, Style ~ MasterTableKeyValue} JoinKeyValue{Outbox/%data.csv \| Product, Style => AllMatch(MasterTableKeyValue) ~ Result-BuildKeyValue.csv} | SELECT * INTO #Result-BuildKeyValue FROM #data d INNER JOIN #master m ON (d.Product = m.Product AND d.Style = m.Style) |
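For the pandas column, here are my rough guesses for the same rows (not tested against Peaks, just from reading your examples):

```python
import pandas as pd

# Rough pandas guesses for the table rows above (assuming the Peaks semantics shown):
master = pd.read_csv("master.csv")                                  # Read{master.csv ~master}
sample = pd.read_csv("data.csv", nrows=1000)                        # ReadSample{data.csv} (approximate: first rows, not a random sample)
master.to_csv("mymaster.csv", index=False)                          # Write{master~ %mymaster.csv}
data = pd.read_csv("Outbox/%data.csv")
joined = data.merge(master, on=["Product", "Style"], how="inner")   # BuildKeyValue + JoinKeyValue with AllMatch
```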
Since I need to take care of my wife during the coming few months for her eye surgery, it may not be a good time to find users for Peaks. In fact I have formed an alternative view of Peaks' future: it can be like a YouTube cooking channel, which can simply be watched; there is no need to distribute the food for viewers to eat. That way I save time because I do not need to support users. This afternoon I published a new demo video showing the power of %ExpandBy1000Time.csv using a 1-million-row file; the processing time is only 35 s. After my wife completes her recovery, I will discuss with her whether to update the source code on GitHub and publish a 2nd pre-release. I think the first pre-release is sufficient for any user to test whether the high performance on billion-row data is real. However, that version only supports the two-byte line ending (13 and 10, i.e. CR LF). Polars outputs only the single byte 10 (LF), so that version cannot read Polars output. The source code has since been updated with the new function to support %ExpandBy100Time.csv, but it does not yet cover all query functions, e.g. GroupBy, Select, Distinct, etc.
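In the meantime, anyone who needs to read Polars output with the first pre-release can convert the line endings first; a minimal sketch (the file names are placeholders):

```python
# Workaround sketch: convert an LF-only CSV (e.g. Polars output) to CRLF so the
# first pre-release, which expects the two-byte 13+10 line ending, can read it.
with open("polars_output.csv", "rb") as src:
    data = src.read()
with open("polars_output_crlf.csv", "wb") as dst:
    dst.write(data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n"))
```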
WOW, it is very kind of you to offer me a sample. SQL Server uses the current database, so there is no need to specify it in the command. Peaks uses a current table (with no database), which allows you to skip specifying the source and return table names, e.g. Filter{Column(condition)}.
For table joins I have implemented two approaches to the settings.
This one is more similar to a SQL statement: JoinTable{Fact.csv | Quantity, Unit_Price => InnerJoin(Master) Multiply(Amount) ~ Result-JoinTable}. But I think using the two commands BuildKeyValue and JoinKeyValue tells people that there are in fact two independent steps. Without building a key for one of the tables, there is no way to join.
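As a toy illustration of why the key-building step matters (a plain Python sketch of the idea, not my actual Go implementation):

```python
import csv

# Step 1: BuildKeyValue{Master | Product, Style ~ MasterTableKeyValue}
# Build a lookup keyed by (Product, Style). Simplified: assumes the keys are unique in Master.
key_value = {}
with open("Master.csv", newline="") as f:
    for row in csv.DictReader(f):
        key_value[(row["Product"], row["Style"])] = row

# Step 2: JoinKeyValue{Fact.csv | Product, Style => AllMatch(MasterTableKeyValue) ~ Result}
# Probe the lookup row by row; without the key table built first, there is nothing to probe.
matches = []
with open("Fact.csv", newline="") as f:
    for row in csv.DictReader(f):
        master_row = key_value.get((row["Product"], row["Style"]))
        if master_row is not None:
            matches.append({**master_row, **row})
```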
To build a correct comparison, I need to host a PostgreSQL server (so the comparison is based on SQL statements that actually run on that server). I will do that when I implement the Peaks-PostgreSQL integration. I will also implement parallel queries for PostgreSQL to reach another acceleration milestone. I am always playing the speed game. PostgreSQL is very popular in our country.
From time to time I see comparisons with SQL statements, e.g. https://www.linkedin.com/posts/aditi-khinvasara-acca-bb99601aa_sql-pyspark-equivalent-ugcPost-7091703222431588354-qNYO?utm_source=share&utm_medium=member_desktop
I was not aware that SQL has TABLESAMPLE. ReadSample has three different implementations; the first two take a single sample, while the third takes multiple samples. The first form auto-assigns StartPosition and ByteLength.
ReadSample FileName.csv
ReadSample{StartPosition%(Number) ByteLength(Number)}
ReadSample{Repeat(Number) ByteLength(Number)}
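A rough sketch of the idea behind the third, multiple-sample form (again just an illustration, not the real implementation):

```python
import os
import random

# ReadSample{Repeat(Number) ByteLength(Number)}: draw several random byte ranges
# from the file and keep the complete lines from each chunk.
def read_samples(path, repeat=5, byte_length=100_000):
    size = os.path.getsize(path)
    samples = []
    with open(path, "rb") as f:
        for _ in range(repeat):
            f.seek(random.randrange(max(size - byte_length, 1)))
            lines = f.read(byte_length).split(b"\n")
            samples.append(lines[1:-1])   # drop the partial first and last lines
    return samples

samples = read_samples("FileName.csv", repeat=3, byte_length=100_000)
```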
Thanks for your recent reactions to my project; they gave me a new round of brainstorming. I will rewrite the whole system to support cluster computing. Most importantly, I will redesign the script and streaming model from the command level to the user-defined-function level. See the examples below.
UserDefineFunctionName: SourceFile/Table ~ ResultFile/Table | Command: Setting
ExpandFile: Fact.csv ~ 1BillionRows.csv | ExpandFactor: 123
JoinScenario1: 1BillionRows.csv ~ Test1Results.csv | JoinTable: Quantity, Unit_Price => InnerJoin(Master)Multiply(Amount) | OrderBy: Date(D) => CreateFolderLake(Shop) | Select: Date,Shop,Style,Product,Quantity,Amount
BuildKeyValueTable: Master.csv ~ KeyValueTable | BuildKeyValue: Product, Style
JoinScenario2: 1BillionRows.csv ~ Test2AResults.csv | JoinKeyValue: Product, Style => AllMatch(KeyValueTable) | AddColumn: Quantity, Unit_Price => Multiply(Amount) | Filter: Amount(Float > 50000) | GroupBy: Product, Style => Count() Sum(Quantity) Sum(Amount) | OrderBy: Shop(A)Product(A)Date(D)
SplitFile: Test1Results.csv ~ FolderLake | CreateFolderLake: Shop
FilterFolder: Outbox/FolderLake/S15/*.csv ~ Result-FilterFolderLake.csv | Filter: Product(222..888) Style(=F)
ReadSample2View: Outbox/Result-FilterFolderLake.csv ~ SampleTable | ReadSample: StartPosition%(0) ByteLength(100000) | View
After careful consideration, comparing the new ETL framework with Polars script is easier for me, as the Polars API is more granular than SQL statements. To ensure the accuracy of the comparison, users can configure the new ETL framework to run Polars.
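For example, JoinScenario2 above would map roughly to the following Polars script (my sketch of the intended mapping; AllMatch is treated as an inner join and Multiply(Amount) as Quantity * Unit_Price):

```python
import polars as pl

# Rough Polars analogue of JoinScenario2 (a sketch of the intended mapping only).
fact = pl.scan_csv("1BillionRows.csv")
master = pl.scan_csv("Master.csv")

result = (
    fact.join(master, on=["Product", "Style"], how="inner")                         # JoinKeyValue => AllMatch(KeyValueTable)
        .with_columns((pl.col("Quantity") * pl.col("Unit_Price")).alias("Amount"))  # AddColumn => Multiply(Amount)
        .filter(pl.col("Amount") > 50000)                                           # Filter: Amount(Float > 50000)
        .group_by("Product", "Style")                                               # GroupBy: Product, Style
        .agg(pl.len().alias("Count"), pl.sum("Quantity"), pl.sum("Amount"))         # Count() Sum(Quantity) Sum(Amount)
        .sort("Product", "Style")                                                   # OrderBy (approximated on the grouped keys)
)
result.collect().write_csv("Test2AResults.csv")
```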
I suggest adding more examples in the README and/or in a dedicated folder, covering combinations such as:
Read/Select/Write, Read/Select/OrderBy/Write, Read/Select/Filter/Write, Read/Select/Filter/GroupBy/Write, Read/Select/Join/Write, Read/Select/Join/Filter/GroupBy/OrderBy/Write, Read/Select/Split/Write, ...