time-series-machine-learning / tsml-repo

Discussion, problems and donations of data hosted at
http://www.timeseriesclassification.com
GNU General Public License v3.0

[DONATION] Siemens smart automation data #65

Open adbith4 opened 3 years ago

adbith4 commented 3 years ago

Please give a brief description of the data: The research plant Smart Automation (SmA) features four tanks as part of the infrastructure backbone which has been used to simulate several fault types. A four-tank batch process pumps water from a reservoir tank to three small tanks. The plant is controlled using a sequential flow chart (SFC) simulating a batch process according to ISA-88.

Where can we get the data? https://code.siemens.com/thomas.bierweiler/faultsof4-tankbatchprocess/-/blob/main/SmA-Four-Tank-Batch-Process.csv

Is there a publication related to the data? A description is available at https://code.siemens.com/thomas.bierweiler/faultsof4-tankbatchprocess/-/blob/main/DescriptionOfSmAFourTankBatchProcess.pdf

Is your data multivariate or univariate? Multivariate.

Is there a default train/test split for the data? There is no default split.

Are there any missing values/padding? The time series are not of equal length.

Is there a picture for the data we can put on the website? Yes, see https://code.siemens.com/thomas.bierweiler/faultsof4-tankbatchprocess/-/blob/main/SmA-Overview.png

How would you like the donation attributed on the website? Courtesy of Siemens AG, Digital Industries, Process Automation, DI PA TI DPO.

Can you provide a paragraph description, including the meaning of the class values, for the website? The four-tank batch process has been operated in normal operation and in 9 fault types. The fault types comprise leakage, stuck valve, no venting, low and high drive speed of a pump and a 2nd pump in operation.

Contact: Siemens AG, DI PA TI DPO. Thomas Bierweiler, thomas.bierweiler@siemens.com; Dr. Daniel Labisch, daniel.labisch@siemens.com

TonyBagnall commented 2 years ago

thanks for this, sorry for the long delay! We will look at it asap

thomasbierweiler commented 2 years ago

The first version of the supplied data contained incomplete batches. I've cleaned the data. You'll find the latest version at https://code.siemens.com/thomas.bierweiler/faultsof4-tankbatchprocess/-/blob/main/SmA-Four-Tank-Batch-Process_V2.csv I apologize for any inconvenience. Thomas

a-pasos-ruiz commented 2 years ago

Hello, it looks like the links mentioned require logging in to a Siemens server.

thomasbierweiler commented 2 years ago

DescriptionOfSmAFourTankBatchProcess.pdf

thomasbierweiler commented 2 years ago

I'm sorry for the trouble with the Siemens server. I intended to make the repository public. What is the maximal upload size for this post? Uploading the data fails.

thomasbierweiler commented 2 years ago

I've uploaded the data and the description to another repository. https://github.com/thomasbierweiler/FaultsOf4-TankBatchProcess

TonyBagnall commented 2 years ago

this is now in the repo, thanks for the donation, sorry it took us so long! http://timeseriesclassification.com/description.php?Dataset=Siemens

TonyBagnall commented 2 years ago

Hi, sorry about that, I'll fix donator today and maybe talk formatting with @a-pasos-ruiz

TonyBagnall commented 2 years ago

I'll close this @thomasbierweiler feel free to reopen it if you want the data treated differently.

MarcelReinert commented 1 year ago

Hi,

I'm currently a working student at Siemens and have been working with the 4-Tank Batch Process dataset for quite some time now. Unfortunately, the data was not formatted properly when it was put in the tsml-repo. That is why I want to provide a more detailed, data-science-oriented description of the dataset than the information you can find at https://github.com/thomasbierweiler/FaultsOf4-TankBatchProcess/blob/main/DescriptionOfSmAFourTankBatchProcess.pdf. First of all, the dataset contains around 220 multidimensional time series of unequal length (length is around 9000-1300 timestamps/seconds). Important: the term 'batch', which appears often in the data description, refers to one process run performed by the batch production plant, not to a batch of samples/time series or anything else you may know from data science terminology. That means, in process manufacturing language, the dataset contains recorded data from ~220 batches.

These time series/batches belong to 10 different classes/process conditions (normal operation + 9 different induced fault types). The label of each time series is given in the variable 'DeviationID ValueY'. These labels must be distinguished from the values in the variable 'CuStepNo ValueY', which are currently used as y-labels when you download the dataset from the tsml repo. The variable 'CuStepNo ValueY' contains the current step number for each timestamp of the time series. In one batch, four consecutive process steps are executed, and the process plant performs different operations in these steps. For more information on the different process steps, see the process description mentioned above.
Note that in 'CuStepNo ValueY' the steps are not labeled '1', '2', '3', '4' but '1', '7', '8' and '3' (the steps are performed in this order).

The current step number given in 'CuStepNo ValueY' is important, because by using this information you can analyze the timeseries of the different process steps individually. This is crucial for several ML tasks, as you may need to compare the time series of the same process state. As you can surely imagine, the timeseries of variable 'CuStepNo ValueY' themselves are not suitable for model training purposes.
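To illustrate how the step number can be used to look at each process step separately, here is a minimal sketch in plain Python. It assumes one batch is available as a list of per-timestamp dictionaries; the sensor name "Level" and the sample values are made up for illustration, and only the column name 'CuStepNo ValueY' comes from the dataset.

```python
from itertools import groupby

STEP_COLUMN = "CuStepNo ValueY"


def split_by_step(rows):
    """Split one batch into consecutive runs that share the same step number.

    `rows` is a list of per-timestamp dicts. The step column is removed
    from the returned segments because it is not a feature.
    """
    segments = []
    for step, run in groupby(rows, key=lambda row: row[STEP_COLUMN]):
        features = [
            {name: value for name, value in row.items() if name != STEP_COLUMN}
            for row in run
        ]
        segments.append((step, features))
    return segments


# Synthetic batch: "Level" is a hypothetical sensor, the values are invented.
batch = [
    {"CuStepNo ValueY": 1, "Level": 0.0},
    {"CuStepNo ValueY": 1, "Level": 0.1},
    {"CuStepNo ValueY": 7, "Level": 0.5},
    {"CuStepNo ValueY": 8, "Level": 0.9},
    {"CuStepNo ValueY": 8, "Level": 1.0},
    {"CuStepNo ValueY": 3, "Level": 0.2},
]

segments = split_by_step(batch)
step_order = [step for step, _ in segments]
print(step_order)
```

Grouping consecutive runs rather than sorting by step number keeps the execution order 1, 7, 8, 3 intact.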

I hope this description helps and the dataset in the tsml-repo can be fixed. If you have any questions I will be happy to answer them :)

TonyBagnall commented 1 year ago

thanks very much for this, it has not been "officially" released. I'll take it down and we can hopefully reformat it more sensibly.

TonyBagnall commented 1 year ago

hi, I have taken the Siemens data down for now. If you could formulate it as a classification problem that would be great, happy to help and get it put back in.

MarcelReinert commented 1 year ago

Hello, First of all sorry for the late answer.

The easiest way to use the dataset for classification is to use the full time series. For that, the variable 'DeviationID ValueY' can be used as the class label and the rest of the data as the feature vectors (except the variable 'CuStepNo ValueY', which has to be discarded).
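A rough sketch of this full-series setup, in plain Python, might look as follows. It rests on two assumptions that are not confirmed in this thread: that rows can be grouped into batches by some identifier (the column name "BatchID" is hypothetical), and that 'DeviationID ValueY' is constant within a batch. The sensor names and values are invented.

```python
from collections import defaultdict

LABEL_COLUMN = "DeviationID ValueY"
STEP_COLUMN = "CuStepNo ValueY"
BATCH_COLUMN = "BatchID"  # hypothetical: the real CSV may identify batches differently


def build_full_series_dataset(rows):
    """Return (X, y): one variable-length multivariate series and one label per batch."""
    batches = defaultdict(list)
    for row in rows:
        batches[row[BATCH_COLUMN]].append(row)

    excluded = {LABEL_COLUMN, STEP_COLUMN, BATCH_COLUMN}
    X, y = [], []
    for batch_id in sorted(batches):
        batch_rows = batches[batch_id]
        labels = {row[LABEL_COLUMN] for row in batch_rows}
        if len(labels) != 1:
            # Assumption check: exactly one class label per batch.
            raise ValueError(f"batch {batch_id} has several labels: {labels}")
        feature_names = [name for name in batch_rows[0] if name not in excluded]
        X.append([[row[name] for name in feature_names] for row in batch_rows])
        y.append(labels.pop())
    return X, y


# Synthetic rows: "Level" and "Flow" are made-up sensors, labels 0 and 3 are placeholders.
rows = [
    {"BatchID": 1, "CuStepNo ValueY": 1, "DeviationID ValueY": 0, "Level": 0.0, "Flow": 1.0},
    {"BatchID": 1, "CuStepNo ValueY": 7, "DeviationID ValueY": 0, "Level": 0.4, "Flow": 1.1},
    {"BatchID": 1, "CuStepNo ValueY": 8, "DeviationID ValueY": 0, "Level": 0.8, "Flow": 0.9},
    {"BatchID": 2, "CuStepNo ValueY": 1, "DeviationID ValueY": 3, "Level": 0.0, "Flow": 0.2},
    {"BatchID": 2, "CuStepNo ValueY": 7, "DeviationID ValueY": 3, "Level": 0.1, "Flow": 0.2},
]

X, y = build_full_series_dataset(rows)
print(y, [len(series) for series in X])
```

The series are kept as nested lists of unequal length, so no padding or truncation decision is made here.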

A more complicated but much more insightful alternative is to create a classification problem for every process step. As there are 4 batch process steps, 4 different classification problems can be created by splitting the time series with 'CuStepNo ValueY'. As above, the variable 'DeviationID ValueY' gives the class labels and the rest of the dataset gives the feature vectors (except 'CuStepNo ValueY', of course). However, not every fault type is present in all four process steps. Information on which classes should be included in the classification for each step can be found in the description: https://github.com/thomasbierweiler/FaultsOf4-TankBatchProcess/blob/main/DescriptionOfSmAFourTankBatchProcess.pdf.
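The per-step variant could be sketched like this, again in plain Python. The "BatchID" column is a hypothetical batch identifier, the sensor "Level" is made up, and the `classes_per_step` mapping below contains placeholder values only; the real mapping has to be taken from the description PDF. The sketch also assumes each step occurs as one contiguous run per batch.

```python
from collections import defaultdict

LABEL_COLUMN = "DeviationID ValueY"
STEP_COLUMN = "CuStepNo ValueY"
BATCH_COLUMN = "BatchID"  # hypothetical batch identifier
EXCLUDED = {LABEL_COLUMN, STEP_COLUMN, BATCH_COLUMN}


def build_step_problems(rows, classes_per_step):
    """Return {step: (X, y)} with one classification problem per process step.

    A batch contributes to a step's problem only if its label is listed
    for that step in `classes_per_step`.
    """
    segments = defaultdict(list)  # (batch, step) -> feature vectors over time
    labels = {}                   # batch -> class label
    for row in rows:
        key = (row[BATCH_COLUMN], row[STEP_COLUMN])
        segments[key].append([row[name] for name in row if name not in EXCLUDED])
        labels[row[BATCH_COLUMN]] = row[LABEL_COLUMN]

    problems = {step: ([], []) for step in classes_per_step}
    for (batch_id, step), series in sorted(segments.items()):
        label = labels[batch_id]
        if step in problems and label in classes_per_step[step]:
            X, y = problems[step]
            X.append(series)
            y.append(label)
    return problems


# Synthetic rows and a PLACEHOLDER class mapping (not taken from the description).
rows = [
    {"BatchID": 1, "CuStepNo ValueY": 1, "DeviationID ValueY": 0, "Level": 0.0},
    {"BatchID": 1, "CuStepNo ValueY": 1, "DeviationID ValueY": 0, "Level": 0.1},
    {"BatchID": 1, "CuStepNo ValueY": 7, "DeviationID ValueY": 0, "Level": 0.5},
    {"BatchID": 2, "CuStepNo ValueY": 1, "DeviationID ValueY": 5, "Level": 0.0},
    {"BatchID": 2, "CuStepNo ValueY": 7, "DeviationID ValueY": 5, "Level": 0.3},
    {"BatchID": 2, "CuStepNo ValueY": 7, "DeviationID ValueY": 5, "Level": 0.4},
]
classes_per_step = {1: {0, 5}, 7: {0}}

problems = build_step_problems(rows, classes_per_step)
for step, (X, y) in problems.items():
    print(step, y, [len(series) for series in X])
```

Passing the class mapping in as an argument keeps the step-specific class selection out of the code, so it can be filled in from the description without changing the function.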

I hope this explanation is useful.