Benchmark Algorithm

omarrana commented 7 years ago

https://drive.google.com/file/d/0B7FScfr1FLRDQmhaRDZDYnRyRTQ/view?usp=sharing

Here is the IO diagram:

The algorithm flow is as follows:

1- config.ttl as input which shows what type of heterogeneity documents should be generated. Generate seed.aml.

2- Choose one of:
a) Manual elements: give parameters for the initial seed and data, e.g. RoleClass, InternalElement, etc.
b) Template: choose one of the preconfigured templates, which will generate seed.aml based on the types of heterogeneity mentioned in config.ttl.

One question arises here: we could either have pre-written generation code or use our already manually created database of testbeds as one of the inputs. Both will give the same result. With pre-written code, seed.aml would be generated on the fly by Java code, e.g. with 2 InternalElements.

3- For seed.aml, create two or more files with the heterogeneity mentioned in config.ttl, using:
a) RDF-based knowledge
b) Java-based code

As you said, we should have our own knowledge base. I don't understand what should be in the RDF-based knowledge that would help us generate the two files. For example, using b) Java code, we can create M1 heterogeneity by converting float to integer or date to datetime in the two files. How can this be expressed as RDF-based knowledge?

4- Exit
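
The M1 example from step 3 b) can be sketched in Java roughly as follows. This is a minimal sketch, not the project's code: the class and method names are hypothetical, and it assumes attribute values arrive as strings and that dates are in ISO format (`yyyy-MM-dd`).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not taken from CPSDocumentGenerator.
final class M1Heterogeneity {

    // One output file keeps the float value; the other stores it rounded to an integer.
    static String floatToInteger(String floatValue) {
        return Long.toString(Math.round(Double.parseDouble(floatValue)));
    }

    // One output file keeps the plain date; the other stores it as a datetime at start of day.
    // Assumes the input is an ISO date such as 2017-05-04.
    static String dateToDateTime(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay()
                .format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(floatToInteger("12.6"));       // prints 13
        System.out.println(dateToDateTime("2017-05-04")); // prints 2017-05-04T00:00:00
    }
}
```

Each method takes the value as it appears in seed.aml and returns the variant for the second file, so the two generated documents differ only in the datatype of that attribute.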

igrangel commented 7 years ago

I can't open it as a diagram and the quality of the image is not good. Could you please share it as a diagram?

omarrana commented 7 years ago

https://drive.google.com/file/d/0B7FScfr1FLRDQmhaRDZDYnRyRTQ/view?usp=sharing

At the top of the page there will be an "Open with" option for draw.io.

igrangel commented 7 years ago

What you said in 1 is correct.

"1- config.ttl as input which shows what type of heterogeneity documents should be generated." It should be the type and the combination of types, but this is not properly shown in the diagram. The seed.aml can represent a general AML document, which is used as a general template for the document.
The second thing we need to deal with is which elements can be associated with which types of heterogeneity. For example, Attributes can be associated with M1, etc. I think this information can go into the description given as input.
The third is how the elements can be correctly plugged into the CAEX tree.
There should be an Activity that checks the consistency of the generated AML documents; if everything is correct, we will have the two documents as output, as well as metadata about what has been generated.
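
A first version of the consistency-checking Activity could look like the sketch below. It only checks that a generated document is well-formed XML with a `CAEXFile` root element; validation against the CAEX schema would be a stricter check and is not shown. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not taken from CPSDocumentGenerator.
final class AmlConsistencyCheck {

    // Returns true if the text parses as XML and its root element is CAEXFile.
    // Assumption: the root element is written without a namespace prefix.
    static boolean isWellFormedCaex(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return "CAEXFile".equals(doc.getDocumentElement().getNodeName());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed XML or parser setup failure: treat as inconsistent.
            return false;
        }
    }

    public static void main(String[] args) {
        String generated =
                "<CAEXFile FileName=\"a.aml\"><InstanceHierarchy Name=\"ih\"/></CAEXFile>";
        System.out.println(isWellFormedCaex(generated));
    }
}
```

The generator would run this on both output documents and only emit them, together with the metadata, when both pass.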

As a general remark, we need to make this configurable, in a way that people who know the domain can easily add information about the heterogeneities and our code can use this information to generate more documents.
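
One way to picture this configurability is to keep the element-to-heterogeneity associations as data rather than code. The sketch below is only an illustration under stated assumptions: it uses a `java.util.Properties` text as a stand-in for config.ttl so that it needs no RDF library, the class name is hypothetical, and `M2` and `M3` are placeholder type names (only "Attributes can be associated with M1" comes from this thread).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not taken from CPSDocumentGenerator.
final class HeterogeneityRules {

    // Parses lines of the form "ElementKind=Type1, Type2" into a map
    // from element kind to the heterogeneity types it may carry.
    static Map<String, List<String>> load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> rules = new TreeMap<>();
        for (String element : props.stringPropertyNames()) {
            String[] types = props.getProperty(element).trim().split("\\s*,\\s*");
            rules.put(element, Arrays.asList(types));
        }
        return rules;
    }

    public static void main(String[] args) {
        // M2 and M3 are placeholders, not types defined in this thread.
        String config = "Attribute=M1\nInternalElement=M2, M3\n";
        System.out.println(load(config));
    }
}
```

A domain expert could then add a line to the configuration without touching the generator code; the same associations could equally be stated as triples in config.ttl and read with an RDF library.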

igrangel commented 7 years ago

Please check this paper (http://se-pubs.dbs.uni-leipzig.de/files/Weis2006ADuplicateDetectionBenchmark.pdf). We should follow similar strategies. Look at section 2.3, Benchmark limitations; maybe we can improve on some of them. In addition, please check this and this papers.

i40-Tools / CPSDocumentGenerator

Benchmark Algorithm #10