sekruse / metanome-cli

Run Metanome algorithms from the command line
http://www.metanome.de/
Apache License 2.0

Examples running existing algorithms on existing data #6

Open nkons opened 6 years ago

nkons commented 6 years ago

Hi,

I am trying to invoke from Eclipse the main method in de.metanome.cli.App, in order to get one of the existing algorithms to produce profiling data about some data I have locally.

Looking at the documentation (user/developer guides), it wasn't obvious how to do this, so hopefully this issue will be of help to others as well.

My starting point would be to use DUCC to discover keys in a set of CSV files. The arguments I tried using look like the following, but no luck yet (I keep getting 'Could not initialize algorithm.').

-a de.metanome.algorithms.ducc.DuccAlgorithm
--files load:/path/to/a.csv;/path/to/b.csv;/path/to/c.csv
--file-key INPUT_GENERATOR
--algorithm-config NULL_EQUALS_NULL:true VALIDATE_PARALLEL:false 
 ENABLE_MEMORY_GUARDIAN:false MAX_UCC_SIZE:1000 INPUT_ROW_LIMIT:1000

So, I believe some examples using metanome-cli would be useful, e.g.: a) on top of existing CSV files (say, DUCC, for key discovery), and b) on top of a relational backend (say, BINDER-Database, to discover foreign keys in 3 tables; what happens if these are in different schemas/databases?). It is also not clear how to store database connection settings (is a ProfileDB necessary? If so, what would an example look like?).

Thank you in advance.

Best, Nikos

sekruse commented 6 years ago

Hi Nikos,

thanks for pointing these issues out and I think having examples in the readme is an excellent idea!

That being said, let me see if I can help you troubleshoot your problem at hand:

Last but not least, Metanome CLI does not support multiple database connections (although technically that should be possible).

nkons commented 6 years ago

Hi,

Thanks for the prompt response. I am now able to run Ducc. Apparently, I had the class name wrong, so the constructor was missing. Corrected it to de.metanome.algorithms.ducc.Ducc, and it now works.

Also created a pgpass file, and I am able to process relational data.

Now, the problem is with BINDERDatabase. To my understanding, the algorithm needs a set of tables as input, which I can provide by passing --inputs load:/path/to/tables.txt to metanome-cli. But when it comes to algorithm parameters, I am not sure how to pass more than one input table to --algorithm-config. Specifically, the arguments look like the following:

-a de.metanome.algorithms.binder.BINDERDatabase
--inputs load:/path/to/tables.txt
--input-key INPUT_DATABASE 
--db-connection /path/to/db.pgpass
--db-type postgresql
--output print
--algorithm-config
TEMP_FOLDER_PATH:/tmp/
DATABASE_NAME:testdb
DATABASE_TYPE:POSTGRESQL
INPUT_TABLES:load:/path/to/tables.txt

The issue here is with the last line of the above. BINDERDatabase requires the table names to be specified using INPUT_TABLES, but I wasn't sure how to pass an array of values for this parameter in the algorithm configuration. Just leaving the --inputs and omitting the INPUT_TABLES: causes the algorithm to crash.

So, I would like to ask whether there is a way to specify multi-valued algorithm parameters, using load:, or e.g. something like:

--algorithm-config INPUT_TABLES:[schema1.table1,schema2.table2,schema3.table3]

Thanks for your support.

Best, Nikos

sekruse commented 6 years ago

I am glad to hear that DUCC is already running!

I had a look at BINDERDatabase to see how it's configured. Apparently, it requires a DatabaseConnectionGenerator for which the current setup code in metanome-cli seems weird.

-a de.metanome.algorithms.binder.BINDERDatabase
--input-key INPUT_DATABASE
--inputs some_ignored_value
--db-connection /path/to/db.pgpass
--db-type postgresql
--output print
--algorithm-config
TEMP_FOLDER_PATH:/tmp/
DATABASE_NAME:testdb
DATABASE_TYPE:POSTGRESQL
INPUT_TABLES:???
...

Unfortunately, multiple values are not supported in the current version. But I have just seen that this pull request might introduce the functionality by supporting repeated key:value pairs, i.e. --algorithm-config INPUT_TABLES:table1 INPUT_TABLES:table2 .... Unfortunately, the contributor has changed the formatting, so it will take some time to review the changes, but I guess that this can help you.

I would be happy to hear from you whether the PR works fine for you!
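To make the idea of repeated key:value pairs concrete, here is a minimal sketch of how such pairs could be grouped into one multi-valued parameter. It is not the pull request's code: `group_config` is a hypothetical helper, and the comma-joined output is only a display format chosen for this example.

```shell
# Hypothetical helper (not part of metanome-cli): group repeated KEY:value
# arguments by key, keeping the order in which keys first appear.
group_config() {
  printf '%s\n' "$@" | awk -F: '
    $0 == "" { next }                       # ignore empty input
    {
      key = $1
      value = substr($0, length(key) + 2)   # everything after the first colon
      if (key in values) {
        values[key] = values[key] "," value
      } else {
        order[++n] = key
        values[key] = value
      }
    }
    END {
      for (i = 1; i <= n; i++) print order[i] "=" values[order[i]]
    }
  '
}

# Values may themselves contain colons or slashes; only the first colon splits.
group_config TEMP_FOLDER_PATH:/tmp/ INPUT_TABLES:table1 INPUT_TABLES:table2
```

With this grouping, the two INPUT_TABLES entries end up under a single key, which is the behaviour the repeated-pair syntax would need.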

fyndalf commented 5 years ago

Hi, sorry for hijacking this issue, I didn't know where else to ask this:

When running the metanome-cli, I faced two problems:

1.) When running BINDERDatabase,

Initializing algorithm.
Algorithm does not implement a supported input method (relational/tables).

was returned, despite using the latest algorithm and a (hopefully correct) configuration. I can't quite see how I should change my configuration to fix this.

2.) When running BINDERFile, I still don't quite understand what the --input-key parameter does, as I've already specified the files using the --files key. Putting anything as an input key yields:

de.metanome.algorithm_integration.AlgorithmConfigurationException: Unknown configuration: FILENAME -> de.metanome.backend.input.file.DefaultFileInputGenerator

What do I actually have to provide with that key for the algorithm to work?

Thanks in advance!

sekruse commented 5 years ago

No worries, let's see if I can help you there.

  1. BINDERDatabase is a DatabaseConnectionParameterAlgorithm, but it implements neither RelationalInputParameterAlgorithm nor TableInputParameterAlgorithm, as expected by Metanome CLI. Turns out that BINDERDatabase creates the table inputs itself - a case that Metanome CLI apparently did not account for. However, in the light of this new insight, Metanome CLI should only fail if a --db-connection/parameters.pgpassPath is specified and the specified algorithm is neither a RelationalInputParameterAlgorithm nor a TableInputParameterAlgorithm nor a DatabaseConnectionParameterAlgorithm. Furthermore, it should not be possible for an algorithm to be a (RelationalInputParameterAlgorithm xor TableInputParameterAlgorithm) and DatabaseConnectionParameterAlgorithm at the same time, as already allowed. All this should be easily achieved by restructuring this code branch.

    Do you want to give it a try and see if that does the trick for you? It would be awesome if you could share a PR on success!

  2. Metanome algorithms are configured via key-value pairs. The input files are also a key-value pair, but Metanome CLI has a special way of exposing them (namely via --input-key <key> --input-files <values...>), because they require special interpretation. For BINDERFile, the --input-key parameter must be INPUT_FILES.

Feel free to reach out if you have further questions!

wunderbarr commented 5 years ago

Hello, I would like to ask about running CSV files. I am testing with the adult CSV, and my command is:

java -cp metanome-cli-1.1.0.jar:pyro-distro-1.0-SNAPSHOT-distro.jar de.metanome.cli.App --algorithm de.hpi.isg.pyro.algorithms.Pyro --files load:file.txt --file-key INPUT_GENERATOR

In the file.txt I store the path to the CSV file. And I get the error: Running de.hpi.isg.pyro.algorithms.Pyro

wunderbarr commented 5 years ago

Hello, I added slf4j-simple-1.7.25.jar to the classpath, and the above error disappeared. But I still cannot run Pyro.

My cmd: java -cp metanome-cli-1.1.0.jar:slf4j-simple-1.7.25.jar:pyro-distro-1.0-SNAPSHOT-distro.jar de.metanome.cli.App --algorithm de.hpi.isg.pyro.algorithms.Pyro --files adult.csv --file-key INPUT_GENERATOR --algorithm-config maxFdError:0.01

Is there any configuration problem?

Running de.hpi.isg.pyro.algorithms.Pyro

  • in: [adult.csv]
  • out: file
  • configuration: [maxFdError:0.01]

Initializing algorithm.
Could not initialize algorithm.
java.lang.IllegalArgumentException: Unsupported argument.
    at de.hpi.isg.pyro.algorithms.Pyro.setRelationalInputConfigurationValue(Pyro.java:407)
    at de.metanome.cli.App.setUpInputGenerators(App.java:374)
    at de.metanome.cli.App.configureAlgorithm(App.java:268)
    at de.metanome.cli.App.run(App.java:83)
    at de.metanome.cli.App.main(App.java:47)

Thank you!

sekruse commented 5 years ago

I think that it should read --file-key inputFile (cf. here and here).

wunderbarr commented 5 years ago

Thank you! It works!

Ryang326 commented 5 years ago

Hello sekruse, I would like to ask about running the FUN algorithm on CSV files to discover FDs. I am testing with iris.csv, and my command is:

java -cp metanome-cli-1.1.jar:fun_for_metanome-0.0.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.uni_potsdam.hpi.metanome.algorithms.fun.Fun --file-key Relational Input --files load:iris.csv

In the file.txt I store the path to the CSV file. And I get the error:

Could not parse command line args: Was passed main parameter 'Input' but no main parameter was defined

I tried the command without any parameters. It asked me to input: --file-key, --input-key, --table-key and --files, --inputs, --tables. However, after reading the problems above, I only tried to work out the --file-key and --files parameters, and I am not sure they are right. I am confused about what I should pass for these six parameters. Could you please give me some hints? Thank you!

sekruse commented 5 years ago

Hi Ryang326 and sorry for the delayed response.

Essentially, --file-key, --input-key, and --table-key are all the same, and what you need to put here depends on the Metanome algorithm. For FUN, you do indeed need to specify Relational Input according to the code. However, Relational Input needs to be put in quotation marks. Otherwise, your shell will deliver it as two individual arguments to the Metanome CLI, which expects only a single argument.
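The splitting can be seen without Metanome at all. A minimal sketch, where `show_args` is a hypothetical helper that only prints what the shell hands to a program:

```shell
# Hypothetical helper: print each argument the shell delivers, one per line.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Unquoted: the shell splits at the space, so the key arrives as two arguments.
show_args --file-key Relational Input

# Quoted: the key arrives as a single argument.
show_args --file-key "Relational Input"
```

The first call prints three bracketed lines and the second prints two, which matches the "main parameter 'Input'" complaint: the stray third argument has no option to belong to.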

--files, --inputs, and --tables are also all the same, and you can pick one of them. After that, you can add your CSV file. But don't use load:. That would be correct only if iris.csv were a file that contained a list of CSV files.
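As a rough illustration of that difference, here is a sketch of how a load: argument could be expanded. It assumes the list file holds one path per line; `resolve_inputs` and the /data/... paths are hypothetical, and this is not the CLI's actual code.

```shell
# Hypothetical helper: expand "load:<list-file>" into the paths listed in that
# file (assumed: one path per line, blank lines ignored); pass other arguments
# through unchanged.
resolve_inputs() {
  for arg in "$@"; do
    case "$arg" in
      load:*) grep -v '^[[:space:]]*$' "${arg#load:}" ;;
      *)      printf '%s\n' "$arg" ;;
    esac
  done
}

# Demo with a throwaway list file.
list_file=$(mktemp)
printf '%s\n' /data/a.csv /data/b.csv > "$list_file"

resolve_inputs iris.csv             # a plain CSV path is used as-is
resolve_inputs "load:$list_file"    # the list file is read; its entries become the inputs
rm -f "$list_file"
```

Under that reading, load:iris.csv would treat every row of the CSV as a file name, which is clearly not what is wanted here.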

So you can try:

java -cp metanome-cli-1.1.jar:fun_for_metanome-0.0.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.uni_potsdam.hpi.metanome.algorithms.fun.Fun --file-key "Relational Input" --files iris.csv
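
Since the right key differs per algorithm, it may help to collect the keys mentioned so far in this thread in one place. The sketch below is only a lookup over those values; `input_key_for` is a hypothetical helper, and the BINDERFile entry matches on the class name suffix because its full package name is not given above.

```shell
# Hypothetical lookup of input keys per algorithm class, using only the values
# stated in this thread. Unknown classes yield an error instead of a guess.
input_key_for() {
  case "$1" in
    de.metanome.algorithms.ducc.Ducc)               echo "INPUT_GENERATOR" ;;
    de.hpi.isg.pyro.algorithms.Pyro)                echo "inputFile" ;;
    de.uni_potsdam.hpi.metanome.algorithms.fun.Fun) echo "Relational Input" ;;
    *BINDERFile)                                    echo "INPUT_FILES" ;;
    *) echo "unknown input key for $1" >&2; return 1 ;;
  esac
}

# Keep the expansion quoted so a key containing a space stays one argument.
key=$(input_key_for de.uni_potsdam.hpi.metanome.algorithms.fun.Fun)
printf '%s "%s"\n' --file-key "$key"
```

For any algorithm not listed, the key has to be looked up in that algorithm's source code, as was done for FUN and Pyro above.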

cccshuang commented 5 years ago

When using the DC algorithm:

java -cp metanome-cli.jar;hydra-1.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.hpi.naumann.dc.algorithms.hybrid.HydraMetanome --file-key "INPUT" --files Tax.csv --escape \ --separator , --algorithm-config EFFICIENCY_THRESHOLD:0.005 CROSS_COLUMN_STRING_MIN_OVERLAP:0.15 SAMPLE_ROUNDS:20 NO_CROSS_COLUMN:false

I get the error:

...
15:33:54.771 [main] INFO d.h.n.d.a.hybrid.HydraMetanome - Result size: 5117
Algorithm crashed.
java.lang.NullPointerException
    at de.hpi.naumann.dc.algorithms.hybrid.HydraMetanome.execute(HydraMetanome.java:76)
    at de.metanome.cli.App.run(App.java:89)
    at de.metanome.cli.App.main(App.java:47)
Elapsed time: 0:00:03.332 (3332 ms).

DC works in the Metanome tool but fails with metanome-cli. I think it may be caused by this:

Initializing algorithm.
Could not configure any result receiver.

But I don't know how to solve it. Could you please give me some hints? Thank you very much!

sekruse commented 5 years ago

Since this line is crashing, I think it really is the missing result receiver.

In fact, it appears that this method is lacking the necessary code to configure a DenialConstraintResultReceiver for DenialConstraintAlgorithms.

Unfortunately, I don't have the time to fix this. Do you want to send a PR with a fix?

cccshuang commented 5 years ago

Ok. After adding some code to the "configureResultReceiver" method, it can support DC now. It may rely on the latest Metanome 1.2, because Metanome 1.1 may not contain the DC algorithm. I don't know why running the project caused this error, so I also added the dependency com.ecwid.ecwid-mailchimp to solve this problem.

faisal-ksolves commented 1 year ago

hello @sekruse, what is the maximum dataset size metanome algorithms can run on?

sekruse commented 1 year ago

@faisal-ksolves – That depends on the algorithm, your hardware, and various dataset properties besides its size. Most often, RAM is the limiting factor, especially for datasets with many columns. Please refer to the research papers of the individual algorithms for a detailed evaluation.

faisal-ksolves commented 1 year ago

@sekruse can i get some quick links of those papers?

sekruse commented 1 year ago

@faisal-ksolves – https://hpi.de/naumann/projects/data-profiling-and-analytics/metanome-data-profiling.html should contain most links. The BINDER paper is called Divide & Conquer-based Inclusion Dependency Discovery.

faisal-ksolves commented 1 year ago

Hello, can anyone help me run the HyMD algorithm on metanome-cli? It throws an error when run:

Exception in thread "main" java.lang.NoSuchFieldError: SNAKE_CASE
    at de.metanome.algorithms.hymd.Jackson.createMapper(Jackson.java:22)
    at de.metanome.algorithms.hymd.Jackson.createReader(Jackson.java:16)
    at de.metanome.algorithms.hymd.HyMD.readConfig(HyMD.java:198)
    at de.metanome.algorithms.hymd.HyMD.setStringConfigurationValue(HyMD.java:156)
    at de.metanome.backend.configuration.ConfigurationValueString.triggerSetValue(ConfigurationValueString.java:69)
    at de.metanome.cli.Helpers.AlgorithmInitializer.triggerSetValue(AlgorithmInitializer.java:173)
    at de.metanome.cli.Helpers.AlgorithmInitializer.apply(AlgorithmInitializer.java:79)
    at de.metanome.cli.App.loadMiscConfigurations(App.java:372)
    at de.metanome.cli.App.configureAlgorithm(App.java:343)
    at de.metanome.cli.App.run(App.java:134)
    at de.metanome.cli.App.main(App.java:98)

I am using the following command:

--algorithm de.metanome.algorithms.hymd.HyMD --file-key RELATION --files src/main/java/de/metanome/cli/Inputs/test.csv --header

RaVincentHuang commented 1 year ago

Hello sekruse, I can't run CFDFinder on cli. My algorithm and cli version are both 1.2. My command is:

java -cp metanome-cli-1.2-SNAPSHOT.jar:CFDFinder-1.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.metanome.algorithms.cfdfinder.CFDFinder --files ./adult.csv --file-key INPUT_GENERATOR --output print

And it throws:

(metanome-cli) ERROR    Could not configure any result receiver.
(metanome-cli) ERROR    Algorithm crashed.: de.metanome.algorithm_integration.AlgorithmConfigurationException: No result receiver set!
    at de.metanome.algorithms.cfdfinder.CFDFinder.execute(CFDFinder.java:286)
    at de.metanome.cli.App.run(App.java:110)
    at de.metanome.cli.App.main(App.java:75)
(metanome-cli) INFO     Elapsed time: 0:00:00.001 (1 ms).
(metanome-cli) INFO     Results:

I found that CFD's Receiver has been implemented in the 1.2 cli version, but the code if (algorithm instanceof ConditionalFunctionalDependencyAlgorithm) did not execute successfully. I don't know the reason for this problem.
