hogent-cads / AI_MDM_Prototype


Deduplication: no matches found #33

Closed slievens closed 1 year ago

slievens commented 1 year ago

[image]

After labelling more than 160 pairs, only 1 match has been found so far. The blog post on deduplication says that a model could already be built after 53 pairs?

slievens commented 1 year ago

Maybe this has something to do with it. Below is the output of the Spark Zingg job, once at startup and once after labelling about 20 pairs. The second time, the output still looks the same.

Zingg output on the first run

RuleFinder | run_zingg_phase | 2023-04-22 16:12:04 | DEBUG | b"2023-04-22 16:11:51,060 [main] WARN org.apache.spark.util.Utils - Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface enp0s3)\n 2023-04-22 16:11:51,061 [main] WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address\n 2023-04-22 16:11:51,464 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n ['--phase', 'findTrainingData']\narguments for client options are ['--phase', 'findTrainingData', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']\n2023-04-22 16:11:55,553 [Thread-5] INFO zingg.client.ClientOptions - --phase\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - findTrainingData\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - --license\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - zinggLic.txt\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - --email\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - zingg@zingg.ai\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - --conf\n 2023-04-22 16:11:55,554 [Thread-5] INFO zingg.client.ClientOptions - dummyConf.json\n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - **\n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - Note about analytics collection by Zingg AI \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - Please note that Zingg captures a few metrics about application's \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - runtime parameters. 
However, no user's personal data or application \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - data is captured. If you want to switch off this feature, please \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - set the flag collectMetrics to false in config. For details, please \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - refer to the Zingg docs (https://docs.zingg.ai/docs/security.html) \n 2023-04-22 16:11:55,560 [Thread-5] INFO zingg.client.Client - **\n 2023-04-22 16:11:55,561 [Thread-5] INFO zingg.client.Client - \n 2023-04-22 16:11:55,572 [Thread-5] WARN org.apache.spark.sql.SparkSession$Builder - Using an existing SparkSession; some spark core configurations may not take effect.\n 2023-04-22 16:11:55,696 [Thread-5] WARN org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.\n 2023-04-22 16:11:55,696 [Thread-5] INFO zingg.ZinggBase - Start reading internal configurations and functions\n 2023-04-22 16:11:55,703 [Thread-5] INFO zingg.ZinggBase - Finished reading internal configurations and functions\n 2023-04-22 16:11:55,712 [Thread-5] WARN zingg.util.PipeUtil - Reading input csv\n 2023-04-22 16:11:55,713 [Thread-5] WARN zingg.util.PipeUtil - Reading Pipe [name=input1, format=csv, preprocessors=null, props={location=storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/input_dir/input.csv}, schema=StructType(StructField(id,StringType,true), StructField(artist,StringType,true), StructField(title,StringType,true), StructField(category,StringType,true), StructField(genre,StringType,true), StructField(year,StringType,true), StructField(track01,StringType,true), StructField(track02,StringType,true), StructField(track03,StringType,true), StructField(track04,StringType,true))]\n 2023-04-22 16:11:58,997 [Thread-5] WARN zingg.TrainingDataFinder - Read input data 501\n 2023-04-22 16:11:58,998 [Thread-5] WARN zingg.util.PipeUtil 
- Reading input parquet\n 2023-04-22 16:11:59,000 [Thread-5] WARN zingg.util.PipeUtil - Reading Pipe [name=null, format=parquet, preprocessors=null, props={location=storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/models/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/trainingData//marked/}, schema=null]\n 2023-04-22 16:11:59,019 [Thread-5] WARN zingg.util.PipeUtil - Path does not exist: file:/home/vagrant/AI_MDM_Prototype/storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/models/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/trainingData/marked\n 2023-04-22 16:11:59,020 [Thread-5] WARN zingg.util.DSUtil - No preexisting marked training samples\n 2023-04-22 16:11:59,020 [Thread-5] WARN zingg.util.DSUtil - No configured training samples\n 2023-04-22 16:11:59,020 [Thread-5] WARN zingg.util.DSUtil - No training data found\n 2023-04-22 16:11:59,147 [Thread-5] INFO zingg.TrainingDataFinder - Created positive sample pairs \n 2023-04-22 16:11:59,311 [Thread-5] INFO zingg.TrainingDataFinder - Preprocessing DS for stopWords\n 2023-04-22 16:11:59,756 [Thread-5] INFO zingg.util.Heuristics - Block size 8 and total count was 258\n 2023-04-22 16:11:59,757 [Thread-5] INFO zingg.util.Heuristics - Heuristics suggest 8\n 2023-04-22 16:11:59,761 [Thread-5] INFO zingg.util.BlockingTreeUtil - Learning indexing rules for block size 8\n 2023-04-22 16:12:00,612 [Thread-5] INFO zingg.TrainingDataFinder - Writing uncertain pairs when either positive or negative samples not provided \n 2023-04-22 16:12:00,652 [Thread-5] WARN org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. 
This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\n 2023-04-22 16:12:01,547 [Thread-5] WARN zingg.util.PipeUtil - Writing output Pipe [name=null, format=parquet, preprocessors=null, props={location=storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/models/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/trainingData//unmarked/}, schema=null]\n 2023-04-22 16:12:01,547 [Thread-5] WARN zingg.util.PipeUtil - Writing file\n "
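The captured output above is one long bytes literal, which makes the relevant warnings hard to spot. A minimal sketch of how it could be split into individual log records and filtered down to Zingg's own warnings is shown here. The function names `parse_zingg_output` and `zingg_warnings` are hypothetical and not part of AI_MDM_Prototype or Zingg; the regular expression only assumes the line layout visible in the output above.

```python
import re

# One log line as it appears in the output above:
# "<date> <time>,<ms> [<thread>] <LEVEL> <logger> - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+) +(?P<logger>\S+) - (?P<msg>.*)$"
)


def parse_zingg_output(raw: bytes) -> list:
    """Decode captured job output and return one dict per recognised log line.

    Lines that do not follow the layout above (for example the echoed
    argument list) are skipped.
    """
    records = []
    for line in raw.decode("utf-8", errors="replace").splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


def zingg_warnings(raw: bytes) -> list:
    """Keep only WARN records whose logger name starts with 'zingg.'."""
    return [
        record
        for record in parse_zingg_output(raw)
        if record["level"] == "WARN" and record["logger"].startswith("zingg.")
    ]
```

Applied to the captured bytes, `zingg_warnings(output)` would leave only the `zingg.util.PipeUtil`, `zingg.util.DSUtil` and `zingg.TrainingDataFinder` warnings, which are the ones that say whether marked training data was found.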

Zingg output after 20 pairs were labelled

RuleFinder | run_zingg_phase | 2023-04-22 16:17:18 | DEBUG | b"2023-04-22 16:17:04,618 [main] WARN org.apache.spark.util.Utils - Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface enp0s3)\n 2023-04-22 16:17:04,619 [main] WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address\n 2023-04-22 16:17:05,027 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n ['--phase', 'findTrainingData']\narguments for client options are ['--phase', 'findTrainingData', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']\n2023-04-22 16:17:09,098 [Thread-5] INFO zingg.client.ClientOptions - --phase\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - findTrainingData\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - --license\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - zinggLic.txt\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - --email\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - zingg@zingg.ai\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - --conf\n 2023-04-22 16:17:09,099 [Thread-5] INFO zingg.client.ClientOptions - dummyConf.json\n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - \n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - **\n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - Note about analytics collection by Zingg AI \n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - \n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - Please note that Zingg captures a few metrics about application's \n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - runtime parameters. 
However, no user's personal data or application \n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - data is captured. If you want to switch off this feature, please \n 2023-04-22 16:17:09,101 [Thread-5] INFO zingg.client.Client - set the flag collectMetrics to false in config. For details, please \n 2023-04-22 16:17:09,105 [Thread-5] INFO zingg.client.Client - refer to the Zingg docs (https://docs.zingg.ai/docs/security.html) \n 2023-04-22 16:17:09,105 [Thread-5] INFO zingg.client.Client - **\n 2023-04-22 16:17:09,107 [Thread-5] INFO zingg.client.Client - \n 2023-04-22 16:17:09,116 [Thread-5] WARN org.apache.spark.sql.SparkSession$Builder - Using an existing SparkSession; some spark core configurations may not take effect.\n 2023-04-22 16:17:09,307 [Thread-5] WARN org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.\n 2023-04-22 16:17:09,307 [Thread-5] INFO zingg.ZinggBase - Start reading internal configurations and functions\n 2023-04-22 16:17:09,324 [Thread-5] INFO zingg.ZinggBase - Finished reading internal configurations and functions\n 2023-04-22 16:17:09,347 [Thread-5] WARN zingg.util.PipeUtil - Reading input csv\n 2023-04-22 16:17:09,348 [Thread-5] WARN zingg.util.PipeUtil - Reading Pipe [name=input1, format=csv, preprocessors=null, props={location=storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/input_dir/input.csv}, schema=StructType(StructField(id,StringType,true), StructField(artist,StringType,true), StructField(title,StringType,true), StructField(category,StringType,true), StructField(genre,StringType,true), StructField(year,StringType,true), StructField(track01,StringType,true), StructField(track02,StringType,true), StructField(track03,StringType,true), StructField(track04,StringType,true))]\n 2023-04-22 16:17:12,782 [Thread-5] WARN zingg.TrainingDataFinder - Read input data 501\n 2023-04-22 16:17:12,783 [Thread-5] WARN zingg.util.PipeUtil 
- Reading input parquet\n 2023-04-22 16:17:12,785 [Thread-5] WARN zingg.util.PipeUtil - Reading Pipe [name=null, format=parquet, preprocessors=null, props={location=storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/models/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/trainingData//marked/}, schema=null]\n 2023-04-22 16:17:12,937 [Thread-5] WARN zingg.util.PipeUtil - Unable to infer schema for Parquet. It must be specified manually.\n 2023-04-22 16:17:12,937

The following line looks suspicious!

[Thread-5] WARN zingg.util.DSUtil - No preexisting marked training samples\n 2023-04-22 16:17:12,939

[Thread-5] WARN zingg.util.DSUtil - No configured training samples\n 2023-04-22 16:17:12,939 [Thread-5] WARN zingg.util.DSUtil - No training data found\n 2023-04-22 16:17:13,069 [Thread-5] INFO zingg.TrainingDataFinder - Created positive sample pairs \n 2023-04-22 16:17:13,225 [Thread-5] INFO zingg.TrainingDataFinder - Preprocessing DS for stopWords\n 2023-04-22 16:17:13,696 [Thread-5] INFO zingg.util.Heuristics - Block size 8 and total count was 263\n 2023-04-22 16:17:13,696 [Thread-5] INFO zingg.util.Heuristics - Heuristics suggest 8\n 2023-04-22 16:17:13,696 [Thread-5] INFO zingg.util.BlockingTreeUtil - Learning indexing rules for block size 8\n 2023-04-22 16:17:14,530 [Thread-5] INFO zingg.TrainingDataFinder - Writing uncertain pairs when either positive or negative samples not provided \n 2023-04-22 16:17:14,579 [Thread-5] WARN org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\n 2023-04-22 16:17:15,526 [Thread-5] WARN zingg.util.PipeUtil - Writing output Pipe [name=null, format=parquet, preprocessors=null, props={location=storage/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/models/f193d6e7-7cca-4185-a298-6771cc857a1c-5af56602494a066b34f8f87846ac72dd/trainingData//unmarked/}, schema=null]\n 2023-04-22 16:17:15,527 [Thread-5] WARN zingg.util.PipeUtil - Writing file\n "
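The two runs differ in one detail: the first reports `Path does not exist` for `trainingData/marked`, while the second reports `Unable to infer schema for Parquet`, a message Spark typically gives when a directory exists but contains no readable Parquet data. One way to check this on disk is sketched below. It assumes the `trainingData/marked` and `trainingData/unmarked` layout seen in the log; the helper names are hypothetical and not part of Zingg or AI_MDM_Prototype.

```python
from pathlib import Path


def count_parquet_files(directory: Path) -> int:
    """Count non-empty *.parquet files anywhere below `directory`.

    Returns 0 if the directory does not exist.
    """
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.rglob("*.parquet") if path.stat().st_size > 0)


def training_data_status(model_dir: Path) -> dict:
    """Report how many Parquet files sit under trainingData/marked and
    trainingData/unmarked (directory names as they appear in the log)."""
    training_dir = model_dir / "trainingData"
    return {
        name: count_parquet_files(training_dir / name)
        for name in ("marked", "unmarked")
    }
```

If `training_data_status(...)` reports `0` for `marked` after labelling, the labelled pairs were never written where `findTrainingData` looks for them, which would be consistent with the `No preexisting marked training samples` warning.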

slievens commented 1 year ago

This problem should hopefully be fixed by commit https://github.com/hogent-cads/AI_MDM_Prototype/commit/1fc68dc382cb5b4a5bc3149509911d41bded5c61