mindsdb / mindsdb

The platform for building AI from enterprise data
https://mindsdb.com
26.43k stars · 4.8k forks

[Bug] How to use CREATE PREDICTOR USING dtype_dict? #2346

Closed gattack closed 2 years ago

gattack commented 2 years ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

I have a query to create a predictor like this (pasted exactly as run, including the trailing `};`):

CREATE PREDICTOR PR2039$test
FROM jw_ch (
    select type, isDom, airline, flightCode, fareClass, cabinClass, rute, distance,
           currency, departDay, weekday, isWeekend, holidayWeek, holidayMonth, holiday,
           holidayBefore, holidayAfter, originCountry, destinationCountry, airlineType,
           duration, supplier, departTime, arrivalTime, transitDepartTime,
           transitArrivalTime, aircraftType, popularRoute, uniqueFlight, vtl,
           quarantineDays, total, toHour(updateAt) as updateHour
    from machine_learning.pp_train_data ptd
    where flightCode = 'PR2039'
)
PREDICT total
USING dtype_dict = '{
    "type": "categorical", "isDom": "binary", "airline": "categorical",
    "flightCode": "categorical", "fareClass": "categorical", "cabinClass": "categorical",
    "rute": "categorical", "distance": "integer", "currency": "categorical",
    "departDay": "integer", "isWeekend": "binary", "holidayWeek": "integer",
    "holidayMonth": "integer", "holiday": "binary", "holidayBefore": "integer",
    "holidayAfter": "integer", "originCountry": "categorical",
    "destinationCountry": "categorical", "airlineType": "binary", "duration": "integer",
    "supplier": "categorical", "departTime": "integer", "arrivalTime": "integer",
    "transitDepartTime": "integer", "transitArrivalTime": "integer",
    "aircraftType": "categorical", "popularRoute": "binary", "uniqueFlight": "binary",
    "vtl": "binary", "quarantineDays": "integer", "total": "float",
    "updateHour": "integer" };

After executing it, I tried to check the status using the REST API, and here is the resulting (error) status:

{ "code": "import lightwood\nfrom lightwood import __version__ as lightwood_version\nfrom lightwood.analysis import *\nfrom lightwood.api import *\nfrom lightwood.data import *\nfrom lightwood.encoder import *\nfrom lightwood.ensemble import *\nfrom lightwood.helpers.device import *\nfrom lightwood.helpers.general import *\nfrom lightwood.helpers.log import *\nfrom lightwood.helpers.numeric import *\nfrom lightwood.helpers.imputers import *\nfrom lightwood.helpers.parallelism import *\nfrom lightwood.helpers.seed import *\nfrom lightwood.helpers.text import *\nfrom lightwood.helpers.torch import *\nfrom lightwood.mixer import *\nimport pandas as pd\nfrom typing import Dict, List, Union\nimport os\nfrom types import ModuleType\nimport importlib.machinery\nimport sys\nimport time\n\n\nfor import_dir in [\n os.path.join(\n os.path.expanduser(\"~/lightwood_modules\"), lightwood_version.replace(\".\", \"_\")\n ),\n os.path.join(\"/etc/lightwood_modules\", lightwood_version.replace(\".\", \"_\")),\n]:\n if os.path.exists(import_dir) and os.access(import_dir, os.R_OK):\n for file_name in list(os.walk(import_dir))[0][2]:\n if file_name[-3:] != \".py\":\n continue\n mod_name = file_name[:-3]\n loader = importlib.machinery.SourceFileLoader(\n mod_name, os.path.join(import_dir, file_name)\n )\n module = ModuleType(loader.name)\n loader.exec_module(module)\n sys.modules[mod_name] = module\n exec(f\"import {mod_name}\")\n\n\nclass Predictor(PredictorInterface):\n target: str\n mixers: List[BaseMixer]\n encoders: Dict[str, BaseEncoder]\n ensemble: BaseEnsemble\n mode: str\n\n def __init__(self):\n seed(420)\n self.target = \"total\"\n self.mode = \"inactive\"\n self.problem_definition = ProblemDefinition.from_dict(\n {\n \"target\": \"total\",\n \"pct_invalid\": 2,\n \"unbias_target\": True,\n \"seconds_per_mixer\": 57024.0,\n \"seconds_per_encoder\": None,\n \"expected_additional_time\": 14.5330491065979,\n \"time_aim\": 259200,\n \"target_weights\": None,\n 
\"positive_domain\": False,\n \"timeseries_settings\": {\n \"is_timeseries\": False,\n \"order_by\": None,\n \"window\": None,\n \"group_by\": None,\n \"use_previous_target\": True,\n \"horizon\": None,\n \"historical_columns\": None,\n \"target_type\": \"\",\n \"allow_incomplete_history\": True,\n \"eval_cold_start\": True,\n \"interval_periods\": [],\n },\n \"anomaly_detection\": False,\n \"use_default_analysis\": True,\n \"ignore_features\": [],\n \"fit_on_all\": True,\n \"strict_mode\": True,\n \"seed_nr\": 420,\n }\n )\n self.accuracy_functions = [\"r2_score\"]\n self.identifiers = {\n \"type\": \"No Information\",\n \"isDom\": \"No Information\",\n \"airline\": \"No Information\",\n \"flightCode\": \"No Information\",\n \"rute\": \"No Information\",\n \"distance\": \"No Information\",\n \"originCountry\": \"No Information\",\n \"destinationCountry\": \"No Information\",\n \"airlineType\": \"No Information\",\n \"duration\": \"No Information\",\n \"transitDepartTime\": \"No Information\",\n \"transitArrivalTime\": \"No Information\",\n \"aircraftType\": \"No Information\",\n \"popularRoute\": \"No Information\",\n \"vtl\": \"No Information\",\n \"quarantineDays\": \"No Information\",\n }\n self.dtype_dict = {\n \"fareClass\": \"categorical\",\n \"cabinClass\": \"categorical\",\n \"currency\": \"categorical\",\n \"departDay\": \"integer\",\n \"weekday\": \"categorical\",\n \"isWeekend\": \"binary\",\n \"holidayWeek\": \"integer\",\n \"holidayMonth\": \"integer\",\n \"holiday\": \"binary\",\n \"holidayBefore\": \"integer\",\n \"holidayAfter\": \"integer\",\n \"supplier\": \"categorical\",\n \"departTime\": \"integer\",\n \"arrivalTime\": \"integer\",\n \"uniqueFlight\": \"binary\",\n \"total\": \"float\",\n \"updateHour\": \"integer\",\n \"type\": \"categorical\",\n \"isDom\": \"binary\",\n \"airline\": \"categorical\",\n \"flightCode\": \"categorical\",\n \"rute\": \"categorical\",\n \"distance\": \"integer\",\n \"originCountry\": \"categorical\",\n 
\"destinationCountry\": \"categorical\",\n \"airlineType\": \"binary\",\n \"duration\": \"integer\",\n \"transitDepartTime\": \"integer\",\n \"transitArrivalTime\": \"integer\",\n \"aircraftType\": \"categorical\",\n \"popularRoute\": \"binary\",\n \"vtl\": \"binary\",\n \"quarantineDays\": \"integer\",\n }\n\n # Any feature-column dependencies\n self.dependencies = {\n \"total\": [],\n \"fareClass\": [],\n \"cabinClass\": [],\n \"currency\": [],\n \"departDay\": [],\n \"weekday\": [],\n \"isWeekend\": [],\n \"holidayWeek\": [],\n \"holidayMonth\": [],\n \"holiday\": [],\n \"holidayBefore\": [],\n \"holidayAfter\": [],\n \"supplier\": [],\n \"departTime\": [],\n \"arrivalTime\": [],\n \"uniqueFlight\": [],\n \"updateHour\": [],\n }\n\n self.input_cols = [\n \"fareClass\",\n \"cabinClass\",\n \"currency\",\n \"departDay\",\n \"weekday\",\n \"isWeekend\",\n \"holidayWeek\",\n \"holidayMonth\",\n \"holiday\",\n \"holidayBefore\",\n \"holidayAfter\",\n \"supplier\",\n \"departTime\",\n \"arrivalTime\",\n \"uniqueFlight\",\n \"updateHour\",\n ]\n\n # Initial stats analysis\n self.statistical_analysis = None\n self.runtime_log = dict()\n\n @timed\n def analyze_data(self, data: pd.DataFrame) -> None:\n # Perform a statistical analysis on the unprocessed data\n\n log.info(\"Performing statistical analysis on data\")\n self.statistical_analysis = lightwood.data.statistical_analysis(\n data,\n self.dtype_dict,\n {\n \"type\": \"No Information\",\n \"isDom\": \"No Information\",\n \"airline\": \"No Information\",\n \"flightCode\": \"No Information\",\n \"rute\": \"No Information\",\n \"distance\": \"No Information\",\n \"originCountry\": \"No Information\",\n \"destinationCountry\": \"No Information\",\n \"airlineType\": \"No Information\",\n \"duration\": \"No Information\",\n \"transitDepartTime\": \"No Information\",\n \"transitArrivalTime\": \"No Information\",\n \"aircraftType\": \"No Information\",\n \"popularRoute\": \"No Information\",\n \"vtl\": \"No Information\",\n 
\"quarantineDays\": \"No Information\",\n },\n self.problem_definition,\n )\n\n # Instantiate post-training evaluation\n self.analysis_blocks = [\n ICP(\n fixed_significance=None,\n confidence_normalizer=False,\n positive_domain=self.statistical_analysis.positive_domain,\n ),\n AccStats(deps=[\"ICP\"]),\n ConfStats(deps=[\"ICP\"]),\n ]\n\n @timed\n def preprocess(self, data: pd.DataFrame) -> pd.DataFrame:\n # Preprocess and clean data\n\n log.info(\"Cleaning the data\")\n self.imputers = {}\n data = cleaner(\n data=data,\n pct_invalid=self.problem_definition.pct_invalid,\n identifiers=self.identifiers,\n dtype_dict=self.dtype_dict,\n target=self.target,\n mode=self.mode,\n imputers=self.imputers,\n timeseries_settings=self.problem_definition.timeseries_settings,\n anomaly_detection=self.problem_definition.anomaly_detection,\n )\n\n # Time-series blocks\n\n return data\n\n @timed\n def split(self, data: pd.DataFrame) -> Dict[str, pd.DataFrame]:\n # Split the data into training/testing splits\n\n log.info(\"Splitting the data into train/test\")\n train_test_data = splitter(\n data=data,\n seed=1,\n pct_train=0.8,\n pct_dev=0.1,\n pct_test=0.1,\n tss=self.problem_definition.timeseries_settings,\n target=self.target,\n dtype_dict=self.dtype_dict,\n )\n\n return train_test_data\n\n @timed\n def prepare(self, data: Dict[str, pd.DataFrame]) -> None:\n # Prepare encoders to featurize data\n\n self.mode = \"train\"\n\n if self.statistical_analysis is None:\n raise Exception(\"Please run analyze_data first\")\n\n # Column to encoder mapping\n self.encoders = {\n \"total\": NumericEncoder(\n is_target=True,\n positive_domain=self.statistical_analysis.positive_domain,\n ),\n \"fareClass\": BinaryEncoder(),\n \"cabinClass\": OneHotEncoder(),\n \"currency\": BinaryEncoder(),\n \"departDay\": NumericEncoder(),\n \"weekday\": OneHotEncoder(),\n \"isWeekend\": BinaryEncoder(),\n \"holidayWeek\": OneHotEncoder(),\n \"holidayMonth\": OneHotEncoder(),\n \"holiday\": BinaryEncoder(),\n 
\"holidayBefore\": NumericEncoder(),\n \"holidayAfter\": NumericEncoder(),\n \"supplier\": OneHotEncoder(),\n \"departTime\": BinaryEncoder(),\n \"arrivalTime\": BinaryEncoder(),\n \"uniqueFlight\": NumericEncoder(),\n \"updateHour\": NumericEncoder(),\n }\n\n # Prepare the training + dev data\n concatenated_train_dev = pd.concat([data[\"train\"], data[\"dev\"]])\n\n log.info(\"Preparing the encoders\")\n\n encoder_prepping_dict = {}\n\n # Prepare encoders that do not require learned strategies\n for col_name, encoder in self.encoders.items():\n if col_name != self.target and not encoder.is_trainable_encoder:\n encoder_prepping_dict[col_name] = [\n encoder,\n concatenated_train_dev[col_name],\n \"prepare\",\n ]\n log.info(\n f\"Encoder prepping dict length of: {len(encoder_prepping_dict)}\"\n )\n\n # Setup parallelization\n parallel_prepped_encoders = mut_method_call(encoder_prepping_dict)\n for col_name, encoder in parallel_prepped_encoders.items():\n self.encoders[col_name] = encoder\n\n # Prepare the target\n if self.target not in parallel_prepped_encoders:\n if self.encoders[self.target].is_trainable_encoder:\n self.encoders[self.target].prepare(\n data[\"train\"][self.target], data[\"dev\"][self.target]\n )\n else:\n self.encoders[self.target].prepare(\n pd.concat([data[\"train\"], data[\"dev\"]])[self.target]\n )\n\n # Prepare any non-target encoders that are learned\n for col_name, encoder in self.encoders.items():\n if col_name != self.target and encoder.is_trainable_encoder:\n priming_data = pd.concat([data[\"train\"], data[\"dev\"]])\n kwargs = {}\n if self.dependencies[col_name]:\n kwargs[\"dependency_data\"] = {}\n for col in self.dependencies[col_name]:\n kwargs[\"dependency_data\"][col] = {\n \"original_type\": self.dtype_dict[col],\n \"data\": priming_data[col],\n }\n\n # If an encoder representation requires the target, provide priming data\n if hasattr(encoder, \"uses_target\"):\n kwargs[\"encoded_target_values\"] = 
self.encoders[self.target].encode(\n priming_data[self.target]\n )\n\n encoder.prepare(\n data[\"train\"][col_name], data[\"dev\"][col_name], **kwargs\n )\n\n @timed\n def featurize(self, split_data: Dict[str, pd.DataFrame]):\n # Featurize data into numerical representations for models\n\n log.info(\"Featurizing the data\")\n\n feature_data = {\n key: EncodedDs(self.encoders, data, self.target)\n for key, data in split_data.items()\n if key != \"stratified_on\"\n }\n\n return feature_data\n\n @timed\n def fit(self, enc_data: Dict[str, pd.DataFrame]) -> None:\n # Fit predictors to estimate target\n\n self.mode = \"train\"\n\n # --------------- #\n # Extract data\n # --------------- #\n # Extract the featurized data into train/dev/test\n encoded_train_data = enc_data[\"train\"]\n encoded_dev_data = enc_data[\"dev\"]\n encoded_test_data = enc_data[\"test\"]\n\n log.info(\"Training the mixers\")\n\n # --------------- #\n # Fit Models\n # --------------- #\n # Assign list of mixers\n self.mixers = [\n Neural(\n fit_on_dev=True,\n search_hyperparameters=True,\n net=\"DefaultNet\",\n stop_after=self.problem_definition.seconds_per_mixer,\n target_encoder=self.encoders[self.target],\n target=self.target,\n dtype_dict=self.dtype_dict,\n timeseries_settings=self.problem_definition.timeseries_settings,\n ),\n LightGBM(\n fit_on_dev=True,\n use_optuna=True,\n stop_after=self.problem_definition.seconds_per_mixer,\n target=self.target,\n dtype_dict=self.dtype_dict,\n input_cols=self.input_cols,\n target_encoder=self.encoders[self.target],\n ),\n Regression(\n stop_after=self.problem_definition.seconds_per_mixer,\n target=self.target,\n dtype_dict=self.dtype_dict,\n target_encoder=self.encoders[self.target],\n ),\n ]\n\n # Train mixers\n trained_mixers = []\n for mixer in self.mixers:\n try:\n mixer.fit(encoded_train_data, encoded_dev_data)\n trained_mixers.append(mixer)\n except Exception as e:\n log.warning(f\"Exception: {e} when training mixer: {mixer}\")\n if True and 
mixer.stable:\n raise e\n\n # Update mixers to trained versions\n self.mixers = trained_mixers\n\n # --------------- #\n # Create Ensembles\n # --------------- #\n log.info(\"Ensembling the mixer\")\n # Create an ensemble of mixers to identify best performing model\n self.pred_args = PredictionArguments()\n # Dirty hack\n self.ensemble = BestOf(\n ts_analysis=None,\n data=encoded_test_data,\n args=self.pred_args,\n accuracy_functions=self.accuracy_functions,\n target=self.target,\n mixers=self.mixers,\n )\n self.supports_proba = self.ensemble.supports_proba\n\n @timed\n def analyze_ensemble(self, enc_data: Dict[str, pd.DataFrame]) -> None:\n # Evaluate quality of fit for the ensemble of mixers\n\n # --------------- #\n # Extract data\n # --------------- #\n # Extract the featurized data into train/dev/test\n encoded_train_data = enc_data[\"train\"]\n encoded_dev_data = enc_data[\"dev\"]\n encoded_test_data = enc_data[\"test\"]\n\n # --------------- #\n # Analyze Ensembles\n # --------------- #\n log.info(\"Analyzing the ensemble of mixers\")\n self.model_analysis, self.runtime_analyzer = model_analyzer(\n data=encoded_test_data,\n train_data=encoded_train_data,\n ts_analysis=None,\n stats_info=self.statistical_analysis,\n tss=self.problem_definition.timeseries_settings,\n accuracy_functions=self.accuracy_functions,\n predictor=self.ensemble,\n target=self.target,\n dtype_dict=self.dtype_dict,\n analysis_blocks=self.analysis_blocks,\n )\n\n @timed\n def learn(self, data: pd.DataFrame) -> None:\n log.info(f\"Dropping features: {self.problem_definition.ignore_features}\")\n data = data.drop(\n columns=self.problem_definition.ignore_features, errors=\"ignore\"\n )\n\n self.mode = \"train\"\n\n # Perform stats analysis\n self.analyze_data(data)\n\n # Pre-process the data\n data = self.preprocess(data)\n\n # Create train/test (dev) split\n train_dev_test = self.split(data)\n\n # Prepare encoders\n self.prepare(train_dev_test)\n\n # Create feature vectors from data\n 
enc_train_test = self.featurize(train_dev_test)\n\n # Prepare mixers\n self.fit(enc_train_test)\n\n # Analyze the ensemble\n self.analyze_ensemble(enc_train_test)\n\n # ------------------------ #\n # Enable model partial fit AFTER it is trained and evaluated for performance with the appropriate train/dev/test splits.\n # This assumes the predictor could continuously evolve, hence including reserved testing data may improve predictions.\n # SETjson_ai.problem_definition.fit_on_all=FalseTO TURN THIS BLOCK OFF.\n\n # Update the mixers with partial fit\n if self.problem_definition.fit_on_all:\n\n log.info(\"Adjustment on validation requested.\")\n self.adjust(\n enc_train_test[\"test\"],\n ConcatedEncodedDs([enc_train_test[\"train\"], enc_train_test[\"dev\"]]),\n )\n\n @timed\n def adjust(\n self,\n new_data: Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame],\n old_data: Optional[Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame]] = None,\n ) -> None:\n # Update mixers with new information\n\n self.mode = \"train\"\n\n # --------------- #\n # Prepare data\n # --------------- #\n if old_data is None:\n old_data = pd.DataFrame()\n\n if isinstance(old_data, pd.DataFrame):\n old_data = EncodedDs(self.encoders, old_data, self.target)\n\n if isinstance(new_data, pd.DataFrame):\n new_data = EncodedDs(self.encoders, new_data, self.target)\n\n # --------------- #\n # Update/Adjust Mixers\n # --------------- #\n log.info(\"Updating the mixers\")\n\n for mixer in self.mixers:\n mixer.partial_fit(new_data, old_data)\n\n @timed\n def predict(self, data: pd.DataFrame, args: Dict = {}) -> pd.DataFrame:\n\n self.mode = \"predict\"\n\n if len(data) == 0:\n raise Exception(\n \"Empty input, aborting prediction. 
Please try again with some input data.\"\n )\n\n # Remove columns that user specifies to ignore\n log.info(f\"Dropping features: {self.problem_definition.ignore_features}\")\n data = data.drop(\n columns=self.problem_definition.ignore_features, errors=\"ignore\"\n )\n for col in self.input_cols:\n if col not in data.columns:\n data[col] = [None] * len(data)\n\n # Pre-process the data\n data = self.preprocess(data)\n\n # Featurize the data\n encoded_ds = self.featurize({\"predict_data\": data})[\"predict_data\"]\n encoded_data = encoded_ds.get_encoded_data(include_target=False)\n\n self.pred_args = PredictionArguments.from_dict(args)\n df = self.ensemble(encoded_ds, args=self.pred_args)\n\n if self.pred_args.all_mixers:\n return df\n else:\n insights, global_insights = explain(\n data=data,\n encoded_data=encoded_data,\n predictions=df,\n ts_analysis=None,\n timeseries_settings=self.problem_definition.timeseries_settings,\n positive_domain=self.statistical_analysis.positive_domain,\n anomaly_detection=self.problem_definition.anomaly_detection,\n analysis=self.runtime_analyzer,\n target_name=self.target,\n target_dtype=self.dtype_dict[self.target],\n explainer_blocks=self.analysis_blocks,\n pred_args=self.pred_args,\n )\n return insights\n", "created_at": "2022-06-14", "data_source_name": null, "dtype_dict": null, "error": "ufunc 'absolute' did not contain a loop with signature matching types dtype('<U3') -> dtype('<U3')", "json_ai": { "accuracy_functions": [ "r2_score" ], "dependency_dict": {}, "dtype_dict": { "aircraftType": "categorical", "airline": "categorical", "airlineType": "binary", "arrivalTime": "integer", "cabinClass": "categorical", "currency": "categorical", "departDay": "integer", "departTime": "integer", "destinationCountry": "categorical", "distance": "integer", "duration": "integer", "fareClass": "categorical", "flightCode": "categorical", "holiday": "binary", "holidayAfter": "integer", "holidayBefore": "integer", "holidayMonth": "integer", 
"holidayWeek": "integer", "isDom": "binary", "isWeekend": "binary", "originCountry": "categorical", "popularRoute": "binary", "quarantineDays": "integer", "rute": "categorical", "supplier": "categorical", "total": "float", "transitArrivalTime": "integer", "transitDepartTime": "integer", "type": "categorical", "uniqueFlight": "binary", "updateHour": "integer", "vtl": "binary", "weekday": "categorical" }, "encoders": { "arrivalTime": { "args": {}, "module": "BinaryEncoder" }, "cabinClass": { "args": {}, "module": "OneHotEncoder" }, "currency": { "args": {}, "module": "BinaryEncoder" }, "departDay": { "args": {}, "module": "NumericEncoder" }, "departTime": { "args": {}, "module": "BinaryEncoder" }, "fareClass": { "args": {}, "module": "BinaryEncoder" }, "holiday": { "args": {}, "module": "BinaryEncoder" }, "holidayAfter": { "args": {}, "module": "NumericEncoder" }, "holidayBefore": { "args": {}, "module": "NumericEncoder" }, "holidayMonth": { "args": {}, "module": "OneHotEncoder" }, "holidayWeek": { "args": {}, "module": "OneHotEncoder" }, "isWeekend": { "args": {}, "module": "BinaryEncoder" }, "supplier": { "args": {}, "module": "OneHotEncoder" }, "total": { "args": { "is_target": "True", "positive_domain": "$statistical_analysis.positive_domain" }, "module": "NumericEncoder" }, "uniqueFlight": { "args": {}, "module": "NumericEncoder" }, "updateHour": { "args": {}, "module": "NumericEncoder" }, "weekday": { "args": {}, "module": "OneHotEncoder" } }, "identifiers": { "aircraftType": "No Information", "airline": "No Information", "airlineType": "No Information", "destinationCountry": "No Information", "distance": "No Information", "duration": "No Information", "flightCode": "No Information", "isDom": "No Information", "originCountry": "No Information", "popularRoute": "No Information", "quarantineDays": "No Information", "rute": "No Information", "transitArrivalTime": "No Information", "transitDepartTime": "No Information", "type": "No Information", "vtl": "No 
Information" }, "model": { "args": { "accuracy_functions": "$accuracy_functions", "args": "$pred_args", "submodels": [ { "args": { "fit_on_dev": true, "search_hyperparameters": true, "stop_after": "$problem_definition.seconds_per_mixer" }, "module": "Neural" }, { "args": { "fit_on_dev": true, "stop_after": "$problem_definition.seconds_per_mixer" }, "module": "LightGBM" }, { "args": { "stop_after": "$problem_definition.seconds_per_mixer" }, "module": "Regression" } ], "ts_analysis": null }, "module": "BestOf" }, "problem_definition": { "anomaly_detection": false, "expected_additional_time": 14.5330491065979, "fit_on_all": true, "ignore_features": [], "pct_invalid": 2, "positive_domain": false, "seconds_per_encoder": null, "seconds_per_mixer": 57024.0, "seed_nr": 420, "strict_mode": true, "target": "total", "target_weights": null, "time_aim": 259200, "timeseries_settings": { "allow_incomplete_history": true, "eval_cold_start": true, "group_by": null, "historical_columns": null, "horizon": null, "interval_periods": [], "is_timeseries": false, "order_by": null, "target_type": "", "use_previous_target": true, "window": null }, "unbias_target": true, "use_default_analysis": true } }, "mindsdb_version": "22.3.1.0", "name": "PR2039$test", "predict": "total", "problem_definition": { "anomaly_detection": false, "expected_additional_time": null, "fit_on_all": true, "ignore_features": [], "pct_invalid": 2, "positive_domain": false, "seconds_per_encoder": null, "seconds_per_mixer": null, "seed_nr": 420, "strict_mode": true, "target": "total", "target_weights": null, "time_aim": null, "timeseries_settings": { "allow_incomplete_history": true, "eval_cold_start": true, "group_by": null, "historical_columns": null, "horizon": null, "interval_periods": [], "is_timeseries": false, "order_by": null, "target_type": "", "use_previous_target": true, "window": null }, "unbias_target": true, "use_default_analysis": true }, "status": "error", "update": "up_to_date", "updated_at": 
"2022-06-14" }

Is my CREATE PREDICTOR query wrong? Is there any example or resource for the USING syntax? I have read the current documentation but still don't understand how to use it.
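Two things stand out in the pasted query, independent of MindsDB itself: the value after `dtype_dict = '` ends with `};` rather than `}'`, so the single-quoted SQL string literal is never closed, and the dict omits `weekday` even though it is selected. A quick, hypothetical way to sanity-check the JSON before pasting it into the statement (column names are from the issue; this is generic Python, not MindsDB API code):

```python
import json

# Hypothetical helper: build the dtype_dict as a Python dict, validate it as
# JSON, and emit the USING clause with the closing quote in place.
dtype_dict = {
    "type": "categorical",
    "isDom": "binary",
    "weekday": "categorical",  # selected in the query but missing from the pasted dict
    "total": "float",
    "updateHour": "integer",
    # ... remaining columns from the issue would go here ...
}

payload = json.dumps(dtype_dict)
json.loads(payload)  # raises json.JSONDecodeError if the string is malformed

using_clause = f"USING dtype_dict = '{payload}'"
print(using_clause.endswith("}'"))  # True: the SQL string literal is properly closed
```

Generating the clause programmatically like this avoids hand-editing a very long single-quoted JSON blob, which is where the missing terminator most likely crept in.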

Thanks in advance.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

ianu82 commented 2 years ago

Hey Gattak, does just running the CREATE Predictor query (so everything before the USING statement) pass correctly?

gattack commented 2 years ago

Hey Gattak, does just running the CREATE Predictor query (so everything before the USING statement) pass correctly?

In my case above, is this a correct SQL query?

CREATE PREDICTOR PR2039$test
FROM jw_ch (
    select type, isDom, airline, flightCode, fareClass, cabinClass, rute, distance,
           currency, departDay, weekday, isWeekend, holidayWeek, holidayMonth, holiday,
           holidayBefore, holidayAfter, originCountry, destinationCountry, airlineType,
           duration, supplier, departTime, arrivalTime, transitDepartTime,
           transitArrivalTime, aircraftType, popularRoute, uniqueFlight, vtl,
           quarantineDays, total, toHour(updateAt) as updateHour
    from machine_learning.pp_train_data ptd
    where flightCode = 'PR2039'
)
PREDICT total
USING dtype_dict = '{
    "type": "categorical", "isDom": "binary", "airline": "categorical",
    "flightCode": "categorical", "fareClass": "categorical", "cabinClass": "categorical",
    "rute": "categorical", "distance": "integer", "currency": "categorical",
    "departDay": "integer", "isWeekend": "binary", "holidayWeek": "integer",
    "holidayMonth": "integer", "holiday": "binary", "holidayBefore": "integer",
    "holidayAfter": "integer", "originCountry": "categorical",
    "destinationCountry": "categorical", "airlineType": "binary", "duration": "integer",
    "supplier": "categorical", "departTime": "integer", "arrivalTime": "integer",
    "transitDepartTime": "integer", "transitArrivalTime": "integer",
    "aircraftType": "categorical", "popularRoute": "binary", "uniqueFlight": "binary",
    "vtl": "binary", "quarantineDays": "integer", "total": "float",
    "updateHour": "integer" };

I ran that query and the resulting status is an error.

ianu82 commented 2 years ago

@ZoranPandovski, are you able to help please?

ZoranPandovski commented 2 years ago

Hi @gattack, can I ask why you need to specify the column types through USING? Is MindsDB detecting them incorrectly?

gattack commented 2 years ago

Yes, we found out from the MindsDB log that MindsDB defined the columns wrongly; it filtered out columns that are important for training.

ZoranPandovski commented 2 years ago

Thanks. Just to clarify: did MindsDB remove (filter) some of the columns, or did it detect the column types wrongly?

gattack commented 2 years ago

For the removed fields, I think the detection is wrong. The type column doesn't have many variants; like a classification column, it has only 3 or 7 distinct values.
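One way to see why a low-cardinality column like `type` might be misclassified as an identifier is to check distinct-value counts before training. This is a generic pandas sketch on toy data; it is not MindsDB's actual identifier heuristic, which I am only assuming keys off cardinality:

```python
import pandas as pd

# Toy frame standing in for pp_train_data; column names are from the issue.
df = pd.DataFrame({
    "type": ["A", "B", "C", "A", "B", "A"],                     # only 3 distinct values
    "flightCode": ["PR1", "PR2", "PR3", "PR4", "PR5", "PR6"],   # unique per row
    "total": [10.0, 12.5, 9.9, 11.0, 13.2, 10.5],
})

# Columns where nearly every row is distinct look like identifiers; columns
# with a handful of distinct values are better treated as categorical.
cardinality = df.nunique() / len(df)
print(cardinality.sort_values())

candidate_categoricals = cardinality[cardinality < 0.9].index.tolist()
print(candidate_categoricals)  # ['type'] in this toy example
```

If `type` really has only 3 to 7 distinct values over many rows, its cardinality ratio will be tiny, which supports forcing it to `categorical` via `dtype_dict` rather than letting it be dropped as an identifier.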

ZoranPandovski commented 2 years ago

Closing for now. @gattack, if you still experience issues or need this resolved ASAP, feel free to reopen.