oeg-upm / gtfs-bench

GTFS-Madrid-Bench: A Benchmark for Knowledge Graph Construction Engines
https://doi.org/10.5281/zenodo.3574492
Apache License 2.0
17 stars 12 forks source link

exact_times in CSV is 0 while for RDB it is NULL #128

Closed arenas-guerrero-julian closed 3 years ago

arenas-guerrero-julian commented 3 years ago

Column exact_times in CSV FREQUENCIES.csv have 0 values while column exact_times for table FREQUENCIES in RDB have NULL values.

This causes some engines to provide different number of triples, as NULL values for references are to be ignored (according to R2RML spec) when generating triples.

daniel-dona commented 3 years ago

This is the origin of the difference:

https://github.com/oeg-upm/gtfs-bench/blob/205ce1a5aa9aa4eb1407c0138e5320ba3ece0524/composer/app.py#L280-L291

But I can't see how the 0 become NULL; it should be 0 and only NULL if exact_times is an empty value... maybe some problem importing the CSV file to MySQL? Needs further investigation.

dachafra commented 3 years ago

The problem is in the input dataset, the value of exact_time is always NULL, bc it's optional in the GTFS model and our dataset does not provide that information. What I don't understand is why the output contains 0...

daniel-dona commented 3 years ago

The problem is in the input dataset, the value of exact_time is always NULL, bc it's optional in the GTFS model and our dataset does not provide that information. What I don't understand is why the output contains 0...

The data that VIG uses have NULL values:


DROP TABLE IF EXISTS `FREQUENCIES`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `FREQUENCIES` (
  `trip_id` varchar(200) NOT NULL,
  `start_time` varchar(200) NOT NULL,
  `end_time` varchar(200) DEFAULT NULL,
  `headway_secs` int(11) DEFAULT NULL,
  `exact_times` int(11) DEFAULT '0',
  PRIMARY KEY (`trip_id`,`start_time`),
  CONSTRAINT `FREQUENCIES_ibfk_1` FOREIGN KEY (`trip_id`) REFERENCES `TRIPS` (`trip_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `FREQUENCIES`
--

LOCK TABLES `FREQUENCIES` WRITE;
/*!40000 ALTER TABLE `FREQUENCIES` DISABLE KEYS */;
INSERT INTO `FREQUENCIES` VALUES ('4_I12-001_2016I12_1_1_4__1___','00:00:00','01:00:00',600,NULL),('4_I12-001_2016I12_1_1_4__1___','01:00:00','02:00:00',900,NULL),('4_I12-001_2016I12_1_1_4__1___','06:05:00','07:00:00',540,NULL)

 [...]

And confirmed:

mysql> select * from FREQUENCIES LIMIT 10;
+-------------------------------+------------+----------+--------------+-------------+
| trip_id                       | start_time | end_time | headway_secs | exact_times |
+-------------------------------+------------+----------+--------------+-------------+
| 4_I12-001_2016I12_1_1_4__1___ | 00:00:00   | 01:00:00 |          600 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 01:00:00   | 02:00:00 |          900 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 06:05:00   | 07:00:00 |          540 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 07:00:00   | 09:00:00 |          480 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 09:00:00   | 18:00:00 |          480 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 18:00:00   | 21:30:00 |          360 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 21:30:00   | 23:00:00 |          390 |        NULL |
| 4_I12-001_2016I12_1_1_4__1___ | 23:00:00   | 24:00:00 |          450 |        NULL |
| 4_I12-001_2016I12_2_1_4__1___ | 00:00:00   | 01:00:00 |          600 |        NULL |
| 4_I12-001_2016I12_2_1_4__1___ | 01:00:00   | 02:00:00 |          900 |        NULL |
+-------------------------------+------------+----------+--------------+-------------+
10 rows in set (0.00 sec)

Maybe a VIG bug?

dachafra commented 3 years ago

I'll make some tests locally

dachafra commented 3 years ago

seems that the problem is here: SET exact_times = IF(exact_times='',NULL,exact_times);

daniel-dona commented 3 years ago

Possible problems on SET statements while importing data, further investigation needed.

Relevant:

LOAD DATA interprets an empty field value differently from a missing field:

https://dev.mysql.com/doc/refman/8.0/en/load-data.html https://dev.mysql.com/doc/refman/8.0/en/datetime.html

dachafra commented 3 years ago

if there are other issues regarding set in the loading but is not in exact_times (as it has been solved) please @daniel-dona open other issues