hasadna / open-bus

:bus: Analysing Israel's public transport data

SiriRide calculated attributes - v1 #315

Open EyalBerger opened 4 years ago

EyalBerger commented 4 years ago

Following our last discussions and the restart of the SiriRide entity task, I'm creating this issue to list the planned SiriRide calculated attributes (v1).

I have separated them into different classes according to their complexity:

| Class | Attr name | Attr desc | Comments |
|---|---|---|---|
| 1 | agency_id | | |
| 1 | route_id | | |
| 1 | route_short_name | | |
| 1 | bus_id | | |
| 1 | planned_start_date | | |
| 1 | planned_start_time | | |
| 1 | points_time_list | list of points timestamp by time_recorded | |
| 1 | points_latlon_list | list of points latlon | |
| 2 | points_cnt | number of Geo points in SiriRide | |
| 3 | ride_in_gtfs | specific ride is listed in the GTFS | agency_id + route_id + planned_start_date + planned_start_time |
| 3 | ride_date_in_gtfs | ride date is listed in the GTFS | agency_id + route_id + planned_start_date |
| 3 | ride_route_in_gtfs | ride route is listed in the GTFS | agency_id + route_id |
| 3 | ride_agency_in_gtfs | ride agency is listed in the GTFS | agency_id |
| 4 | stops_matching_pct_500 | stops match percentage with buffer of 500m over each stop | will be calculated only for ride_date_in_gtfs = 1 |
| 4 | stops_matching_pct_1000 | stops match percentage with buffer of 1000m over each stop | will be calculated only for ride_date_in_gtfs = 1 |
| 4 | start_time_est | estimated ride start time in first station | will be calculated only for X% of stops_matching_pct_1000 |
| 4 | end_time_est | estimated ride end time in last station | will be calculated only for X% of stops_matching_pct_1000 |
| 4 | driving_time_est | estimated driving time from first station to last station | will be calculated only for X% of stops_matching_pct_1000 |
| 4 | driving_speed_est | estimated average driving speed from first station to last station | will be calculated only for X% of stops_matching_pct_1000 |
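
As a rough illustration of the class 3 attributes, the four `*_in_gtfs` flags can be read as membership tests on progressively shorter prefixes of the same key. The sketch below assumes the GTFS side is already available as a collection of (agency_id, route_id, planned_start_date, planned_start_time) tuples; `RideKey`, `build_gtfs_indexes` and `gtfs_flags` are hypothetical names, not part of the project's code.

```python
from typing import NamedTuple


class RideKey(NamedTuple):
    """Hypothetical key identifying one planned ride."""
    agency_id: str
    route_id: str
    planned_start_date: str  # e.g. "2020-10-16"
    planned_start_time: str  # e.g. "06:25:00"


def build_gtfs_indexes(gtfs_rides):
    """Pre-compute one lookup set per flag from an iterable of RideKey."""
    rides = set(gtfs_rides)
    return {
        "ride_in_gtfs": rides,
        "ride_date_in_gtfs": {r[:3] for r in rides},
        "ride_route_in_gtfs": {r[:2] for r in rides},
        "ride_agency_in_gtfs": {r[:1] for r in rides},
    }


def gtfs_flags(ride, indexes):
    """Return the four class 3 flags (1/0) for a single SIRI ride."""
    keys = {
        "ride_in_gtfs": tuple(ride),        # agency + route + date + time
        "ride_date_in_gtfs": ride[:3],      # agency + route + date
        "ride_route_in_gtfs": ride[:2],     # agency + route
        "ride_agency_in_gtfs": ride[:1],    # agency only
    }
    return {name: int(key in indexes[name]) for name, key in keys.items()}
```

With this layout, a ride whose start time is missing from the GTFS but whose date is present would get `ride_in_gtfs = 0` and `ride_date_in_gtfs = 1`, which is the condition the class 4 attributes depend on.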

It's a very initial list. Please edit it with your own insights.

evyatark commented 4 years ago

I think we should add the attribute "makat number", in addition to agency_id, route_id, route_short_name.

cjer commented 4 years ago

Some thoughts and fields I think we also need:

AvivSela commented 4 years ago
  1. We could also have a class for siri-ride that holds multiple siri-records (each with date-time and lat-lon attributes); it could be easier to have one list than two.
  2. What do you say about merging planned_start_date and planned_start_time together?
  3. If we are going to have those 2 classes (siri-ride and siri-record), each of them could have an "analytics" member that holds a dictionary with all the metrics. For example, "points_cnt" will be in the siri-ride object while "speed" will be in the siri-record.
adiwaz commented 4 years ago

I like the idea of dividing the variables into complexity classes. My suggestions/comments:

  1. I think we should classify each variable by 2 criteria:
    1. Data needed (siri ride only, gtfs, etc.)
    2. Data Science work needed (e.g. straightforward aggregation vs statistical model required)
  2. I don't understand the difference between complex calculations (class 4) and models (class 6). I prefer a clearer definition of the data science solution complexity (see above), one that does not require us to decide in advance which type of DS solution (ML/statistical model...) will be the best for each "complex" variable.
  3. Let's add dependencies: if driving time requires start_time and end_time, and given them it is a straightforward calculation, let's mention it.
  4. On top of Dan's suggestions I would also add:
    1. total_ride_time_raw : time from first non 0 time point until the last one. This variable will help us to easily detect data anomalies with too long and too short rides.
    2. is_match_route: is the route ID mentioned in SIRI matching the expected route shape (from GTFS)?
  5. I didn't understand the variables stops_matching_pct_*. Maybe add further description?
  6. In general I think we should focus now on defining and creating the "straightforward" variables, and later focus on variables that require statistics/modeling.
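
One possible reading of stops_matching_pct_*, offered only as a sketch: the percentage of the route's planned stops that have at least one recorded ride point within the buffer distance (500 m or 1000 m). This interpretation, the function names and the use of the haversine formula are all assumptions; the issue does not define the calculation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stops_matching_pct(stops_latlon, points_latlon_list, buffer_m):
    """Percent of stops with at least one ride point inside the buffer.

    Returns None when there are no stops to match against.
    """
    if not stops_latlon:
        return None
    matched = sum(
        1
        for s_lat, s_lon in stops_latlon
        if any(
            haversine_m(s_lat, s_lon, p_lat, p_lon) <= buffer_m
            for p_lat, p_lon in points_latlon_list
        )
    )
    return 100.0 * matched / len(stops_latlon)
```

Under this reading, stops_matching_pct_500 and stops_matching_pct_1000 are the same function called with `buffer_m=500` and `buffer_m=1000`. The nested loop is quadratic in the number of points and stops, which is fine for a sketch but would likely need a spatial index at scale.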
adiwaz commented 4 years ago

@AvivSela - I didn't understand your suggestion in (1), what is the purpose of each class? Why should they be separated? Regarding (2) - I think that merging the date and time can hurt efficiency of indexing. Maybe we would like to index the date and not the time.

AvivSela commented 4 years ago
  1. It's easier to loop over them:
    for ind, point_time in enumerate(points_time_list):
        time = point_time
        lat, lon = points_latlon_list[ind]

    Vs.

    for record in records:
        time = record.time
        lat, lon = record.latlon
  2. It's less error-prone when adding a new record, which would otherwise have to be split in two and inserted at the same index in both lists.
  3. It's easier to sort in case of modifications.
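
A minimal sketch of the two-class layout discussed above, assuming plain Python dataclasses. The field names `time` and `latlon` come from the loop example; the `analytics` dictionaries follow the earlier suggestion; `add_record` and everything else here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Tuple


@dataclass
class SiriRecord:
    """One observation of the vehicle."""
    time: datetime
    latlon: Tuple[float, float]
    # Per-record metrics, e.g. "speed".
    analytics: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SiriRide:
    """A ride holding a single list of records instead of two parallel lists."""
    records: List[SiriRecord] = field(default_factory=list)
    # Per-ride metrics, e.g. "points_cnt".
    analytics: Dict[str, Any] = field(default_factory=dict)

    def add_record(self, record: SiriRecord) -> None:
        """Insert a record and keep the list ordered by time."""
        self.records.append(record)
        self.records.sort(key=lambda r: r.time)
        self.analytics["points_cnt"] = len(self.records)
```

Because time and location travel together in one object, adding or sorting records cannot leave the two attributes out of step, which is the failure mode of keeping points_time_list and points_latlon_list separately.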
EyalBerger commented 4 years ago

Thank you all for your comments and insights!

I updated the design accordingly.

The variables list became too long so I ended up opening a design doc for it. Please see here.

In summary:

  1. I added most of the suggested variables (see exceptions in the "open issues" section below) and some more (total ~30 raw data/straightforward calculations and ~10 complex calculations).
  2. I added variables dependencies.
  3. I separated the data categories (what was called "classes" in the previous comment) by the 2 criteria @adiwaz mentioned.

Open issues:

  1. "makat number" - @evyatark I didn't find the column in the Splunk siri data. What is the meaning of this column? Do you know its "Splunk" name?
  2. "is_match_route" - @adiwaz, I think that for this version it will be simpler (from an IT and DS perspective) not to use GTFS shape files, and to build our "match route" variables based on GTFS route_stats only (stops data).
  3. SiriRecord class - @AvivSela, I assume this is more of an IT-related issue than a data-related issue.
  4. planned_end_datetime_gtfs - I didn't find this data in gtfs route_stats. Don't we get/collect it?
EyalBerger commented 4 years ago

Following the 15/4 Zoom meeting, some updates are required in the data design:

EyalBerger commented 4 years ago

I added data types and updated the dependencies (for variables based directly on Siri or GTFS) in the data design.

AvivSela commented 4 years ago

Hi, I looked at SIRI 2.8. It might take some time, but we will get there. There are some more fields there that come "free of charge", without the need to calculate them. Here is an example of the JSON format: ICD_SM_2_8_ver25.pdf

{
    "-version": "2.8",
    "ResponseTimestamp": "2020-10-16T06:32:30+03:00",
    "Status": "true",
    "MonitoredStopVisit": [
        {
            "RecordedAtTime": "2020-10-16T06:32:19+03:00",
            "ItemIdentifier": "1455075547",
            "MonitoringRef": "47507",
            "MonitoredVehicleJourney": {
                "LineRef": "28209",
                "DirectionRef": "1",
                "FramedVehicleJourneyRef": {
                    "DataFrameRef": "2020-10-16",
                    "DatedVehicleJourneyRef": "50698246"
                },
                "PublishedLineName": "52",
                "OperatorRef": "3",
                "DestinationRef": "47453",
                "OriginAimedDepartureTime": "2020-10-16T06:25:00+03:00",
                "VehicleLocation": {
                    "Longitude": "35.079803",
                    "Latitude": "32.823952"
                },
                "Bearing": "8",
                "Velocity": "29",
                "VehicleRef": "7576269",
                "MonitoredCall": {
                    "StopPointRef": "47507",
                    "Order": "26",
                    "ExpectedArrivalTime": "2020-10-16T06:49:00+03:00",
                    "DistanceFromStop": "4009"
                }
            }
        }
    ]
}

If I take those fields and combine them into one object that represents a ride with a list of records holding the observations over time, I get the following schema:

SiriRide
    LineRef: "Reference to a LINE"
    DirectionRef: "Reference to a DIRECTION the VEHICLE is running along the LINE"
    FramedVehicleJourneyRef_DataFrameRef: "The date part of the trip ID"
    FramedVehicleJourneyRef_DatedVehicleJourneyRef: "The number part of trip ID"
    PublishedLineName: "The bus number, as published on the bus"
    OperatorRef: "The Operator code"
    DestinationRef: "The destination stop code"
    VehicleRef: "Vehicle number. The value should match the license number of the Vehicle"
    OriginAimedDepartureTime: "The start time of the Journey, according to the licensing system. The value should match DepartureTime in the TripIdToDate.txt file of the GTFS"
    SiriRecords
        ResponseTimestamp: "The time of the Response"
        RecordedAtTime: "Time at which data was recorded at the Vehicle"
        VehicleLocation
            Longitude: Longitude from Greenwich
            Latitude: Latitude from equator
        Bearing: "Vehicle bearing with respect to the North"
        Velocity: "Vehicle speed at Km/h."
        StopPointRef: "The stop code of the stop that the Vehicle is stopping at now, or recently visited"
        Order: "The stop order of the stop that the Vehicle is stopping at now, or recently visited"
        DistanceFromStop: "The distance that the Vehicle travelled from the start of the journey. in meters"
{
  "title": "SiriRide",
  "type": "object",
  "properties": {
    "LineRef": {
      "title": "Lineref",
      "description": "Reference to a LINE ",
      "type": "integer"
    },
    "DirectionRef": {
      "title": "Directionref",
      "description": "Reference to a DIRECTION the VEHICLE is running along the LINE",
      "type": "integer"
    },
    "FramedVehicleJourneyRef_DataFrameRef": {
      "title": "Framedvehiclejourneyref Dataframeref",
      "description": "The date part of the trip ID",
      "type": "string",
      "format": "date"
    },
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef": {
      "title": "Framedvehiclejourneyref Datedvehiclejourneyref",
      "description": "The number part of trip ID",
      "type": "integer"
    },
    "PublishedLineName": {
      "title": "Publishedlinename",
      "description": "The bus number, as published on the bus",
      "type": "string"
    },
    "OperatorRef": {
      "title": "Operatorref",
      "description": "The Operator code",
      "type": "integer"
    },
    "DestinationRef": {
      "title": "Destinationref",
      "description": "The destination stop code",
      "type": "integer"
    },
    "VehicleRef": {
      "title": "Vehicleref",
      "description": "Vehicle number. The value should match the license number of the Vehicle",
      "type": "integer"
    },
    "OriginAimedDepartureTime": {
      "title": "Originaimeddeparturetime",
      "description": "The start time of the Journey, according to the licensing system. The value should match DepartureTime in the TripIdToDate.txt file of the GTFS",
      "type": "string",
      "format": "date-time"
    },
    "SiriRecords": {
      "title": "Sirirecords",
      "description": "represent one observation on a vehicle over time",
      "type": "array",
      "items": {
        "$ref": "#/definitions/SiriRecord"
      }
    }
  },
  "required": [
    "LineRef",
    "DirectionRef",
    "FramedVehicleJourneyRef_DataFrameRef",
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef",
    "PublishedLineName",
    "OperatorRef",
    "DestinationRef",
    "VehicleRef",
    "OriginAimedDepartureTime",
    "SiriRecords"
  ],
  "definitions": {
    "GeoPoint": {
      "title": "GeoPoint",
      "type": "object",
      "properties": {
        "Longitude": {
          "title": "Longitude",
          "description": "Longitude from Greenwich",
          "type": "number"
        },
        "Latitude": {
          "title": "Latitude",
          "description": "Latitude from equator",
          "type": "number"
        }
      },
      "required": [
        "Longitude",
        "Latitude"
      ]
    },
    "SiriRecord": {
      "title": "SiriRecord",
      "type": "object",
      "properties": {
        "ResponseTimestamp": {
          "title": "Responsetimestamp",
          "description": "The time of the Response",
          "type": "string",
          "format": "date-time"
        },
        "RecordedAtTime": {
          "title": "Recordedattime",
          "description": "Time at which data was recorded at the Vehicle",
          "type": "string",
          "format": "date-time"
        },
        "VehicleLocation": {
          "title": "Vehiclelocation",
          "description": "Vehicle Location",
          "allOf": [
            {
              "$ref": "#/definitions/GeoPoint"
            }
          ]
        },
        "Bearing": {
          "title": "Bearing",
          "description": "Vehicle bearing with respect to the North",
          "minimum": 0,
          "maximum": 360,
          "type": "integer"
        },
        "Velocity": {
          "title": "Velocity",
          "description": "Vehicle speed at Km/h.",
          "minimum": 0,
          "type": "integer"
        },
        "StopPointRef": {
          "title": "Stoppointref",
          "description": "The stop code of the stop that the Vehicle is stopping at now, or recently visited",
          "type": "integer"
        },
        "Order": {
          "title": "Order",
          "description": "The stop order of the stop that the Vehicle is stopping at now, or recently visited",
          "type": "integer"
        },
        "DistanceFromStop": {
          "title": "Distancefromstop",
          "description": "The distance that the Vehicle travelled from the start of the journey. in meters",
          "type": "integer"
        }
      },
      "required": [
        "ResponseTimestamp",
        "RecordedAtTime",
        "VehicleLocation",
        "Bearing",
        "Velocity",
        "StopPointRef",
        "Order",
        "DistanceFromStop"
      ]
    }
  }
}
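
To illustrate how a raw SIRI 2.8 response could be folded into this structure, here is a rough sketch. It assumes every field shown in the JSON example is present in each visit, that a ride is identified by the pair (DataFrameRef, DatedVehicleJourneyRef), and that timestamps are kept as ISO strings; `to_siri_rides` is a hypothetical helper, not existing project code.

```python
def to_siri_rides(response):
    """Group the visits of one SIRI response into SiriRide-shaped dicts."""
    rides = {}
    for visit in response["MonitoredStopVisit"]:
        journey = visit["MonitoredVehicleJourney"]
        trip = journey["FramedVehicleJourneyRef"]
        call = journey["MonitoredCall"]
        location = journey["VehicleLocation"]

        # Assumed ride key: date part + number part of the trip ID.
        key = (trip["DataFrameRef"], trip["DatedVehicleJourneyRef"])
        if key not in rides:
            rides[key] = {
                "LineRef": int(journey["LineRef"]),
                "DirectionRef": int(journey["DirectionRef"]),
                "FramedVehicleJourneyRef_DataFrameRef": trip["DataFrameRef"],
                "FramedVehicleJourneyRef_DatedVehicleJourneyRef": int(
                    trip["DatedVehicleJourneyRef"]
                ),
                "PublishedLineName": journey["PublishedLineName"],
                "OperatorRef": int(journey["OperatorRef"]),
                "DestinationRef": int(journey["DestinationRef"]),
                "VehicleRef": int(journey["VehicleRef"]),
                "OriginAimedDepartureTime": journey["OriginAimedDepartureTime"],
                "SiriRecords": [],
            }

        # Per-observation fields go into the ride's list of records.
        rides[key]["SiriRecords"].append({
            "ResponseTimestamp": response["ResponseTimestamp"],
            "RecordedAtTime": visit["RecordedAtTime"],
            "VehicleLocation": {
                "Longitude": float(location["Longitude"]),
                "Latitude": float(location["Latitude"]),
            },
            "Bearing": int(journey["Bearing"]),
            "Velocity": int(journey["Velocity"]),
            "StopPointRef": int(call["StopPointRef"]),
            "Order": int(call["Order"]),
            "DistanceFromStop": int(call["DistanceFromStop"]),
        })
    return list(rides.values())
```

Real responses may omit some fields (for example a missing location), so a production version would need to decide how to handle partial visits rather than raise a KeyError.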