m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Generate synthetic UUID annotations for historical data #876

Open pboothe opened 4 years ago

gfr10598 commented 4 years ago

We will eventually want synthetic annotations for all NDT tests, legacy paris-traceroute tests (including hops), pcap files, and neubot.

Possible sources of UUIDs:

ndttrace and speed tests share the same filename prefixes

At a minimum, we need to produce annotations for all paris-traceroute tests, and for NDT tests that predate the implementation of paris-traceroute. It looks like ndttrace capture was implemented very early, so we should be able to use it as the basis for synthetic UUID generation and annotations, and produce pcap indices at the same time.
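A minimal sketch of the shared-prefix idea, to make the pairing concrete. Everything here is an assumption for illustration: the file names are invented, the rule that the prefix is everything before the final `.` is a guess at the naming scheme, and `groupByPrefix` is a hypothetical helper, not something in the etl repo. Each resulting group would receive one synthetic UUID.

```go
package main

import (
	"fmt"
	"strings"
)

// group holds the files that would share one synthetic UUID.
type group struct {
	Prefix string
	Files  []string
}

func (g group) String() string {
	return fmt.Sprintf("%s: %d file(s)", g.Prefix, len(g.Files))
}

// groupByPrefix groups file names by everything before the final '.'.
// ASSUMPTION: the test-specific suffix (e.g. "c2s_ndttrace") follows the
// last dot. Names without a usable prefix are skipped.
// Groups are returned in first-seen order.
func groupByPrefix(names []string) []group {
	index := make(map[string]int) // prefix -> position in groups
	var groups []group
	for _, name := range names {
		i := strings.LastIndex(name, ".")
		if i <= 0 {
			continue
		}
		prefix := name[:i]
		pos, seen := index[prefix]
		if !seen {
			pos = len(groups)
			index[prefix] = pos
			groups = append(groups, group{Prefix: prefix})
		}
		groups[pos].Files = append(groups[pos].Files, name)
	}
	return groups
}

func main() {
	// Invented names, shaped only to illustrate a shared prefix.
	names := []string{
		"20100102T03:04:05.000000000Z_192.0.2.10:1234.c2s_snaplog",
		"20100102T03:04:05.000000000Z_192.0.2.10:1234.c2s_ndttrace",
		"20100102T03:07:00.000000000Z_192.0.2.20:5678.s2c_snaplog",
	}
	for _, g := range groupByPrefix(names) {
		fmt.Println(g)
	}
}
```

If the real archives add a compression suffix or use a different separator, the prefix rule would need to change accordingly.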

gfr10598 commented 4 years ago

Since the pcap and web100 data are included in the NDT archives, they can easily share a single UUID. However, do we also want UUIDs and annotations for all sidestream connections, and for all paris-traceroutes and their internal hops?

Do we want to duplicate PT entries for multiple connections (tests and sidestream) from the same client IP?

Do we want to hash the IP address in the UUID, or leave it in plain text? Sequence numbers are impractical across archives and datatypes. For NDT and sidestream, we could probably hash the IP and the timestamp together. We have millions of connections per day, so we would want a hash space of approximately 10^16, which needs about 53 bits (log2(10^16) ≈ 53.2): 14 hex chars (56 bits) or 9 base64 chars (54 bits).
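A minimal sketch of the "hash the IP and the timestamp together" option, assuming SHA-256 as the hash, a nanosecond start time as the timestamp, and the URL-safe base64 alphabet. None of these choices are settled above, and `syntheticID` is a hypothetical name. It keeps the top 54 bits of the digest and emits them as 9 base64 characters.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"net/netip"
)

// URL-safe base64 alphabet; each character carries 6 bits.
const base64URL = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

// syntheticID hashes the client IP and the connection start time, then
// encodes the top 54 bits of the digest as 9 base64 characters.
// ASSUMPTION: SHA-256 and nanosecond timestamps are illustrative choices.
func syntheticID(clientIP string, startUnixNano int64) (string, error) {
	addr, err := netip.ParseAddr(clientIP)
	if err != nil {
		return "", fmt.Errorf("parse client IP: %w", err)
	}
	ip := addr.As16() // 16-byte form, so IPv4 and IPv6 hash uniformly
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(startUnixNano))

	h := sha256.New()
	h.Write(ip[:])
	h.Write(ts[:])
	sum := h.Sum(nil)

	// Keep the top 54 bits (9 characters x 6 bits) of the first 8 bytes.
	v := binary.BigEndian.Uint64(sum[:8]) >> 10
	out := make([]byte, 9)
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = base64URL[v&0x3f]
		v >>= 6
	}
	return string(out), nil
}

func main() {
	// 192.0.2.10 is a documentation address; the timestamp is
	// 2010-01-02T03:04:05Z in Unix nanoseconds.
	id, err := syntheticID("192.0.2.10", 1262401445000000000)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

The ID is deterministic, so reprocessing an archive would reproduce the same value. Two connections from the same IP with the same timestamp would collide, which is why a fine-grained timestamp (or an extra field such as the port) would matter here.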