google / transit

https://gtfs.org/
Apache License 2.0

Best Practices: Reasonable lengths in unique ids #518

Open skinkie opened 14 hours ago

skinkie commented 14 hours ago

Describe the problem

Today, I was confronted with a GTFS feed that used 100+ character IDs for individual trip_ids. You may guess what the size of the feed looked like. Their old feeds used only 8 digits as trip_id.

Use cases

Efficient resource usage.

Proposed solution

I want to propose a 36-byte soft limit as a best practice for any identifier used in GTFS. A UUID would fit; I would say even a NeTEx ServiceJourney or ScheduledStopPoint identifier would fit as a whole. If a value exceeds 36 bytes, a clear warning can and should be presented.
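As a rough illustration of what such a warning could look like, here is a minimal sketch in Python. It assumes the limit is measured on the UTF-8 encoded value; the function name `find_long_ids`, the constant `SOFT_LIMIT_BYTES`, and the sample rows are all hypothetical and not taken from any existing validator.

```python
import csv
import io

# The 36-byte soft limit proposed above; this is NOT part of the GTFS spec.
SOFT_LIMIT_BYTES = 36


def find_long_ids(csv_text, id_column, limit=SOFT_LIMIT_BYTES):
    """Return (row_number, value, byte_length) for every ID over the limit."""
    findings = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        value = row.get(id_column) or ""
        # Measure bytes, not characters, assuming UTF-8 encoding.
        byte_length = len(value.encode("utf-8"))
        if byte_length > limit:
            findings.append((row_number, value, byte_length))
    return findings


# Hypothetical trips.txt content: an 8-digit ID, a 100-character ID,
# and a UUID-shaped ID (exactly 36 bytes, so it stays within the limit).
sample = (
    "route_id,service_id,trip_id\n"
    "R1,WK,12345678\n"
    "R1,WK," + "x" * 100 + "\n"
    "R1,WK,123e4567-e89b-12d3-a456-426614174000\n"
)

for row_number, value, byte_length in find_long_ids(sample, "trip_id"):
    print(
        f"warning: trips.txt row {row_number}: trip_id is {byte_length} bytes "
        f"(soft limit {SOFT_LIMIT_BYTES})"
    )
```

Because this is a soft limit, the sketch only reports findings and never rejects the feed.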

leonardehrenfried commented 14 hours ago

I must admit this one made me laugh - no matter how many rules the spec imposes, data producers will always find new ways to create a mess.

Famous last words: "36 characters ought to be enough for anyone."

laurentg commented 14 hours ago

I'm not certain which problem we are trying to solve here. The zip will compress long and repeated IDs nicely, and the trip count is never so large that the data cannot be stored, even in memory. Also, giving an (arbitrary) limit could be taken as an excuse by some re-users to justify not being able to consume feeds with a few IDs larger than this limit. I've encountered some re-users that cannot ingest GTFS with IDs longer than 80 or 255 chars, for example, even if only a few IDs are above that threshold.

In summary, I'm rather against this; to me it is rather useless, somewhat arbitrary and open to misinterpretation.

skinkie commented 13 hours ago

@laurentg in this case the feed of "last week" was 74MB compressed, and this week's is 890MB compressed. For compression itself to work properly, some things must be guaranteed first, for example that the data in the files is sorted. But whether it compresses well or not is beside the point: processing and matching still require these idiotically long strings to be stored in memory, unless the implementation throws them overboard anyway and creates hashes.

With respect to your other comment, that the data is never too big to store in memory: 372996 trips multiplied by 100 bytes is indeed "only" 37MB. But it could also have been just 2.2MB.
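The arithmetic can be checked with a short Python sketch. The comment does not say how the 2.2MB figure was derived; the 6-byte ID length used here is an assumption (six decimal digits are enough to number 372996 trips), and it counts only the raw ID bytes, ignoring any per-string overhead of a real implementation. The function name `raw_id_bytes` is hypothetical.

```python
TRIP_COUNT = 372_996  # trip count quoted in the comment above


def raw_id_bytes(trip_count, id_length_bytes):
    """Raw bytes needed to hold one ID per trip, ignoring any overhead."""
    return trip_count * id_length_bytes


long_ids = raw_id_bytes(TRIP_COUNT, 100)  # 100-byte IDs, as in the new feed
short_ids = raw_id_bytes(TRIP_COUNT, 6)   # ASSUMPTION: 6-digit decimal IDs

print(f"100-byte IDs: {long_ids / 1e6:.1f} MB")  # 37.3 MB
print(f"6-byte IDs:   {short_ids / 1e6:.1f} MB")  # 2.2 MB
```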

The example that you give, that GTFS IDs with a length of 80 to 255 characters already exist, shows that we need a best practice. Nobody in their right mind has more than 10^80 stops in the network.