vingerha / gtfs2

Support GTFS in Home Assistant GUI-only
https://github.com/vingerha/gtfs2
MIT License

[FEATURE]: Merge/Update old Static Data with New Static Data, purging outdated Services #71

Closed dafunkydan closed 1 month ago

dafunkydan commented 1 month ago

Describe the solution you'd like My local transportation provider publishes a new static dataset once a week, and thinks it is a good idea to publish data that is not yet valid. Sigh. So if I update, the old, still-valid data gets overwritten with new, not-yet-valid data.

E.g. right now (20240516), the latest zip https://rnv-dds-prod-gtfs.azurewebsites.net/latest/gtfs.zip contains a calendar.txt with data like:

"start_date","end_date","monday","tuesday","wednesday","thursday","friday","saturday","sunday","service_id"
"20240517","20240531","1","1","1","1","1","0","0","138-142-143-144-145-148-149-150-152"
"20240517","20240531","1","1","1","1","1","1","1","138-139-140-141-142-143-144-145-146-147-148-149-150-151-152"
"20240517","20240531","1","1","1","1","1","1","0","138-139-142-143-144-145-146-148-149-150-152"
"20240517","20240531","1","1","1","1","1","0","1","138-140-141-142-143-144-145-147-148-149-150-151-152"
"20240518","20240530","1","0","0","1","0","1","1","139-140-141-146-147-151"
"20240518","20240525","0","0","0","0","0","1","0","139-146"
"20240518","20240518","0","0","0","0","0","1","0","139"
........

And the overwritten zip, which can no longer be downloaded, is the one that is actually valid right now; its calendar.txt contains lines like:

"start_date","end_date","monday","tuesday","wednesday","thursday","friday","saturday","sunday","service_id"
"20240510","20240524","1","1","1","1","1","0","0","131-134-135-136-137-138-142-143-144-145"
"20240510","20240524","1","1","1","1","1","1","1","131-132-133-134-135-136-137-138-139-140-141-142-143-144-145"
"20240510","20240524","1","1","1","1","1","1","0","131-132-134-135-136-137-138-139-142-143-144-145"
"20240510","20240524","1","1","1","1","1","0","1","131-133-134-135-136-137-138-140-141-142-143-144-145"
"20240511","20240520","1","0","0","0","0","1","1","132-133-139-140-141"
"20240511","20240518","0","0","0","0","0","1","0","132-139"
"20240512","20240520","1","0","0","0","0","0","1","133-140-141"
"20240510","20240523","1","1","1","1","1","0","0","131-134-135-136-137-138-142-143-144"
"20240510","20240523","1","1","1","1","1","1","0","131-132-134-135-136-137-138-139-142-143-144"
"20240510","20240523","1","1","1","1","1","1","1","131-132-133-134-135-136-137-138-139-140-141-142-143-144"
"20240510","20240523","1","1","1","1","1","0","1","131-133-134-135-136-137-138-140-141-142-143-144"
"20240510","20240517","1","1","1","1","1","0","0","131-134-135-136-137-138"
"20240521","20240524","0","1","1","1","1","0","0","142-143-144-145"
"20240524","20240524","0","0","0","0","1","0","0","145"

It would be awesome to have the data merged rather than overwritten, with services whose end_date lies before today getting deleted.

I fear it is technically not a simple task, or maybe it is just my local transportation provider (again). However, since GTFS seems to provide for this via the calendar keys, it might be relevant to more people (if possibly less frequently).
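
(To illustrate the idea: a minimal sketch of such a merge for calendar.txt alone, assuming standard GTFS columns; merge_calendars and the file paths are hypothetical, not part of gtfs2. A real merge would also have to reconcile trips, stop_times, calendar_dates and so on, which is where the complexity discussed below comes from.)

# Hypothetical sketch: merge an old and a new calendar.txt, with the new
# file winning on duplicate service_ids and already-expired services purged.
import csv
import datetime

def merge_calendars(old_path, new_path, out_path):
    today = datetime.date.today().strftime("%Y%m%d")
    rows, fields = {}, None
    for path in (old_path, new_path):  # later file wins on duplicate service_id
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            fields = reader.fieldnames
            for row in reader:
                if row["end_date"] >= today:  # drop services that already ended
                    rows[row["service_id"]] = row
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows.values())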

Describe alternatives you've considered I need to manually check whether the new file is too fresh, and try again the next day.

Additional context No other context.

vingerha commented 1 month ago

I fear it is technically not a simple task

Try searching the web for it... :) I had a similar idea, as my local area is served by 4 providers (3 bus, 1 train) and I have to search each one of them separately... especially for local_stops this is not nice. The solutions on the web proved way too complex vs. the gain (my view on my gain, to be honest). In my case the new data is overlapping, or at a minimum it is created for the day that it has data... so I never had an issue (so far).

Another challenge of the above proposal is: what to do after the merge? In one week you have the same situation again, and the merged dataset will grow and grow. A 'simple' (possibly simple?) thing I can try to look into is to search for the youngest date in calendar and calendar_dates and, if this is newer than today: abort.

dafunkydan commented 1 month ago

Try searching the web for it... :)

Probably it is like googling diagnoses and bringing them to the doctor.... ;-)

I can try to look into is to search for the youngest date in calendar and calendar_dates and if this is newer than today: abort

Given that the data are not incremental or overlapping (I haven't looked at it yet; so far I have never missed "some" connections), that would be awesome, as it basically does what I would do manually every time 💯

vingerha commented 1 month ago

It was more meant as: see how not-at-all-easy this is :)

dafunkydan commented 1 month ago

It was more meant as: see how not-at-all-easy this is :)

....I know, I know.... what I meant was: I am really good at searching until I find the article "Rocket to the Moon with these 10 simple Steps" :-)

vingerha commented 1 month ago

The 'simple' solution is not very technical, but it affects the current process quite a bit: download zip file > extract calendar.txt > page through the date column and add each value to a list > find the min value of the list > do something or abort > delete calendar.txt
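
(For illustration, a minimal sketch of that pipeline in plain Python, assuming an already-downloaded zip and reading straight from it instead of extracting and deleting; the function name feed_is_future_only is hypothetical, not part of gtfs2. Going through the csv module has the side benefit of handling quoted values.)

# Hypothetical sketch: return True if calendar.txt only holds future services.
import csv
import datetime
import io
import zipfile

def feed_is_future_only(zip_path):
    with zipfile.ZipFile(zip_path) as z:
        with z.open("calendar.txt") as f:
            # csv handles quoted fields; utf-8-sig tolerates a BOM
            reader = csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig"))
            start_dates = [row["start_date"] for row in reader]
    earliest = datetime.datetime.strptime(min(start_dates), "%Y%m%d").date()
    return earliest > datetime.date.today()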

vingerha commented 1 month ago

Now I need motivation ... and since this does not help ME ... not that easy :)

dafunkydan commented 1 month ago

Now I need motivation ... and since this does not help ME ... not that easy :)

I was hoping that making you smile was an appropriate way 😉 Well, implementing such a check would not only prevent other people from seeing no connections (and probably filing bug reports), it would also make me extremely happy, as I can rely more and more on your awesome integration! 😀

vingerha commented 1 month ago

the needs of the many outweigh the needs of the few

vingerha commented 1 month ago

or better: the needs of the many (or myself) outweigh the needs of the few

dafunkydan commented 1 month ago

implementing such a check would not only prevent other people from seeing no connections (and probably filing bug reports)

If this doesn't fulfill both the needs of the many and your own, I don't know what does 😀

vingerha commented 1 month ago

well... till now you are the only one ... so you are free to provide a PR (LOL)

dafunkydan commented 1 month ago

I would love to continue joking, but I am not sure it wouldn't lead things in the wrong direction... 😉

vingerha commented 1 month ago

I did a quick check with code and all, and sadly it sucks again... this is from the smallest dataset I have (calendar.txt):


service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
B_SEPT23-A04-Lun-Mar-10,1,1,0,0,0,0,0,20240506,20240507
B_SEPT23-A04-Lun-Mar-10-0100000-B_SEPT23-A05-Jeu-Ven-00-0000100-B_SEPT23-A05-Lun-Mar-00-0100000,0,0,0,0,0,0,0,20240506,20240707
B_SEPT23-A04-Lun-Mar-10-0100000-B_SEPT23-A05-Jeu-Ven-00-0000100-B_SEPT23-A05-Lun-Mar-00-0100000-B_SEPT23-A05-Mercredi-00-0010000,0,0,0,0,0,0,0,20240506,20240707
B_SEPT23-A04-Lun-Mar-10-1000000-B_SEPT23-A05-Jeu-Ven-00-0001000-B_SEPT23-A05-Lun-Mar-00-1000000,0,0,0,0,0,0,0,20240506,20240707

vingerha commented 1 month ago

This means I would have to build in so many checks, it is really a pita.... Sorry, but at this point I am not spending that much effort on it.

dafunkydan commented 1 month ago

Too bad 😢 But I totally see your point.

Any best-practice ideas for what I could do? Right now I am thinking about querying the size, delaying the download for about two days, and then seeing (manually, or by not seeing connections) whether it is already valid, and if not, restoring the previous data.

I hassle with my provider anyway. Why would one publish data that is not yet valid? They could easily include all connections that are valid at the moment of publishing.....

vingerha commented 1 month ago

I will give it one more try...only because I hate having simple challenges :) But... the next simplest way could be:

vingerha commented 1 month ago

So I have a rudimentary version. I am not sure if you are aware of the 'calendar_dates' file. The logic is now: if a date arises in either calendar.txt or calendar_dates.txt that is newer than today, then 'abort'. It assumes that there is either a column 'date' or 'start_date', and it filters out the " character ... so I do hope that there is no other quoting character in other data sources.

# Load the ZIP archive and collect the earliest date per calendar file
# (gtfs_dir and filename come from the surrounding helper)
import datetime
import os
import zipfile

zin = zipfile.ZipFile(os.path.join(gtfs_dir, filename), 'r')
check_list = []
for item in zin.infolist():
    if item.filename[0:8] == 'calendar':
        # calendar.txt carries start_date, calendar_dates.txt a plain date
        if item.filename == 'calendar.txt':
            column = 'start_date'
        else:
            column = 'date'
        with open(zin.extract(item.filename)) as f:
            header = f.readline().strip('\n')
            data = f.readlines()
            index = header.replace('"', '').split(',').index(column)
            dates = []  # renamed from 'list' to avoid shadowing the builtin
            for line in data:
                # strip quotes from the values too, some feeds quote every field
                dates.append(line.split(',')[index].replace('"', '').strip())
            check_list.append(min(dates))
print(check_list)
# note: %m is the month directive, %M would be minutes
min_date = datetime.datetime.strptime(min(check_list), "%Y%m%d")
if min_date > datetime.datetime.now():
    print('aborting')

vingerha commented 1 month ago

The trick is still that I need to retain a copy of the existing sqlite in case it is newer...coding is sooo much fun
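
(For illustration, a hedged sketch of that retain-and-restore step, assuming a plain file copy is enough; backup_sqlite, restore_sqlite and the path handling are hypothetical, not the integration's actual code.)

# Hypothetical sketch: back up the existing sqlite before extracting,
# restore the copy if the new feed turns out to be future-only.
import os
import shutil

def backup_sqlite(db_path):
    if os.path.exists(db_path):
        shutil.copy2(db_path, db_path + ".bak")  # keep a restorable copy

def restore_sqlite(db_path):
    if os.path.exists(db_path + ".bak"):
        shutil.move(db_path + ".bak", db_path)  # put the old data back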

dafunkydan commented 1 month ago

Thank you so much for getting into it again and giving it another chance! Highly appreciate it! However, I have the fear:

if a date arises in either calendar.txt or calendar_dates.txt that is newer than today, then 'abort'

that this would prevent my static data from being used. Having a look at my calendar.txt for the "future" statics, it shows:

"start_date","end_date","monday","tuesday","wednesday","thursday","friday","saturday","sunday","service_id"
"20240517","20240531","1","1","1","1","1","0","0","138-142-143-144-145-148-149-150-152"
........
"20240527","20240527","1","0","0","0","0","0","0","148"

There are some connections valid from today on, but some only in 10 days (and they normally update the whole static file once a week). So that last line in calendar.txt would prevent the zip from being used, and then another new zip will arrive in 7 days.

I am really sorry, but do you see any chance / would you consider it practical not to check for the "latest" line being in the future, but to accept the static file as soon as one line is valid >= today?

Again, thank you so much for all your efforts and Time, highly appreciate it!

vingerha commented 1 month ago

Badly formulated... if in either file the youngest (!) date is newer than today, then abort

dafunkydan commented 1 month ago

Awesome! Younger, older, newer - probably a question of point of view 😉 But I think I got it. Thrilled about that!

vingerha commented 1 month ago

So... this was a lot of work vs. gain (for me, 0 gain)... asking you to test this before I deploy it.

Attached is the only file that was changed; you need to remove the .txt of course:

gtfs_helper.py.txt

dafunkydan commented 1 month ago

Wicked! Currently I am at work; for sure I will try that asap! I think for a real-life stress test we'll have to wait until the next static dataset with future-only data gets published.

Am excited, thank you so much for everything! And, of course, always let me know if there are things to test before they are deployed 😎

vingerha commented 1 month ago

My unit tests were performed with one file where I replaced the calendar.txt and calendar_dates.txt to trigger it.... which brought up one issue after another.... this has cost me about 4 hours in total.

dafunkydan commented 1 month ago

Oh no.... Now I feel a bit bad. I'm really thrilled to check it out! And I am sure validity checks are a good thing. Thank you so much for all your efforts!

vingerha commented 1 month ago

You should feel bad :) but... well... my 'prize' is to have learned a bit more. At my going rate this would have been a very nice dinner-for-2 in a 3-Michelin-star restaurant... but I am not paid for being a dev (then it would be the snack bar round the corner).

dafunkydan commented 1 month ago

I am so sorry, but:

2024-05-17 19:17:28.449 DEBUG (SyncWorker_24) [custom_components.gtfs2] Updating GTFS with: {'extract_from': 'url', 'url': 'https://rnv-dds-prod-gtfs.azurewebsites.net/latest/gtfs.zip', 'clean_feed_info': False, 'file': 'GTFS-Static-RNV1'}
2024-05-17 19:17:28.449 DEBUG (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] Getting gtfs with data: {'extract_from': 'url', 'url': 'https://rnv-dds-prod-gtfs.azurewebsites.net/latest/gtfs.zip', 'clean_feed_info': False, 'file': 'GTFS-Static-RNV1'}
2024-05-17 19:17:28.450 DEBUG (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] Checking if extracting: GTFS-Static-RNV1
2024-05-17 19:17:28.451 INFO (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] Removing datasource: /config/gtfs2/GTFS-Static-RNV1.*
2024-05-17 19:17:30.001 DEBUG (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] Checking if file contains only future data: GTFS-Static-RNV1.zip 
2024-05-17 19:17:30.006 ERROR (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] Error getting earliest dates from zip, continuing with extract, error: time data '"20240517"' does not match format '%Y%m%d'
2024-05-17 19:17:30.006 DEBUG (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] Removing/restoring sqlite after error
2024-05-17 19:17:30.009 INFO (SyncWorker_24) [custom_components.gtfs2.gtfs_helper] New file contains only dates in the future, extracting terminated

There seems to be a problem with parsing the date from calendar.txt. The behaviour of not continuing in case of an error is probably a good one, but hopefully that won't produce errors somewhere else...? Anyway, in my case calendar.txt has the format:

"start_date","end_date","monday","tuesday","wednesday","thursday","friday","saturday","sunday","service_id"
"20240517","20240531","1","1","1","1","1","0","0","138-142-143-144-145-148-149-150-152"

I tried a lucky punch by altering line 459 (it was the only line I found containing %Y%m%d) 😆

min_date = datetime.datetime.strptime(min(check_list),"%YYYY%mm%dd")

But the result is the same, unfortunately.... You know, if this develops into a rabbit hole, I will find some duct tape to work around it!
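
(For what it's worth, the error message hints that the surrounding quotes are the problem rather than the format string, so a hedged guess at a fix would be to strip them before parsing instead of altering %Y%m%d:)

# Hypothetical illustration: '"20240517"' fails strptime, '20240517' parses fine.
import datetime

raw = '"20240517"'
min_date = datetime.datetime.strptime(raw.replace('"', '').strip(), "%Y%m%d")
print(min_date)  # 2024-05-17 00:00:00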