vingerha / gtfs2

Support GTFS in Home Assistant GUI-only
https://github.com/vingerha/gtfs2
MIT License

[FEATURE]: Use a remote Database, using a "connection string" like in HA #70

Closed FabienD74 closed 1 month ago

FabienD74 commented 1 month ago

Hi, I'm using HAOS on a virtual machine (Proxmox).

I already moved the Home Assistant DB to another VM running PostgreSQL:

=> I can restore HAOS without restoring the DB.
=> I can have different schedules for HAOS and DB.
=> I can keep HAOS "small" (32Gb) so it is fast to backup/restore.
=> I guess it should outperform SQLite (dedicated memory, native multi-core operations, backup, online DB vacuum...).

I would like to externalize the GTFS DB, like I did for HA, using a "DB connection string"... for the same reasons :-) This would point to another PostgreSQL DB... to keep things separated/isolated.

Not sure if it is "easy"... but it would be great for flexibility/management...

Regards Fabien

vingerha commented 1 month ago

Just to try and properly understand the request... you would like to store the gtfs-extracted data (from the gtfs zip) in a database other than sqlite, i.e. a flexible choice of db. Additional info: at this very moment I am using a library (pygtfs) that extracts the zip into a sqlite db, and there is no option to change the db in pygtfs. I am not in any way planning to take over and maintain that library just-for-this... as you may understand :). So any additional work would be to migrate the sqlite output from pygtfs to something else.
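
(For illustration, a minimal sketch of the pygtfs flow described above; "feed.zip" and "gtfs.sqlite" are placeholder names, not gtfs2's actual paths or code.)

    # Rough sketch of the extraction step: pygtfs unpacks a GTFS zip into sqlite.
    # File names are placeholders; gtfs2's real paths and options differ.
    import pygtfs

    schedule = pygtfs.Schedule("gtfs.sqlite")   # creates/opens the sqlite file
    pygtfs.append_feed(schedule, "feed.zip")    # loads the zip into that db

    print(schedule.routes[:5])                  # the extracted data is now queryable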

Additional comments on your points below... but DO let me know if I miss a point, so I understand the need better.

=> I can restore HAOS without restoring the DB.

I understand, but... restore should be a rare occasion... this is my greatest 'I challenge' point: the compacted size of the sqlite is 'nothing', and the gtfs data changes over time (I need to re-download every month for 2 of mine), so a backup is imo not even needed. Re-unpacking a new source costs at most (largest ones to date) 2.5 hrs.

=> I can have different schedules for HAOS and DB

I do not understand; a schedule for backup?

=> I can keep HAOS "small" (32Gb) so it is fast to backup/restore.

See above, I do not back up any of my gtfs dbs at present; I only have four gtfs dbs right now.

=> I guess it should outperform SQLite ( dedicated memory, native multi-core operations, backup, online DB vacuum...)

I have no data to confirm/reject this... I have had no issues myself, but that is personal, and I have not yet received issues on this topic... static data is usually not performance-limited.

FabienD74 commented 1 month ago

Hi, thanks for your reply. It was more a question than a feature request, to see feasibility/complexity...

Yes, I would like to get rid of SQLite entirely if possible... ;-) ... I don't trust SQLite...

Thanks anyway.

Have a good evening. Fab

vingerha commented 1 month ago

I do get the idea... don't get me wrong, but I do not see a use-case here, as the data can be recovered from the external source at any point in time. As for the possible performance improvement, I am not sure that applies here... to be proven by others, e.g. you?

Will keep it open for a while but need incentives :)

vingerha commented 1 month ago

I now analysed this a bit further. As gtfs2 leans on pygtfs, which only writes to sqlite, the only solution would be to rewrite pygtfs and take on its maintenance. That would make the 'new-pygtfs' proprietary for this purpose... the to-be-invested effort does not weigh up against any presently known benefit. So, closing this issue until I receive a proper business/use-case.

vingerha commented 1 month ago

Adding further findings... when changing the DB, the query strings also need to change, as they are written for sqlite...

FabienD74 commented 1 month ago

Yes, indeed. I'm currently on it. It requires a dashboard to maintain.

I'm currently on the dashboard.
I have working connections to sqlite and postgres.... a lot of work in the "pipe" :-)

Databases must be created manually by a "db administrator"... ;-) ;-) I use the GUI of postgres, done in 2 minutes.

Examples:
- for sqlite: "sqlite:///custom_components/gtfs2_pg/master.db"
- for postgres: "postgresql://gtfs:gtfs@vm-hadb-tst.home/gtfs"
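
(For context, a minimal sketch to sanity-check such connection strings with SQLAlchemy; this is not part of gtfs2, and the postgres URL additionally assumes a driver such as psycopg2 is installed.)

    # Quick sanity check of the connection strings quoted above.
    from sqlalchemy import create_engine, text

    for url in (
        "sqlite:///custom_components/gtfs2_pg/master.db",
        "postgresql://gtfs:gtfs@vm-hadb-tst.home/gtfs",
    ):
        engine = create_engine(url)
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))   # minimal statement that works on both dialects
        print("OK:", url)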

vingerha commented 1 month ago

So...how are you planning to migrate the sqlite to another solution then? and still: why ??? This is not urgent/data-risky/performance-related.... seems just more work to build and maintain. But by all means, if you want to build something then fine of course .... but if it also needs to end up in this repo, then it has to be 100% 'next' to the current solution as I am not willing to spend time on testing/reviewing .... so you'd need to copy all functions / py-files

vingerha commented 1 month ago

Or maybe...this could help too https://pygtfs.readthedocs.io/en/latest/gtfs2db.html

FabienD74 commented 1 month ago

I hope to be able to run all existing SQL statements... (not yet tested). And I still don't know how to migrate / import data... :-))) Maybe via your link. But many solutions transform the tables/db during import (always for good reasons...)... and the only reference I have is via pygtfs... at least it works for me :-)

Some solutions have not been maintained for years, and the GTFS specification was changed a few months ago... PS: I have seen some descriptions of how to cope with routes doing a "loop" or "lasso"... ;-) ;-)

FabienD74 commented 1 month ago

Maybe it's already supported by pygtfs, as we can provide a "sqlalchemy" connection... this is what I'm using ;-) ;-) so maybe I just need to tweak the statement: pygtfs.Schedule(...)

...
sched = pygtfs.Schedule(":memory:")
This will create an in-memory sqlite database. Instead you can supply a filename to be used for sqlite (such as 'gtfs.sqlite'), or a sqlalchemy database connection.
...
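
(If the docs quoted above hold, the tweak might be as small as the sketch below; untested here, and the postgres URL plus "feed.zip" are just the examples/placeholders from this thread.)

    # Untested sketch: hand pygtfs a SQLAlchemy URL instead of a sqlite filename.
    import pygtfs

    sched = pygtfs.Schedule("postgresql://gtfs:gtfs@vm-hadb-tst.home/gtfs")
    pygtfs.append_feed(sched, "feed.zip")   # placeholder GTFS zip
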
vingerha commented 1 month ago

The memory option is not really feasible I believe; you have to think of rpi machines with low resources too... and my largest db is already 1.8Gb in sqlite form... Finland is 9Gb.

vingerha commented 1 month ago

And I will be repeating this a lot: what is the gain? I still do not see it :)

vingerha commented 1 month ago

I would be more interested, via that link, to see if it would allow combining sources into one db (schedule)... first time I saw that one as part of pygtfs.

FabienD74 commented 1 month ago

For me,

Yesterday I saw a bus departure that was not listed on the official website. I cannot fix it!

vingerha commented 1 month ago
  1. You do not need ESPHome stuff in your HA config, I have this in a separate folder... this indeed is massive when needed for building
  2. I agree sqlite sucks in many ways too but well.. this is not really a critical db
  3. not sure what that would be, I stopped using stuff outside of HA, had grafana too but it only looked nice, I never used it for a real purpose.
  4. And on the missing stop...join the club, complained 2 yrs ago, no action from Zou

On the gtfs2db... it seems to work fine... I just merged a zipfile into an existing sqlite (lol), but this was a simple test to see if it is technically fine. I will try to spend a bit more time on that soon, as this would really help me (!... so I have an incentive). If you have an option to create multiple schedules/dbs, then the only thing that requires copying are the queries and their parametrised input, or a function per db-type, which would make it easier to tune in case of issues with one db-type. Since HA people often use mysql, maybe an initial implementation could be that?

vingerha commented 1 month ago

Already found the first challenges, using myself as a use-case:

  1. My 4 sources have different update frequencies, at a minimum I would then need to download all 4 each week and combine them even if 3 of them have a monthly refresh
  2. when anything goes wrong, the whole thing has to restart, so more code to keep/delete/update/check and at the moment it is already challenging
  3. a combined file/db would be huge, so not sure about performance... also noting that the sources from CH (1.7Gb) and NL (1.8Gb) I do not need often; the combined db-file would be 10-20 times bigger than the db which is used daily

The benefits are however good: I would only have one setup for 'local stops' instead of 4, which is particularly nice when I travel.

So, my suggestion (do comment where you want) is to start with some functionality, aside from all existing, that will allow to ...

The downside is still that this is quite a challenge to automate end-to-end, with added complexity on a periodical basis. So I think to start this via a service-call setup and then let HA do the auto-stuff.
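
(For illustration only, a rough sketch of what such a service-call entry point could look like in a HA custom component; the service name and handler are made up and not gtfs2's actual API.)

    # Illustrative only: a made-up "combine_sources" service, so an HA automation
    # can trigger the download/combine step on whatever schedule the user wants.
    from homeassistant.core import HomeAssistant, ServiceCall

    DOMAIN = "gtfs2"  # assumption: reusing the integration's domain

    async def async_setup_services(hass: HomeAssistant) -> None:
        async def _handle_combine(call: ServiceCall) -> None:
            sources = call.data.get("sources", [])  # e.g. a list of feed URLs
            # download each source, unpack via pygtfs, then merge (omitted here)
            ...

        hass.services.async_register(DOMAIN, "combine_sources", _handle_combine)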

vingerha commented 1 month ago

Reading the above, I am not sure how the hell I would explain the config to a user LOL. But the vast majority will only need 1 source, I assume.

FabienD74 commented 1 month ago

AMAZING, IT WORKS OUT OF THE BOX !!!!!!!!!!!!!!!

It's currently uploading... the structure is there (see attached screenshot).

Let's wait and see the result... I see the HA CPU burning at 24% (dunno why it's not using all 4 CPUs). The VM running my DB went from 0.5% CPU to 4%.

I can't wait.... ;-)

vingerha commented 1 month ago

Looks good but (sorry to dampen the spirit) look at the queries I have, they contain sqlite-specific stuff and I know mysql also has its oddities vs. postgres. I am not sure if it exists but it would be nice to have a db-independent layer on top which would take care of them
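
(To illustrate the kind of differences meant here: these are examples only, not the actual gtfs2 queries. One option for a thin db-independent layer is to pick the SQL snippet per dialect.)

    # Examples only, not the actual gtfs2 queries: the same "current weekday"
    # expression spelled per dialect, selected via the engine's dialect name.
    from sqlalchemy import create_engine, text

    WEEKDAY_SQL = {
        "sqlite": "strftime('%w','now')",                # returns '0'-'6', 0 = Sunday
        "postgresql": "EXTRACT(DOW FROM CURRENT_DATE)",  # returns 0-6, 0 = Sunday
    }

    def weekday_clause(engine) -> str:
        # engine.dialect.name is e.g. 'sqlite', 'postgresql', 'mysql'
        return WEEKDAY_SQL[engine.dialect.name]

    engine = create_engine("sqlite:///gtfs.sqlite")      # placeholder db
    with engine.connect() as conn:
        dow = conn.execute(text("SELECT " + weekday_clause(engine))).scalar()
        print("weekday:", dow)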

vingerha commented 1 month ago

correction.. just a few are specific

vingerha commented 1 month ago

Please also give the process a bit of thought. Should people be choosing a db-type with each source? Or is it one db-type, and then add sources? Would each source end up in a separate db? As per my other topic, there may be a good reason to not put all-in-1. And deleting things: for sqlite it is just delete-the-file, for mysql it is not that simple, etc.

vingerha commented 1 month ago

You maybe get the gist of why I am careful; I spent more time on the data administration than on the content.

vingerha commented 1 month ago

If you can make the admin stuff work, I am OK to help with the data part, queries and all.

FabienD74 commented 1 month ago

Well, currently I'm waiting. We can have multiple feeds (+/- one per "zip") inside the same DB... the feed field is part of the primary key on all tables, and there is a table named feed as well.

I'm waiting for the DB import... it looks very slow, if it does anything at all... It is inserting "stops"... but the table remains empty... (big SQL commit at the end????) Wait and see.
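
(For illustration, assuming the import was done through pygtfs as above: the feeds that ended up in the shared database can be inspected through the same Schedule object; the connection string is the earlier example.)

    # Illustration only: inspect which feeds ended up in the shared database.
    import pygtfs

    sched = pygtfs.Schedule("postgresql://gtfs:gtfs@vm-hadb-tst.home/gtfs")
    for feed in sched.feeds:       # one entry per imported zip
        print(feed.feed_name)      # assumes pygtfs's Feed entity exposes feed_name
    print("stops in total:", len(sched.stops))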

vingerha commented 1 month ago

Would be a hoot if sqlite proves to be 'fast' :)

FabienD74 commented 1 month ago

+/- the same speed to upload... I have all tables uploaded. All tables contain the feed... which is both a good and a bad thing. That's great but unexpected ;-) ;-)

vingerha commented 1 month ago

... To have chatty discussions... again: try to ping me on Discord, this thread is now getting very long (not a real problem though). And yes, one can in fact copy the whole library locally into gtfs2 (I already have bits from others), but that also means more maintenance, and you do not want to know how much time I spend on people for whom the current version works 97% and I again need to tweak queries due to their slightly off-normal (?) datasource. The result is/was indeed that I am now adding indexes and removing files myself. The lib-owner is not negative towards my requests and finds it OK, but nothing much is happening. So, if (!) you want to change it, then maybe you should make a 'fork' out of it and then I am OK to embed that in gtfs2.

On my own night of revelations: I am less OK now with the combined datasources, as I cannot find an easy solution to also combine the real-time bits. I did migrate everything to json so I could combine them into one json, but it is also the timing of things, as a process needs to run and extract all (in my case) 4 sources, then combine them, and at the same time not obstruct the collection by the entities.
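
(A purely illustrative sketch of the "combine the json" step mentioned here; file names and structure are assumptions, the real layout depends on the realtime extraction.)

    # Purely illustrative: merge several per-source realtime JSON dumps into one
    # combined file before the entities read it. File names/layout are assumptions.
    import json
    from pathlib import Path

    SOURCES = ["rt_source1.json", "rt_source2.json", "rt_source3.json", "rt_source4.json"]

    combined = {}
    for name in SOURCES:
        path = Path(name)
        if path.exists():                 # skip a source that is mid-refresh
            combined[path.stem] = json.loads(path.read_text())

    Path("rt_combined.json").write_text(json.dumps(combined))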