svenseeberg opened this issue 10 months ago
I think a lot of this can be parallelised. The creation of users can run in parallel, as can rooms and non-threaded, non-status events (messages; other event types might cause problems).
I haven't used a multi-process library in quite some time, but even a bit more (though still limited) async within a single process might change a lot, as I suspect the HTTP requests to Synapse are the main bottleneck.
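A minimal sketch of what that limited in-process concurrency could look like for user creation, assuming a hypothetical createMatrixUser() helper and a made-up batch size:

```typescript
// Minimal sketch of "a bit more async in a single process": create users in
// small batches so that several HTTP requests to Synapse are in flight at
// once, while the batches themselves stay sequential. createMatrixUser() and
// BATCH_SIZE are illustrative assumptions, not the project's actual API.
const BATCH_SIZE = 10

async function createUsersInBatches(usernames: string[]): Promise<void> {
  for (let i = 0; i < usernames.length; i += BATCH_SIZE) {
    const batch = usernames.slice(i, i + BATCH_SIZE)
    // All requests within this batch run concurrently
    await Promise.all(batch.map((username) => createMatrixUser(username)))
  }
}

// Placeholder for the real registration call against Synapse (assumption)
async function createMatrixUser(username: string): Promise<void> {
  // ... register the user via the Synapse admin or application service API ...
}
```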
Do you have any test data (maybe a dump of a test server that has been used a lot for testing) which I can use to test the performance? Or maybe some tips on how to write a script to generate some test data?
I am not a programmer, so take my opinion with a grain of salt :)
Maybe I am overlooking something, but shouldn't it be possible to just split up the user data when exporting the DB into roughly equal parts (~1/n) and then run the script with the magic of GNU parallel (https://www.gnu.org/software/parallel/)?
Do you have any test data (maybe a dump of a test server that has been used a lot for testing) which I can use to test the performance?
As Rocket.Chat wants to force me to do things I don't want when setting up a test instance, I don't have one and rely on data from a real instance. I'm currently working on some test data, but it wouldn't be enough to test performance.
split up the userdata when exporting the DB in roughly equal parts (~1/n) and then just run the script with the magic of GNU parallels
It's not that easy because of some cross-references, unfortunately. We need to have all users and rooms before creating the messages, so we'd need to wait for each step in each process before proceeding (something like a semaphore). Messages in threads are skipped if their root isn't mapped yet, so they need to be handled in sequence or the data split accordingly.
Finally, and most importantly, we don't know yet where the bottleneck is. It would probably reduce the overall runtime if entities of the same type were handled more concurrently, but maybe the HTTP requests to Synapse are the slow part, which would render such a change useless. And we're using a SQLite DB to save the mappings; I don't know how well it handles parallel reads and writes.
We started working on a script that generates test data: https://git.fairkom.net/hosting/chat/rocketchatmatrixmigration/
We don't yet generate a proper random timestamp, but that would be an important next step.
We tried testing this script and tried to understand what the bottleneck might be, but for now we have more questions than answers. We will continue trying to figure this out, but wanted to share this work-in-progress script already.
Hi everyone, as said before, the migration's performance is a big issue. I also think there are two bottlenecks: one on this tool's side and the other on the Synapse side.
On the Synapse side I added some workers (stream_writers) to help the main process handle events: https://matrix-org.github.io/synapse/latest/workers.html#worker-configuration
On this tool's side I divided the messages into batches to handle them in parallel. I don't mix messages from one room into different batches, in order to keep their order. My workaround is: first handle users and rooms, then divide the messages into batches and handle them in parallel. I found 2 ways:
I had never worked with npm before this project, so I don't know whether it has an impact on performance. It could be easier to handle the parallelization with a Go program.
Sounds great! Could we help you test this in any way?
Great stuff I'm reading here!
@chagai95 That script really looks helpful for testing performance. For edge cases we can use the manually added test data (I mentioned commit f99771b2803ccc49dae3c28e280fadee96adb212, which adds it, somewhere), so you don't need to worry about that in the script.
Have you considered using Faker to generate random data that resembles real content, instead of random strings? It provides a lot of providers for generating names, texts, dates, and much more. It can even conveniently wrap everything in JSON.
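A hedged sketch of what that could look like with @faker-js/faker; the field layout only loosely mimics a Rocket.Chat export and is an assumption, not the exact schema the script expects:

```typescript
// Hedged sketch of generating realistic-looking test records with
// @faker-js/faker instead of random strings. The field names are assumptions
// roughly modelled on a Rocket.Chat export.
import { faker } from '@faker-js/faker'

const user = {
  _id: faker.string.alphanumeric(17),
  username: faker.internet.userName(),
  name: faker.person.fullName(),
}

const message = {
  _id: faker.string.alphanumeric(17),
  rid: faker.string.alphanumeric(17), // room id
  msg: faker.lorem.sentence(),
  ts: { $date: faker.date.past({ years: 2 }).toISOString() }, // random timestamp
  u: { _id: user._id, username: user.username },
}

console.log(JSON.stringify(message, null, 2))
```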
@grvn-ht, you said:
first handle users and rooms then divide messages in batch and treat them in parallel
I think that's a sane approach; the number of messages should be orders of magnitude larger than everything else (my assumption for instances so big that they need higher performance).
I could help with some jq magic to split the messages, but grep could do the trick as well. And I think I need to implement the parts that read configurable files, anyhow.
The question I haven't spent much time on yet is whether there is a convenient and performant way of parallelising within the app that would make it unnecessary to adapt the DB. The storage adapter/TypeORM would nonetheless allow us to use another DB for parallelised access.
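Purely as an assumption, a rough sketch of pointing the TypeORM storage layer at PostgreSQL instead of the SQLite file for better parallel access (connection details and entity path are made up):

```typescript
// Hedged sketch: point the TypeORM storage layer at PostgreSQL instead of the
// SQLite file when several workers need to read/write the mappings at once.
// Connection details and the entity glob are assumptions for illustration.
import { DataSource } from 'typeorm'

export const AppDataSource = new DataSource({
  type: 'postgres',
  host: 'localhost',
  port: 5432,
  username: 'migration',
  password: 'secret',
  database: 'rc2matrix',
  entities: ['dist/entity/*.js'],
  synchronize: true, // acceptable for a one-off migration tool
})

// Usage: await AppDataSource.initialize() before running the migration steps.
```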
Out of curiosity, I ask you all naively: why do you need a significantly lower execution time for such a one-off migration? Can you name any numbers yet? Are you intending to migrate different chats regularly? What's the story?
Why do we need some performance? We have to migrate at least two large RC instances with 10k+ users each. We simply need to avoid a downtime of weeks; a maximum of maybe 3 days (a weekend in summer) is acceptable for running the migration script. Unless we find some mechanism like rsync that allows re-syncing only what has changed. If that were possible, we could stop registering new users in RC, start syncing, and after a week or however long it takes, fetch only the newest messages again to have them all in Matrix.
Out of curiosity, I ask you all naively: why do you need a significantly lower execution time for such a one-off migration? Can you name any numbers yet? Are you intending to migrate different chats regularly? What's the story?
You're right, I only need to do it twice. One is a small RC server: 500 users, 3,000 rooms, 120,000 messages. For this one I could use this project as is. The second one, on the other hand, is much bigger: 1,000 users, 5,000 rooms, 4,500,000 messages. Tell me if you found other values, but from my tests I remember being able to process about 4 messages per second on one instance. For this server that would lead to roughly 12 days of migration. With parallelization I could do it in ~2 days.
Like @rasos, I would like to be able to run the migration script over a weekend to avoid synchronization issues between RC and Matrix. But it's true that I could also do the main migration and then migrate another time with fresh data on the same db.sqlite.
But I found it interesting to try to improve performance ^^
4 messages per second is pretty slow indeed, I understand the need.
My current approach is to let the script run multiple times (on the same DB, with different inputs). This works mostly fine as far as I can tell, but it doesn't detect any changes to already processed (and thus mapped) entities. Thus I'm not entirely happy with this solution, which is more like a crash-resistant design.
I experimented a bit with handling multiple rooms and messages concurrently, but I can't check the results for correctness for lack of tests. I suspect it misses most threaded messages. Anyhow, the runtime is significantly reduced and varies depending on the number of parallel entities. If anyone wants to have a look: https://git.verdigado.com/NB-Public/rocketchat2matrix/pulls/101
Now I've implemented concurrency with the aforementioned PR (commit b48400af94dcf28c9161c59def38c803604bb088) after testing it with end-to-end tests and fixing some bugs.
I would really appreciate your feedback on it, if you could test it. I didn't see any problems testing with our database, but maybe I missed some points ;-)
Hi, I ran the updated app. Rooms are working well, but when it tries to import messages, I get an error. The same happens with the limit set to 1. I have a very big data set, ~5 GB of messages in the JSON file :)
Oh my, of course there has to be a problem with my simple approach. Thank you for reporting this!
So, I interpret this as the queue getting too long. My first thought would be to read the file incrementally and only enqueue further handling jobs depending on how full the queue is.
Or maybe another approach which uses more CPU cores :shrug:
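A rough sketch of the first idea: stream the export file and only keep a bounded number of handling jobs in flight. It assumes one JSON object per line and a hypothetical handleMessage() helper:

```typescript
// Hedged sketch of back-pressure: stream the export file and cap the number of
// in-flight handling jobs instead of enqueueing everything up front.
// MAX_IN_FLIGHT and handleMessage() are assumptions for illustration.
import { createReadStream } from 'node:fs'
import { createInterface } from 'node:readline'

const MAX_IN_FLIGHT = 50

async function importMessages(path: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(path) })
  const inFlight = new Set<Promise<void>>()

  for await (const line of lines) {
    if (!line.trim()) continue
    const message = JSON.parse(line)
    const job = handleMessage(message).finally(() => inFlight.delete(job))
    inFlight.add(job)
    if (inFlight.size >= MAX_IN_FLIGHT) {
      await Promise.race(inFlight) // wait until at least one job finishes
    }
  }
  await Promise.all(inFlight) // drain the remaining jobs
}

// Placeholder for the real per-message handling (assumption)
async function handleMessage(message: unknown): Promise<void> {
  // ... map the message and send it to Synapse ...
}
```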
As raised in https://github.com/verdigado/rocketchat2matrix/issues/9#issuecomment-2247352373, concurrency for messages messes up their ordering, as explained in the WARNING of https://spec.matrix.org/v1.11/application-service-api/#timestamp-massaging. I don't yet know how to implement concurrency across different rooms while keeping the messages within a room/thread sequential.
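For context, a small sketch of the timestamp massaging the spec warns about: an application service can pass the original timestamp via the ts query parameter, but within a room the events still have to be sent in their original order. The homeserver URL, token, and txnId handling here are assumptions:

```typescript
// Hedged sketch of timestamp massaging: the application service passes the
// original Rocket.Chat timestamp via the `ts` query parameter, yet the events
// of one room still need to be sent in order to keep their stream ordering.
// HOMESERVER_URL, AS_TOKEN and the txnId scheme are assumptions.
const HOMESERVER_URL = 'http://localhost:8008'
const AS_TOKEN = 'changeme'

async function sendTextMessage(
  roomId: string,
  body: string,
  originalTs: number,
  txnId: string
): Promise<void> {
  const url =
    `${HOMESERVER_URL}/_matrix/client/v3/rooms/${encodeURIComponent(roomId)}` +
    `/send/m.room.message/${txnId}?ts=${originalTs}`
  const response = await fetch(url, {
    method: 'PUT',
    headers: {
      Authorization: `Bearer ${AS_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ msgtype: 'm.text', body }),
  })
  if (!response.ok) {
    throw new Error(`Sending failed with status ${response.status}`)
  }
}
```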
Problems with the message order are also raised in #22. Thanks for mentioning the warning.
For now I reverted the concurrency in 698062c7f15e3786320e20066bbf0aee05f8945b to allow a functioning migration.
I don't know how it would be possible to implement concurrency of different rooms while removing concurrency for messages within a room/thread.
I would also create a queue for each room, handling these messages sequentially and handling the rooms in parallel, as you mention. Maybe with a different library like Promise Pool.
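A sketch of that per-room queue idea with @supercharge/promise-pool; the message shape, the concurrency limit, and the handleMessage() helper are assumptions:

```typescript
// Hedged sketch: group messages by room, run rooms concurrently, but keep the
// messages inside each room strictly sequential. The RcMessage shape, the
// concurrency limit and handleMessage() are assumptions for illustration.
import { PromisePool } from '@supercharge/promise-pool'

type RcMessage = { _id: string; rid: string; ts: string; msg: string }

async function migrateMessages(messages: RcMessage[]): Promise<void> {
  // Build one ordered queue per room
  const byRoom = new Map<string, RcMessage[]>()
  for (const message of messages) {
    const queue = byRoom.get(message.rid) ?? []
    queue.push(message)
    byRoom.set(message.rid, queue)
  }

  // Rooms run in parallel, messages within a room stay in order
  await PromisePool.withConcurrency(8)
    .for([...byRoom.values()])
    .process(async (roomMessages) => {
      for (const message of roomMessages) {
        await handleMessage(message)
      }
    })
}

// Placeholder for the real per-message handling (assumption)
async function handleMessage(message: RcMessage): Promise<void> {
  // ... create the corresponding Matrix event ...
}
```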
Problems with the message order are also raised in #22. Thanks for mentioning the warning.
Indeed, I missed that!
I would also create a queue for each room, handling these messages sequentially and handling the rooms in parallel, as you mention. Maybe with a different library like Promise Pool.
I can try to tackle that tomorrow, but I'm no typescript expert!
I implemented the sequential-messages-per-room concurrency in #31. I generated RC data with 100 rooms, each holding 20 messages, to compare runs with and without concurrency. I got 2:46 for a conversion without concurrency and 1:46 with concurrency. This obviously needs more testing, but I'm not sure we can improve performance much further, as this mostly helps when there are many rooms.
Currently, the migration into the Matrix server runs as a single process because the data needs to be imported in chronological order. We should investigate how performance can be increased with multi-threading or multi-processing.