TheShadowOfHassen opened 1 year ago
I have already implemented a solution for this, honestly. We compress files into an archive, similar to current Manuskript, but when you use it, Manuskript extracts it into a temporary folder.
Temporary folders are mapped into memory on quite a few Linux systems, which makes this faster. Also, because the folder uses the same structure as a project stored as a directory, we don't need additional branching in the code. Additionally, it will work with Git, since it is actually a folder, even though it's temporary.
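The extraction step described above can be sketched with the standard library alone. This is an illustrative sketch, not Manuskript's actual code; the function name and prefix are assumptions:

```python
import tempfile
import zipfile

def open_single_file_project(msk_path):
    """Extract a zipped .msk project into a temporary directory."""
    tmp = tempfile.TemporaryDirectory(prefix="manuskript-")
    with zipfile.ZipFile(msk_path) as archive:
        archive.extractall(tmp.name)
    # Keep a reference to `tmp`: the directory is deleted once it is
    # cleaned up or garbage-collected, matching the "temporary" behaviour.
    return tmp
```

Because the extracted tree has the same layout as a project stored as a directory, the rest of the code can treat both modes identically.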
The only thing missing is the Git integration. But that isn't a priority at the moment.
I didn't know that. That would work. Do you have the folder compression part already written? I was toying with a lot of these ideas in a project of my own, and I discarded this exact idea because I didn't think I could figure out how to compress everything.
The option to save a project in the file system, and not as a single file, should always remain available!
My biggest project today has about 23 MB in the Manuskript folder and about 3700 scene/MD files!
I would like the possibility to make zipped backups out of Manuskript, but I don't think it's a good idea for something of this size to be handled as a monolithic file!
By the way, this project is still growing. I publish the story weekly, chapter by chapter, on an online platform! More than 200 chapters are published!
Honestly, I think we should leave backups to the user, because the zipped .msk is basically a file format like .odt or .doc. We could add a reminder, however.
Just curious, what is this project you're working on? (How did you get it so big?)
> Just curious, what is this project you're working on? (How did you get it so big?)
It's a long-term project, which was planned from the start more as a universe.
Style-wise, you can see it as an adult light novel (Japanese entertainment literature). I publish it on Patreon and have a little following there that likes this monster ;)
OK, cool. Are there any problems caused by the fact that it's so big? (Things that could be added in the GTK port to make your life easier?)
two things:
@obw I would be interested in how the GTK port already performs on your project, to be honest. Maybe you can check out the current branch and try to open it. You can adjust the path in test_io.py (here) to open your project instead of the sample.
It's not fully optimized yet, since I haven't made changes to utilize multi-threading or load-on-demand, but on my own projects it already had quite an impact.
@TheJackiMonster, I will try it later; I'm busy at the moment with some other things. I've got roughly 200 pages back from my editor, and now I'm using a graphical diff tool to merge the corrections. This will take some time ;). It's not the monster, but another universe of mine, a rather short thing, roughly 3k pages, main story and additional stories.
For this project I use DokuWiki as the storage engine. The writing happens partly in Vim or LibreOffice, also as Markdown!
I think I will test it tomorrow, after I have made a copy for the test!
Something about performance: when I start the 0.12 version over SSH on a different machine, over the network, it's a little bit faster:
Project loading time
Both with a cold disk cache!
With a prefilled cache, roughly 30% faster!
But the overall performance (lag) is better when running on the same machine!
I think it's all the stuff I use in parallel on the desktop, like an MP3 player, LangTool and so on! When I go over SSH, I have no KDE running on the desktop machine!
I had some time! And WOW, that was fast!
Not only was the loading of the project fast (9 seconds), the lag I normally face when switching between chapters was nearly nonexistent!
I will not use this version at the moment, because I do not know how stable it is. I'm mostly thinking about loss of data!
Good work, and thanks!
@obw Thanks for the feedback. Data loss shouldn't be an issue with the GTK port. But it's far from feature parity yet. Also, wiring up changes from the UI to the data requires more work than in the Qt version, but it comes with the advantage of lower overhead.
Like I said, it's not fully optimized yet. Currently it loads the full project at once instead of loading chapters/scenes on demand, which is the goal for the future. So that explains why there is no lag when switching chapters (everything is already in memory and ready to use). But I think it's a good start, cutting loading times in half. ^^
By the way, I've now generated a custom project filled with lorem ipsum text files for testing. It's close to 1 GB big, with about 50k different files in the outline. It seems IO is actually not the only limiting factor, and I will need to test whether multi-threading brings huge improvements. Currently I assume it would make the most sense to adjust all data structures to allow async operations for loading and saving. But I'm not sure whether that will be easily compatible with the UI. I think GTK3 is designed to run single-threaded. So maybe such changes get delayed until porting to GTK4.
I think the better way here is to analyze where we could have the most impact with the least work.
From what I have read and experienced with Manuskript, I think, as always in development, once we find the right places to tweak, the result will be better than the needed work suggests!
My suggestion is to do some profiling to find the places where the code uses up the most time. Some places we know without this.
Also, I think the asynchronous operations should be implemented for some small parts.
The entire statistics part could be asynchronous. The magic is that the GUI must not know ;). Just a simple event thread which is told what to recalculate; when done, it triggers a redraw and goes to sleep until new work arises.
@obw I've now added caching for calculating word counts, as well as for accessing parent outline items. Also, calculating sums of different counts for folders in the outline tree should be a little faster. My 1GB test project loads in about 18 seconds now...
It also seems like, for most of the loading time, the application is actually processing word counts to calculate goal ratios in the outline view. So if that can be done asynchronously, as you suggested, the whole thing should be faster and more responsive.
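A word-count cache like the one described can be sketched with `functools.lru_cache`. This is a hedged illustration, not the actual implementation from the branch; the function names are assumptions, and a real cache would invalidate per scene on edit rather than keying on the text itself:

```python
from functools import lru_cache

# Memoize word counts so unchanged scene texts are never counted twice.
# The text itself serves as the cache key here, purely for illustration.
@lru_cache(maxsize=None)
def word_count(text):
    return len(text.split())

def folder_word_count(scene_texts):
    # Folder totals in the outline tree are just sums of cached scene counts.
    return sum(word_count(t) for t in scene_texts)
```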
Multi-processing and multi-threading do not help as much, because the typical Python implementation does not provide actual parallelism across multiple threads. So multi-processing would be the suggested option, but sharing memory between multiple processes causes a lot of overhead, which is not worth it in my opinion. I also tried making use of coroutines for the loading, so files could be processed out of order, but that doesn't improve anything either.
@TheJackiMonster thanks for the update; as I said, my Python knowledge is rudimentary at best. My preferred language is PHP ;).
18 seconds for a 1GB project is near perfect. Try to open a 500k-page text file with a word processor like MS Word or LibreOffice.
I wrote one project with LibreOffice in a single file; when I reached about 600 pages, it became unbearably slow! Loading took roughly a minute, and another one when you wanted to edit something in the middle!
It was one of the reasons why I started to use this tool!
P.S.: Is there an easy way to start a profiler for Python? I could run it during my normal editing session; perhaps we can find some parts where we lose time! P.P.S.: This is something some users should do, because every person has their own way of doing things!
> Is there an easy way to start a profiler for Python? I could run it during my normal editing session; perhaps we can find some parts where we lose time!
Technically, we could profile several methods during runtime. However, measuring and logging timings adds latency as well. So I'm unsure whether that should be added to a final release, even though it's useful for debugging.
But maybe we can add automatic test cases which profile performance in a way.
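To answer the profiler question concretely: Python ships one in the standard library. A minimal sketch of wrapping a single call with `cProfile` (the `profiled` wrapper name is illustrative, nothing Manuskript-specific is assumed):

```python
import cProfile
import pstats

def profiled(func, *args, **kwargs):
    """Run one call under cProfile and print the hottest functions."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    # Show the top 10 entries sorted by cumulative time spent.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
    return result
```

A whole session can also be profiled externally with `python -m cProfile -o session.prof <entry script>` and the dump inspected afterwards via `pstats`, which fits the "run it during my normal editing session" idea.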
> Technically, we could profile several methods during runtime. However, measuring and logging timings adds latency as well. So I'm unsure whether that should be added to a final release, even though it's useful for debugging.
I think we should use external tools for profiling the whole thing, so that we can find the really significant parts to optimize!
In my experience with my own projects, it's sometimes really strange where performance is lost. In one project I got a 40% improvement after I simply switched from a foreach loop to a for($i = 1; $i <= 10; $i++) loop, plus some simple improvements to the data structures. Without full profiling, I would never have found this opportunity!
> But maybe we can add automatic test cases which profile performance in a way.
@TheJackiMonster this would be good, but before that we must know where to look; that is my thinking behind this idea.
I've now added a function to do profiling during runtime. At least in the current state it should be fine, since the GTK port is still in development. Also, I've implemented the loading to complete in an idle task, updating the outline and editor view on completion.
The 1GB project now opens in less than 8 seconds. I think it will be very beneficial to abstract this method further, to make the UI more responsive to background tasks in general. The advantage with GTK is definitely that any function called in idle state will not cause any concurrency issues, because it's scheduled only on the main thread.
The profiling also shows that the initial loading of the project takes at most 2 seconds, while the UI creating the tree models for the outline data takes about 6 seconds. So the UI is actually the bottleneck at the moment. ^^'
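A runtime-profiling helper in the spirit described above could be as small as a timing decorator. This is a sketch, not the actual function from the branch:

```python
import functools
import time

def timed(func):
    """Log how long each call to `func` takes (illustrative only)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Print even if the wrapped call raises.
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.3f}s")
    return wrapper
```

Measurements like this add a little latency themselves, which is the trade-off mentioned earlier about shipping profiling in a release.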
So I looked at the loading code in the GTK port, and it looks really good. I learned a lot just by looking at it. I have one question: could we change the staging area for the unzipped .msk to something other than a temporary folder? Temporary folders are meant to be deleted; however, if Manuskript were to crash, it would be smarter if it were somewhere we can access and restore the files from.
> Temporary folders are meant to be deleted; however, if Manuskript were to crash, it would be smarter if it were somewhere we can access and restore the files from.
I think that should depend on the configuration/settings. We don't want to duplicate storage requirements if it's not necessary. I would assume that adding auto-save later on will prevent losing too much data anyway, because when saving the project, it creates a new archive and replaces the old one. So if that gets triggered automatically every few minutes, it should be rather safe.
Another thing to implement is the new revision system. If we are using Git or SVN, we can simply sync all changes from the temporary directory to any remote as well. So users can pick any Git remote service, for example, or sync to a local remote on their own network. I assume backing up changes on another device is far better than making another local copy.
Maybe we can implement a local copy to a user-selected directory as a rather simple revision system. I think it would make sense to implement revisions in a modular way, like the spell checking. This goes in a similar direction as the idea to allow plugins, I assume. So having a local copy of the latest state available as a potential revision could be a pretty simplistic implementation of such a revision system.
I thought it would do something like delete the file once the project is saved. Whenever Manuskript starts, it would check the folder, see whether there is a file there, and if so, ask to restore it.
So let's say you have a .manuskript folder in the home directory. When Manuskript starts, it'd check .manuskript/temp for a project. If there isn't one, it just goes about its normal job. If there is, it asks if you want to restore it, and if not, deletes it. When Manuskript opens a file, it uses temp as the temporary folder, and when you save/close, it deletes the contents of temp. If temp isn't deleted, Manuskript crashed, but the files are saved.
That said, I have already been looking into a Git revision system, and that's another reason I wanted to do this. I can't find the temporary file path to initialize the Git system.
Plugins for different revision systems are a good idea too. We just really need three parts: an initializer, something that makes the commits (using Git as an example), and a viewer/restorer.
I think one problem with storing only one restore point for Manuskript is that users could open multiple instances. Also, they could crash with one project and then start the application with a different project as a start parameter. So I think it's best to keep track of such a state within the context of the project to mitigate that.
> That said, I have already been looking into a Git revision system, and that's another reason I wanted to do this. I can't find the temporary file path to initialize the Git system.
The `Project` class uses a `file` property representing the .msk file. This provides the path of the temporary (or actual) directory via its `directoryPath` property. You can see it's used in the `Project` constructor to initialize all the other data components with that path. So they load and save their content relative to it.
That's easy. In the hypothetical .manuskript/tmp, with folders for each project in it, you can also have a .json file with a dictionary of unique names for the projects and their locations. I already wrote code to check for duplicates in a dictionary. We might want to set this up anyway, so only one instance of a project can be open at a time; otherwise there might be problems.
> That's easy.
Trust me, even though we could manage multiple restoration points for multiple projects in parallel, it won't be easy. We have to manage parallel file access from multiple processes, implying that it will likely require inter-process communication, via shared memory for example. ^^'
The whole reason I would defer issues like this until implementing a proper revision system is that we might solve multiple issues at once, reducing code, meaning less maintenance burden and fewer edge cases to handle (decreasing the complexity of the application).
For example, if we implement a system able to handle simplistic revisions, we can provide a backup on crashes as well as revisions without extra dependencies like SVN or Git. Also, if we make backup functionality part of the revision system, we can utilize Git and SVN for it, syncing the latest changes from the remote automatically... it will also reduce conflicts when changes in your Git branch and the backup files won't merge easily. So we can completely avoid merging issues between different kinds of file/change management.
I don't understand how it'd be this complicated. I don't understand what multiple processes have to do with it. Instead of having Manuskript change a temporary file, have it modify a "temporary file" somewhere it won't be deleted (namely our own setup in a .manuskript folder somewhere). Also add a JSON list of open files, with a check that no duplicate files are opened.
Are we not understanding each other? I feel like we're talking about different things.
I'm talking about multiple Manuskript instances running at once, potentially creating different states for backups of the same project or different projects. I've experienced issues like this with other applications that create states to jump back to when the application crashed. So I'd like to implement something like this in a very reliable way.
I'm talking about changing the location of the folder which Manuskript uses as its "temporary file" to edit in the cool streaming feature. If we put it in a place the system won't automatically delete, then we can recover from exactly where Manuskript last did anything.
It won't be hard. If you pull my request, I can implement it and show you (hopefully) by the end of the week. (I don't want to mess with branches quite yet; if you don't want to merge my request yet, then let me know what else it needs.)
The reason I don't want to change the location where compressed projects are temporarily stored is that the whole point of changing the IO logic of Manuskript was making it more stable. In the current implementation, no archive created by Manuskript can be corrupt: all files are stored in a unique directory, the actual archive is only replaced after the compression finishes, and if Manuskript crashes during this process, temporary invalid states are lost.
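The crash-safe replacement described above can be sketched as compress-then-swap. This is an illustrative sketch under the stated assumptions, not the actual save routine; function and path names are hypothetical:

```python
import os
import zipfile

def save_single_file_project(work_dir, msk_path):
    """Compress the working directory, then atomically replace the .msk."""
    partial = msk_path + ".partial"
    with zipfile.ZipFile(partial, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(work_dir):
            for name in files:
                full = os.path.join(root, name)
                archive.write(full, os.path.relpath(full, work_dir))
    # Atomic on the same filesystem: a crash before this line leaves the
    # old archive untouched; a crash during compression leaves only the
    # ".partial" file, never a corrupt project archive.
    os.replace(partial, msk_path)
```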
So everything in this implementation is intentional. I don't think it currently needs any changes in that regard, since it works pretty reliably with all the projects I tested it with, and it's compatible with the older way of handling projects while heavily reducing complexity.
Yes, backups in case of crashes are planned. Yes, a revision system is planned. But given that it's also planned to reduce latency by handling minor changes in files instead of making full project saves all the time as before... a backup system is not a priority. The plan is to make it not crash by fixing issues now, not to add exception handling for when issues appear.
Alright. It was my understanding that the GTK port opened a .msk file, extracted it to a temporary folder, and then edited that, and that's how the design was moving forward. I.e., "Oh, someone opened Chapter One; I'm going to load it, because I don't hold all 100 chapters in memory, because that's slow." The files are saved to the temporary folder and then, when it's done, zipped as the save. Is that right?
I want to make sure, because I still don't think we're on the same page. (Sorry if I'm being difficult, but I still don't understand.)
If you use it with projects as directories (which is the recommended way of storing your projects), it accesses each file from disk when needed. Changes are stored to the files when saved (if automatic saving is implemented, that's pretty much immediately after changes are made in memory).
If you use it with single-file projects, it extracts your project into a temporary directory. All files will be in this directory and can be accessed when needed. Changes are stored to these files when saved, and the directory gets compressed into an archive. When the archive is done, it replaces the original archive (the project file).
OK, I've got it. You're right, my idea would be very bad if it worked that way.
I was thinking about it all wrong. I must have misunderstood the code.
However, it might be worth at least considering the way I was thinking.
If we had the program extract into a folder and had Manuskript read files when the software needed them, and then write them when you were done (clicked off the tab or changed the working file in the editor), it would save. Git and other revision systems would be easy: you just set up the Git repository in the unzipped folder and record the changes whenever the user changes a file (using the same call as saving). I have no idea how you'd set Git up if the project was always stored in memory.
Also, this way it'd be easier for smaller computers (and phones) that need every bit of memory they have, if Manuskript only opens one file at a time and closes it when it's done, instead of having a 1 GB project in memory after you open enough files. (I know a lot of people don't have projects that big right now, but it's always good to make performance improvements, right?) There is also the fact that if it crashes, the data is still in the staging folder. Granted, the way you describe directory mode, it basically is automatically saved anyway, but it could also help with single-file projects.
The only problem I can think of is that you'd have to work on searching, so it'd open each file and search in it separately; links and references might have to be done that way too (unless we did a hybrid sort of thing where characters are written down automatically but story chapters are
I know it's a bit of a jump from what is already written, but it's just a thought. (You'll probably find some problems with it.)
I don't think we should worry as much about the memory footprint. First of all, we are already using Python, so memory management will be far from ideal anyway. That means micro-optimizations are likely not worth it. Second, we mostly manage text data, which does not take much space at all. Even if we search via regex for a certain text across all content of a project, performance will be linear in the project size. So the only way to achieve better performance is either multi-threading, which Python is awful at (implementation-wise) from my research so far... or we improve searches via memory and data structures (something like caching results or building tree structures).
At the moment I'd say we should have a search first before optimizing it. It is likely not even an issue, since most of the time a user is writing, not searching. So when you think "it's always good to make performance improvements", I would say it depends... the more time we spend optimizing existing functionality, the less time we have to spend adding new features. That is why optimization should relate to how important the given functionality is for the software.
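The linear-scan search mentioned above can be sketched in a few lines. This is an illustrative sketch, not proposed Manuskript code; the directory layout and the restriction to Markdown files are assumptions:

```python
import os
import re

def search_project(project_dir, pattern):
    """Walk the project directory, return files whose text matches."""
    regex = re.compile(pattern)
    hits = []
    for root, _dirs, files in os.walk(project_dir):
        for name in files:
            if not name.endswith(".md"):
                continue  # only scan scene/Markdown files in this sketch
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as handle:
                if regex.search(handle.read()):
                    hits.append(path)
    return hits
```

Runtime grows linearly with total project size, which is why caching results or index structures would be the next step if it ever became a bottleneck.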
I suppose in a way you're right about the memory.
However, unless you have an idea for setting up revisions via Git a different way, I think this would be the best approach. Plus, if we did editing of files and folders, you'd be able to get pretty close to live collaboration using Git and some sort of hosting service. (I don't know if that's a planned feature, but I think it'd be a cool one.)
Also, I think there's a way you can schedule GTK to do things on a different thread. I don't remember what it is called, but you keep talking about threading, and if it isn't bad programming practice, we could use it instead.
> Also, I think there's a way you can schedule GTK to do things on a different thread. I don't remember what it is called, but you keep talking about threading, and if it isn't bad programming practice, we could use it instead.
GTK3 (which we currently use for the refactoring) doesn't really support multi-threading. I think technically it is possible to use threads for other tasks, but all accesses to UI components should happen on the main thread. However, you can add tasks which will be processed when the main thread is idling or only doing background tasks. So that's how we want to implement asynchronous loading and auto-saving.
Those idle tasks still run on the main thread and are therefore allowed to make changes to the UI without causing any synchronization issues. But we won't know exactly when they are called. The best we can do is keep events and tasks pretty lightweight, so nothing blocks other processing too much.
If we then end up with some long processing task still bottlenecking the application, we can probably think about multi-processing. But I don't think we should run into such a thing as long as there are smart solutions available, like separating some processing tasks into pieces; exporting should run in a separate process via Pandoc anyway later on.
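The "separate a long task into light-weight pieces" idea can be sketched as follows. In GTK3 the `step` function returned here would be registered via `GLib.idle_add`, which keeps invoking it on the main thread while it returns True; the loader below is plain Python so the pattern stands on its own, and the names plus the "load" performed are purely illustrative:

```python
def make_idle_loader(paths, on_done):
    """Build a step function that loads one file per idle callback."""
    pending = list(paths)
    loaded = []

    def step():
        if not pending:
            on_done(loaded)
            return False  # GLib would now remove the idle source
        path = pending.pop(0)
        loaded.append(path)  # a real loader would read and parse the file
        return True  # ask to be scheduled again on the next idle moment

    return step
```

Because each step does only a small amount of work, the UI stays responsive between callbacks, and because everything runs on the main thread there are no synchronization issues.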
I basically have no idea what you said. I'll take your word for it.
Aside from the threading, I'd be interested to know if you know how to set up Git with the current data setup. I want to know whether I should start experimenting with changing Manuskript to the way I suggested, or whether it's better the way it is now.
Setting up Git or SVN in a project should be as simple as initializing the project's directory (or the temporary directory of the extracted files) as a repository. Then you should be able to use all the commands. The `.git` folder, containing configuration and commit history, will just become part of the project when Git is used. So it becomes part of the archive in single-file mode, for example.
We will definitely need an interface to configure Git from inside Manuskript then, because in single-file mode the directory to use for Git from the command line will vary...
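The setup described above boils down to running git inside the right directory. A hedged sketch with only standard git commands; the wrapper names are illustrative, and a real integration would need error handling and user identity configuration:

```python
import subprocess

def init_repository(project_dir):
    """Turn a project directory (or extracted temp dir) into a repo."""
    subprocess.run(["git", "init"], cwd=project_dir, check=True)

def commit_all(project_dir, message):
    """Stage every change in the project and commit it."""
    subprocess.run(["git", "add", "-A"], cwd=project_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=project_dir, check=True)
```

In single-file mode, `project_dir` would be the temporary directory of the extracted archive, so the `.git` folder travels inside the .msk when the project is re-compressed.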
Yes, that's what I think Git could be used for. But if the files being tracked are not changed until the project is saved, there is no point in having revisions that track files every time something changes. (You said the changes weren't made to the temporary folder until saving.)
But if changes are not saved immediately to the temporary folder, there will be nothing for Git to track until it's done.
Also, streaming from the files makes plugins easier. Instead of having to add data to the project class, you can just have the plugin make changes in the file directory and save by itself. For example, if I wanted to add a calendar, and my calendar tab (plugin) was supplied the directory where the project is saved, then when you click off it, it'd save to a calendar .json file.
I am willing to try rewriting the existing code to do this; just let me know if you want me to.
> But if changes are not saved immediately to the temporary folder, there will be nothing for Git to track until it's done.
Yes, we will likely enforce some kind of auto-saving when using revisions. But I have to do more research on how users would expect revisions to be used. We certainly don't want to create a commit for every word changed in a text file, because Git can scale pretty roughly with too many entries in its tree structure. Maybe we can amend or squash changes, though.
Also, commit messages (for Git and SVN) will need to make sense to users, so they can use third-party visualization tools for their revision log as well.
> Also, streaming from the files makes plugins easier.
I disagree. We don't want plugins to scale horribly. So each plugin shouldn't start its own file accesses if not necessary, because IO calls are much worse for performance and latency than accessing data in memory.
We can likely give plugins access to the abstracted data structures I've implemented, and we can also implement a similar abstraction for revisions which uses modules (for Git, SVN and others) internally. So plugins don't need to know how anything works at the lowest level.
Some plugins should have access to the file directory; without it, some of them wouldn't work, for example if a plugin uses some sort of binary that Manuskript doesn't handle. I'm less worried about speed (because if it's loading on demand, it still should be pretty fast), but there is the possibility of evil plugins, so I was thinking of having a list of supported plugins.
I actually have been using Git for my big writing projects ever since you mentioned it in my trash error. I don't know how other people would want it, but I know how I would want it to work.
First, aside from revisions, we're going to need to implement a better undo/redo function. It'll need to be done anyway if we're working towards a GtkWebView for the editor display. It should keep track of the last twenty changes and cover not only the editor but also adding and removing characters, files, and other things.
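A bounded undo/redo history like the one proposed can be sketched with a capped deque. This is an illustrative sketch, not a design decision; the class name is hypothetical and the "changes" are treated as opaque snapshots:

```python
from collections import deque

class History:
    """Keep the last `limit` changes; undoing moves them to a redo stack."""
    def __init__(self, limit=20):
        self._undo = deque(maxlen=limit)  # oldest entries fall off the end
        self._redo = []

    def record(self, change):
        self._undo.append(change)
        self._redo.clear()  # a fresh change invalidates the redo history

    def undo(self):
        if not self._undo:
            return None
        change = self._undo.pop()
        self._redo.append(change)
        return change

    def redo(self):
        if not self._redo:
            return None
        change = self._redo.pop()
        self._undo.append(change)
        return change
```

The same structure works whether a "change" is a text edit or the addition/removal of a character or file, which matches the idea of extending undo beyond the editor.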
With that, honestly, Git should save characters and plots whenever you create a new one or finish another, and for the editor, every time you change files or every two minutes (when it saves). Clearly, you'd also be able to manually add commits.
There should be intelligent commit messages; for example, if you were editing Character X, then it would say Character X, and so on. I don't think keeping track of every word change is necessary. Of course, I haven't used other revision programs, so I don't know what is expected.
Another thing to note: Git keeps track of differences per line, and a line is an entire paragraph in Manuskript. I don't have a problem with that, but it could throw a lot of people off.
Honestly, I think the order we should work on this, though, is:
Honestly, Git revisions don't need to be in the initial Manuskript release, and we can worry about most of it later; I just wanted to get the file management figured out in relation to it.
Is there any progress on this @TheShadowOfHassen?
> Is there any progress on this @TheShadowOfHassen?
@TheJackiMonster has a specific way he wants the feature done. I do not quite understand his method, however, so I am waiting on him to implement it. Sorry
> @TheJackiMonster has a specific way he wants the feature done. I do not quite understand his method, however, so I am waiting on him to implement it. Sorry
No problem, I was just curious about how it was going. Thanks.
Okay, before I say anything: it is assumed that all future versions of Manuskript will have backwards compatibility, if not instant loading then via some import settings.
So with the GTK port, we're actually pretty close to needing a save. The character tab has 100% functionality (I think), and so do the general, summary and plot tabs. The world tab just needs drag and drop. The actual editor needs a lot of work, but some of that work could actually hinge on which way we want to go.
I think that the save method needs to meet the following criteria:
So currently Manuskript has two options, a file mode and a folder mode. I like the idea of a compressed zip file with the .msk name; however, I don't know if Git revisions will work on a compressed format, and it might also be smart to set up some sort of streaming saving/loading for Manuskript, because if you have 30 text files with 5000+ words each, they don't always need to be loaded. I think we don't have this problem with Qt Manuskript, but you get my point. Also, if we're going to do an update with images, Python's native zipfile module has no easy way to copy images into an archive.
Personally, I don't like the current folder mode, because it has a folder and a .msk file, albeit an empty one. I think that can get confusing for people: "Oh, why is this folder here? I'm going to delete it." That would be bad.
So I'll propose two different ways to do saving. If people have other ideas, they can add them. Both ways are just uncompressed files. This way Git would work, as would simple ways to handle images or any other kind of file we wanted. (Maybe this should be a different issue, but adding things like a Krita map to Manuskript, with a button to open it in Krita so all your files are in the same place... I don't know if it fits what Manuskript is supposed to be, but it's an idea.)
First is Bibisco's way. I don't know if anyone checks on that project, but they have a folder where they store files for each project. I thought it makes things look modern and clean, but some people might not like it.
The other idea is to do it like an Obsidian vault: instead of having a file and then a folder, have a file or folder with, inside it, a file that marks it as a Manuskript project. Obsidian keeps this hidden, but Manuskript doesn't need to.