Bug: Large Files Seem to Parse for Forever

jerviscui commented 2 years ago

I tried to format 50k text using Linter and the program got stuck. Has anyone tried formatting large files?

In fact I have 150k of large text.

pjkaufman commented 2 years ago

I am not sure. It takes about 15 minutes to go through 2 to 3 thousand files for me. Are you letting it go through and try to finish?

Do note that the larger the file, the longer some of the rules will take because parsing is not inexpensive. It takes a while to parse some things like tables, so a large file will take a lot longer to parse.

jerviscui commented 2 years ago

I can't wait for it to finish executing, it's too long.

I tried a 3000-line file in VS Code, using the plugin https://github.com/DavidAnson/vscode-markdownlint for formatting, and it took less than a minute.

pjkaufman commented 2 years ago

Did you try your 150k line file with the plugin you referenced?

I know the linter can have performance improvements, but as of right now the linter has to parse the file to know when to apply certain changes. This can be speed up with certain optimizations, but for 150k lines, it will take some time.

What rules do you have enabled?

pjkaufman commented 2 years ago

Also, doesn't a 3k line file take less than a minute for the linter as well?

I know that I want to cache the parsing results for a file, but it feels like it would not help on really large files as I would have to reparse the file each time there is a change to the file.

pjkaufman commented 2 years ago

Also, if you provide the large file, when I get to caching optimizations, I can see how much of a speed up there is.

jerviscui commented 2 years ago

Did you try your 150k line file with the plugin you referenced?

I know the linter can have performance improvements, but as of right now the linter has to parse the file to know when to apply certain changes. This can be speed up with certain optimizations, but for 150k lines, it will take some time.

What rules do you have enabled?

I meant 150kb.

jerviscui commented 2 years ago

I tried a 3000-line file in VS Code, using the plugin DavidAnson/vscode-markdownlint for formatting, and it took less than a minute.

In fact, it's quick, much less than a minute. I'm just trying to be euphemistic.

I had to split the large file into multiple smaller files because of the formatting speed. You can try this file, although he's not strictly markdown. You can intercept 150kb for testing purposes. 新建文本文档.md

I created front meat and h1 via linter. I automatically trigger the linter when I save. Thank you very much for the plugin!

jerviscui commented 2 years ago

This is my settings.

data.txt

grantx2016 commented 2 years ago

Linter hangs when attempting to lint the attached markdown file (Note: Linter is configured to only add front matter at this stage - nothing else). No error is show in the console so I only found this was the troublesome file by trial and error which is not ideal. Hopefully this issue can be resolved soon as the rest of Linter has been terrific so far. 2021-07-05 Google Product Taxonomy.md

pjkaufman commented 2 years ago

Linter hangs when attempting to lint the attached markdown file (Note: Linter is configured to only add front matter at this stage - nothing else). No error is show in the console so I only found this was the troublesome file by trial and error which is not ideal. Hopefully this issue can be resolved soon as the rest of Linter has been terrific so far. 2021-07-05 Google Product Taxonomy.md

I see the hanging and it is likely due to parsing the file, but I am not entirely sure. Have you tried swapped the log level in the data.json for the plugin to include all output (i.e. setting the value to 0)?

Without the data.json it is hard to tell what causes it to hang in your case.

pjkaufman commented 2 years ago

Here is a link to a discussion that was had around better handling of caching which I would like to be able to get around to implementing for better performance all around: https://github.com/blacksmithgu/obsidian-dataview/discussions/1501

pjkaufman commented 2 years ago

I tried a 3000-line file in VS Code, using the plugin DavidAnson/vscode-markdownlint for formatting, and it took less than a minute.

In fact, it's quick, much less than a minute. I'm just trying to be euphemistic.

I had to split the large file into multiple smaller files because of the formatting speed. You can try this file, although he's not strictly markdown. You can intercept 150kb for testing purposes. 新建文本文档.md

I created front meat and h1 via linter. I automatically trigger the linter when I save. Thank you very much for the plugin!

It seems that for some odd reason it stalls out. I am not yet sure what the cause of this is. I have not even gotten it to tell me what the actual rules it is running are which tells me that likely the thread is continuously running or somehow parsing has hit an infinite loop. It could be that the parser does not play well with Chinese (this is my guess as to the language found in the document). I am still taking a look to see if it ever finishes running. However I am not sure what the actual issue is at this time.

pjkaufman commented 2 years ago

But the issue with the initial file reported does seem to be bug, but I will have to tinker some more to see if I can find any more info on what is going on.

pjkaufman commented 2 years ago

I found a slight bug unrelated to this that I need to fix. Once that is done, I can go ahead and push the changes up and get a version release.

grantx2016 commented 2 years ago

Linter hangs when attempting to lint the attached markdown file (Note: Linter is configured to only add front matter at this stage - nothing else). No error is show in the console so I only found this was the troublesome file by trial and error which is not ideal. Hopefully this issue can be resolved soon as the rest of Linter has been terrific so far. 2021-07-05 Google Product Taxonomy.md

I see the hanging and it is likely due to parsing the file, but I am not entirely sure. Have you tried swapped the log level in the data.json for the plugin to include all output (i.e. setting the value to 0)?

Without the data.json it is hard to tell what causes it to hang in your case.

@pjkaufman attached is my data.json file as requested.

Also, I changed the log setting as suggested but don't know where to find the log to send you. The console still shows nothing. Here is the trace. There is nothing more shown. Just hung again necessitating a quit. Screen Shot 2022-10-28 at 9 08 55 am data.json.zip

pjkaufman commented 2 years ago

It looks like if nothing is getting run then it is stalling out somewhere. I will go ahead and see what happens when I run it locally via the tests that I have and see if I can get it to run or if it just stalls out. The parser seems to work, so I am guessing there is a bug in the logic somewhere that I have yet to find. Hopefully I will have more time to investigate and get to the bottom of this soon.

pjkaufman commented 2 years ago

Looks like it works just fine until it gets to the yaml rule for lint title (I am not too sure all the rules it has problems with), but it hangs on that rule. I am working on figuring out why.

pjkaufman commented 2 years ago

Looks like it is any level of parsing in the app or in the tests. It may be operator error on my end in how I wrote the code since it seems to work just fine when not done from inside the code I have written. But I will have to come back to this at a later point.

pjkaufman commented 2 years ago

Here is the line I cannot seem to get past when using console logs to help debug the file: https://github.com/platers/obsidian-linter/blob/master/src/utils/mdast.ts#L47

If someone else has any ideas what might be causing the issue that would be helpful as well. I am thinking the next step is to try this out in its own npm package and see if the same thing just stalls indefinitely. I am adding this to the board queue.

pjkaufman commented 2 years ago

I have a test example setup to run all night if need be with it being timed. If it finishes by morning, I should know. If it takes a really long time or does not finish at all I will push what I have found to the creator of those packages and see if they have any feedback on things to try to avoid it stalling out. I am aware that trying to do a table parse is slow based on some of the info provided, but whatever seems to be going on with it stalling for 15+ minutes is definitely absurd. I am really hoping this is just an operator error on my end.

pjkaufman commented 2 years ago

It looks like there is an issue with the github flavored parsing for the Google taxonomy file as it took something like 78 minutes to parse:

real    33m52.407s
user    77m47.900s
sys     12m18.686s

Without parsing and factoring in github it takes less than 5 seconds to parse:

real    0m0.661s
user    0m1.113s
sys     0m0.075s

pjkaufman commented 2 years ago

This seems to be an upstream issue, but I may be able ti mitigate it, by creating a couple of workarounds. The problem is they are either hacky or can cause problems:

only parse with github when needed (this will hit the same issue when the parsing is required)
not using the AST for identifying tables which can have missed tables or invalid identifications, but should be faster by using regex like so
if a file has more than 200 lines in the same paragraph (this comes from https://github.com/micromark/micromark-extension-gfm-table/issues/9) then just do not run rules that would require parsing a table and just log a warning and display the warning to the user

What are you thoughts on these hacky solutions?

pjkaufman commented 2 years ago

I am currently leaning towards option 2 as I used that in the past. I have something somewhat ready, but I would need to actually verify that it runs in a reasonable amount of time. I am timing it, but it seems more performance changes will be needed to get it running better. It seems to still stall. I have it timing the run, so maybe it won't be as bad, but it seems that large paragraphs are not handled well by the Linter (i.e. lots of text and lines with no empty lines between them).

pjkaufman commented 2 years ago

Sorry. I missed that lint on save was turned off. So I can actually see that the issue is resolved by the change to no longer use table parsing directly. This may cause some bugs around table formatting with spacing. But I guess that is the price that will have to be paid in order to tackle this issue.

pjkaufman commented 2 years ago

It looks like running the Linter took a couple of seconds for the whole lint of the Google taxonomy file. I am now checking the other file mentioned here.

pjkaufman commented 2 years ago

For the data.json provided for the original issue, it took less than 1 minute. I would speculate the whole thing took around 20 or 30 seconds at most. I still need to fix a couple things up, but it should make things more performant for larger files. I will have to add more performance improvements later down the road (hopefully in the next couple of months).

pjkaufman commented 2 years ago

The issue was caused by a dependency, so hopefully this change will speed things up. I just need to fix a couple of bugs created by the change and then I will push it up and do a release barring anything else popping up as an issue.

jerviscui commented 2 years ago

Yes, at least using a fuse logic is better than waiting endlessly, and in most cases linter is working properly.

pjkaufman commented 2 years ago

This should be taken care of now on master. I will push out a release soon. If you would like, feel free to the master branch in the meantime. Please let us know if the issue is not resolved on master or the next release.

platers / obsidian-linter

Bug: Large Files Seem to Parse for Forever #458