Matlab parsing "seems" to hang if code contains a bunch of "%" consecutively

PekaryGergelyR commented 5 years ago

For example:

classdef Dummy
    %%%%%% Some ill formatted stuff %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    methods
        function functionEscapingArgument(obj)
        end
    end
end

First I thought it is an endless cycle, but eventually it will finish, maybe a warning for this ill formatted code?

joeced commented 5 years ago

Found the problem, a regular expression in MatObject._rename_function_variables seems to be the issue.

PekaryGergelyR commented 5 years ago

Actually the original does not even work as if I add the following docstring % Constructor functionVar functionVar, then it will be rendered as Constructor functionVar funcVar.

I think there is no point creating a regex for this kind of renaming as it could waaaay hard and complicated. Instead there is a hack solution, rename every occureance of function[^\s]+ with a unique id and rename it again after the pygments lexer is finished. This way only the actual function properties will be left for the lexer. What do you think?

PekaryGergelyR commented 5 years ago

Okay I hacked together a solution, so for this Matlab code:

    function obj = ClassWithFunctionVariable(the_functions)       
         % Constructor functionVar functionVar
         % function[
         % function(
         % function;
         % function

         this = ['%', 'functionVar'] % functionVar
         % Determine the name and M-file location of the function handle.
         functionHandleInfo = functions(testFcn);
         self.Name = functionHandleInfo.function;
         if strcmp(functionHandleInfo.type, 'anonymous')
            % Anonymous function handles don't have an M-file location.
            self.Location = '';
         else
            self.Location = functionHandleInfo.file;
         end
      end

I created this:

    @staticmethod
    def _rename_function_variables(code):
        """
        Rename variables in code starting with function*, as Pygments parses
        this incorrectly.

        :param code:
        :type code: str
        :return: Code string without function-variable names.
        """
        key = 'SPHINX_MATLAB_FUNCTION_TAG'
        # Find any function occurence where either:
        # - there is something ahead which is not \s or \t
        # - or there are only \s \t ahead but the following character is not \s or \t
        matcher = re.compile('^(?P<p>(?P<s1>.*?[^\s\t].*?)function|(?P<s2>[\s\t]*)function(?P<e2>[^\s\t]))', flags=re.MULTILINE)
        match = matcher.search(code)
        while match:
            [start, end] = [match.group('s1'), ''] if match.group('s1') else [match.group('s2'), match.group('e2')]
            code = re.sub(re.escape(match.group('p')), '{}{}{}'.format(start, key, end), code)
            match = matcher.search(code)
        return code

And after the tokenization the keys must be replaces again:

        # The word "function*" cannot be used anywhere
        # encrypt function variables
        code = MatObject._rename_function_variables(code)

        tks = list(MatlabLexer().get_tokens(code))  # tokenenize code
        # decrypt function variables
        tks = [(token, value.replace('SPHINX_MATLAB_FUNCTION_TAG', 'function')) for token, value in tks]

Might not be the best solution, but it is one. This also solves my original issue.

joeced commented 5 years ago

Thanks you very much. It looks better than the original solution. I will look into it tomorrow.

joeced commented 5 years ago

I've now investigated a bit more. It looks like I can actually delete the function _rename_function_variables. With the 1a50005ab492415ca99028e24235986111a7f84a commit I changed how the parsing works (in general less likely to crash on invalid MATLAB code). It apparently solved the problem with variables named function*. But I'll test some more, before committing.

PekaryGergelyR commented 5 years ago

That would be nice! However, I do not understand I am originated from that commit and it does not work for me.

I looked into the pygments lexer and just one line has to be modified: (r'^\s*function', Keyword, 'deffunc'), to (r'^\s*function\s', Keyword, 'deffunc'),

I created an issue for it on the bitbucket page: https://bitbucket.org/birkenfeld/pygments-main/issues/1491/matlablexer-cannot-handle-literal-function

joeced commented 5 years ago

Thank you so much for filing the issue, let us cross fingers its taken care of. I'll will still look more into this, to make the file parsing more robust.

PekaryGergelyR commented 5 years ago

Aaaaaand I was wrong again sorry. The "correct" way to fix the pygments is to use the following (r'^\s*function(?=[\s[])', Keyword, 'deffunc') As a function could be in the format of function[a,b,c]=f_no_spaces(x,y,z) and the character must not be consumed.

Eventually this will not solve the problem as the function token will be marked as a Token.Keyword nonetheless. The solution might be to check whether it is followed by a Token.Name.Function or not.

What do you think?

I think an easy solution might be, to check that the function keyword must be preceded by a new line and it must be followed by either \s or [:

    @staticmethod
    def _fix_function_variables(tokens):
        for idx, token in enumerate(tokens):
            temp_token = (token[0], token[1].lstrip())
            if temp_token == (Token.Keyword, 'function'):
                previous_token = tokens[idx-1] if idx > 0 else None
                if previous_token and not previous_token == (Token.Text, '\n'):
                    # This is not a real function as it is not after a new line
                    tokens[idx] = (Token.Name, token[1])
                    continue
                next_token = tokens[idx+1] if idx < len(tokens)-1 else None
                if next_token and not re.match('[\s[]', next_token[1]):
                    # This is not a real function as it is not followed by '\s' or '['
                    tokens[idx] = (Token.Name, token[1])
                    continue
        return tokens

sphinx-contrib / matlabdomain

Matlab parsing "seems" to hang if code contains a bunch of "%" consecutively #85