Snuggletex Integration - Githubissues

Zylence commented 1 year ago

Snuggletex

Snuggletex is a library from the University of Edinburgh for converting latex to XML, but can be used for latex parsing as well. It is extendible, easy to use and powerful, all whilst containing almost no external dependencies. In the future, it could become our main latex parser for integrity checks.

What it does do

^
_
$
commands: \command[opt]{arg}{...}
verbs: \verb(*)...
environments: \begin{env}[opt]{arg}{...}...\end{env}
user-defined commands

Takeaways

Some commands may be missing, for example I found \text{} to be absent, to check which commands are supported by default, refer here: CorePackageDefinitions.java Thankfully, enough of the package is exposed to be able to inject new commands, like so for example: engine.getPackages().get(0).addComplexCommandOneArg("text", false, ALL_MODES,LR, StyleDeclarationInterpretation.NORMALSIZE, null, TextFlowContext.ALLOW_INLINE);

I have not checked if this is the correct way to represent the text command, but now it parses it correctly.

What it does not do

&
#
%

What we could do

Use the tokens provided by snuggletex to implement our own parser on top

Or
Keep our integrity checks for & and # and implement % like we used to

What I would like to do

I am really fascinated by this library, it's clean, well documented and build thoughfully and extendible. I'd really like to do more with it. If you do not mind, I'd like to port our integrity checks to snuggletex, rather than writing them as we used to.

Mandatory checks

[x] Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
[x] Tests created for changes (if applicable)
[x] Manually tested changed features in running JabRef (always required)
[ ] Screenshots added in PR description (for UI changes)
[ ] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
[ ] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

koppor commented 1 year ago

In JabRef's localization, we keep the English keys equal to the English full text. In this way, we can have readabile code. Details at https://devdocs.jabref.org/code-howtos/localization.html.

Regarding the SnuggleTeX errors, it would be OK to show "LaTeX cannot be parsed" and have the detail message in English only. These are all very technical terms and I thnk, no JabRef tranlsator should be bothered to translate them.

koppor commented 1 year ago

To workaround the method-too-large exception, I am trying to include our custom JDK build.

koppor commented 1 year ago

Thank you for having worked on this. Sorry for the late reply. Hope, you will continue on this!

Some commands may be missing, for example I found \text{} to be absent,

\text appears in math-mode only. Not outside math mode. Maybe, it is available inside math mode (e..g, $\text{test}$).

I have not checked if this is the correct way to represent the text command, but now it parses it correctly.

Which concrete case did you have?

What it does not do

Please unzip what you mean.

Ah, I understand: It does not check for ampersands, wrong bibtex string concatenation, and percent signs.

For BibTeX string concatenation, we already discussed. Difficult thing as this is JabRef custom logic, too.

What we could do

1. Use the tokens provided by snuggletex to implement our own parser on top
   **Or**

Could be a fun excercise? With the possible outcome that the code is harder to read as our existing code?

What I would like to do

I am really fascinated by this library, it's clean, well documented and build thoughfully and extendible. I'd really like to do more with it. If you do not mind, I'd like to port our integrity checks to snuggletex, rather than writing them as we used to.

Suggestion: Finish this PR with the current functionality. - Create a new branch (based on this branch) to rewrite the other checks. In this way, you can work in parallel. If the updated code turns out to be good, we can continue working forward to include it. If it turns out it is not that maintainable, the issue https://github.com/JabRef/jabref/issues/8712 is still fixed.

Zylence commented 1 year ago

In JabRef's localization, we keep the English keys equal to the English full text. In this way, we can have readabile code. Details at https://devdocs.jabref.org/code-howtos/localization.html.

Regarding the SnuggleTeX errors, it would be OK to show "LaTeX cannot be parsed" and have the detail message in English only. These are all very technical terms and I thnk, no JabRef tranlsator should be bothered to translate them.

Okay, just to make sure I understand you correctly, a message would for example look like this:

LaTeX cannot be parsed: Nothing following \

And we would only translate this part: LaTeX cannot be parsed.

So having Nothing following \ inside the properties file is no necessary?

Consequently we would only need one entry inside the properties file: LaTeX\ cannot\ be\ parsed:=LaTeX cannot be parsed: %0

and "feed it" the internal error messages.

Zylence commented 1 year ago

To workaround the method-too-large exception, I am trying to include our custom JDK build.

Sounds great, I am looking forward to trying it out :^)

Zylence commented 1 year ago

Thank you for having worked on this. Sorry for the late reply. Hope, you will continue on this!

No worries, I was very busy anyway. I would be happy to continue :^) .

\text appears in math-mode only. Not outside math mode. Maybe, it is available inside math mode (e..g, ).

It really is not handled. I looked through the library with text search and tried it out. \text is an undefined command according to snuggletex.

Ah, I understand: It does not check for ampersands, wrong bibtex string concatenation, and percent signs.

Exactly, could have made this clearer, sorry for the confusion.

Could be a fun excercise? With the possible outcome that the code is harder to read as our existing code?

Just guessing, I'd say it won't make much of a difference. Perhaps we should just be happy with what we already have. I will do a case study regardless, so we can see the possible outcome before deciding.

Suggestion: Finish this PR with the current functionality. - Create a new branch (based on this branch) to rewrite the other checks. In this way, you can work in parallel. If the updated code turns out to be good, we can continue working forward to include it. If it turns out it is not that maintainable, the issue https://github.com/JabRef/jabref/issues/8712 is still fixed.

Sounds good to me, I'll do it. :)

koppor commented 1 year ago

Okay, just to make sure I understand you correctly, a message would for example look like this:

LaTeX cannot be parsed: Nothing following \

And we would only translate this part: LaTeX cannot be parsed.

So having Nothing following \ inside the properties file is no necessary?

Yes. We should move forward in that way. Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates...

Consequently we would only need one entry inside the properties file: LaTeX\ cannot\ be\ parsed:=LaTeX cannot be parsed: %0

The %0 needs to be on the left side, too. Key and value really need to be the same 😅

and "feed it" the internal error messages.

Yes!

koppor commented 1 year ago

Thank you for having worked on this. Sorry for the late reply. Hope, you will continue on this!

No worries, I was very busy anyway. I would be happy to continue :^) .

Looking forward 🤩

\text appears in math-mode only. Not outside math mode. Maybe, it is available inside math mode (e..g, ). It really is not handled. I looked through the library with text search and tried it out. \text is an undefined command according to snuggletex...

Maybe worth a PR for SnuggleTex? 😅

Could be a fun excercise? With the possible outcome that the code is harder to read as our existing code? Just guessing, I'd say it won't make much of a difference. Perhaps we should just be happy with what we already have. I will do a case study regardless, so we can see the possible outcome before deciding.

👍

Zylence commented 1 year ago

Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates...

Do you mean to translate them automaticly, via a service?

The %0 needs to be on the left side, too. Key and value really need to be the same 😅

Slip of the pen. 😅

Yes!

Errors are now prefixed with "LaTeX Parsing Error:" I found that quite appealing, but we can change that back to "LaTeX cannot be parsed" if you like.

Zylence commented 1 year ago

Maybe worth a PR for SnuggleTex? 😅

I can try, but do you think it will be merged? I saw your PR in the repo is still dangling around unmerged. 😅

Zylence commented 1 year ago

Please add test cases - and then this should be good to go for a review at the jabref repo

Tests will be ready over the next week. :)

Code looks pretty readable.

Thanks, just did some performance oriented refactoring regardless, but that should not have too big of an effect on readability. Engine and Session are now static members, prior to that we instantiated them per bib entry. (Keeping the references aroung only adds ~1 KB of memory overhead)

koppor commented 1 year ago

Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates... Do you mean to translate them automaticly, via a service?

No ^^. Here my line of thought:

The number error messages is greater than ten.
We do not know if we need to translate them all.
How to find out which messages should be translated?
Idea: Use telemtry! Collect the send error messages centrally.
After some time, we know the set of tuples (error message, number of occurrence)
These can then be translated.

The translation will be done as usual using JabRef_en.properties.

Errors are now prefixed with "LaTeX Parsing Error:" I found that quite appealing,

Reas good!

koppor commented 1 year ago

Maybe worth a PR for SnuggleTex? 😅 I can try, but do you think it will be merged? I saw your PR in the repo is still dangling around unmerged. 😅

@davemckain is the original developer on snuggletex. My bet is that he is happy if several people contribute to his repository - https://github.com/davemckain/snuggletex.

Note to self: The other "maintained" fork seems to be https://github.com/rototor/snuggletex. I would, however, like to stick to the "original" one ^^.

Zylence commented 11 months ago

Code looks pretty readable.

Please add test cases - and then this should be good to go for a review at the jabref repo

Test cases have been added in this commit. Some that I expected to work did not (due to snuggletex) these are commented out for now.

Zylence commented 11 months ago

Code looks pretty readable. Please add test cases - and then this should be good to go for a review at the jabref repo

Test cases have been added in this commit. Some that I expected to work did not (due to snuggletex) these are commented out for now.

Well that was unfortunate. I accidentally closed the PR in this comment, sorry.

Zylence commented 11 months ago

Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates... Do you mean to translate them automaticly, via a service?

No ^^. Here my line of thought:

The number error messages is greater than ten.

We do not know if we need to translate them all.

How to find out which messages should be translated?

Idea: Use telemtry! Collect the send error messages centrally.

After some time, we know the set of tuples (error message, number of occurrence)

These can then be translated.

The translation will be done as usual using JabRef_en.properties.

Errors are now prefixed with "LaTeX Parsing Error:" I found that quite appealing,

Reas good!

Oh, okay, now I get it - I did not know you had telemetry infrastructure for that kind of thing. Thanks for letting me know. I agree, narrowing down the selection of messages that need translation in advance is the better choice.

koppor commented 10 months ago

This should now be a PR to JabRef's main repo. I resolved the merge conflicts at https://github.com/koppor/jabref/commit/4fb512e76bb441c299702e3382a81e90789200ff.

Should go as one commit in the upstream repo - if possible.

Zylence commented 10 months ago

This should now be a PR to JabRef's main repo. I resolved the merge conflicts at 4fb512e.

Should go as one commit in the upstream repo - if possible.

Thank you, I am happy to hear that. I am excited to see how it will do in the wild! :d

koppor commented 9 months ago

Submitted https://github.com/JabRef/jabref/pull/10376 - therefore closing this.

koppor / jabref

Snuggletex Integration #646

Snuggletex

What it does do

Takeaways

What it does not do

What we could do

What I would like to do

Mandatory checks

What it does not do

What we could do

What I would like to do