New quality check and cleanup for &

koppor commented 2 years ago

Addition to https://github.com/JabRef/jabref/issues/8948

[x] Quality check for presence of unescaped & -- current user documentation: https://docs.jabref.org/finding-sorting-and-cleaning-entries/checkintegrity
[x] Cleanup action for escaping & (when unescaped) -- current user documentation: https://docs.jabref.org/finding-sorting-and-cleaning-entries/cleanupentries

KonstantinEger commented 1 year ago

Hi, seems like this is still free to take. We would like to take this as a first issue and post a PR soon :)

koppor commented 1 year ago

Thank you for choosing JabRef! Looking forward to your PR! - Would it be possible to tell us which course recommended JabRef? 🤩 - Here or via email (which you find on my profile)

As a general advice for newcomers: check out Contributing for a start. Also, guidelines for setting up a local workspace is worth having a look at.

Feel free to ask here at GitHub, if you have any issue related questions. If you have questions about how to setup your workspace use JabRef's Gitter chat. Try to open a (draft) pull-request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.

Siedlerchr commented 1 year ago

Quality check is now implemented: https://github.com/JabRef/jabref/pull/9758

Zylence commented 1 year ago

Regarding the Cleanup action, would you like the UI component as checkbox or entry inside 'field formatters'? (We currently use a checkbox for testing purposes)

koppor commented 1 year ago

This is a part of https://github.com/JabRef/jabref/issues/8712. When working on this, maybe completely fix https://github.com/JabRef/jabref/issues/8712?

Zylence commented 1 year ago

Hello, looking into this, I just realized that the cleanup action for unescaped ampersands is already implemented in jabref. So I am definitely up for a new task. :) I have to add though that I will not be able to work as much on it the next weeks - so if this is a time critical task, someone else should take it.

As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template.

We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign.

Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities).

And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part)

Siedlerchr commented 1 year ago

Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings

koppor commented 1 year ago

As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template.

While you are talking about, it is more complicated:

%: Yes, there needs to be an escape check!

$ can be an indicator of math formulas. Can you try the following: If $ appear in pairs, and there is a space before the first $ and a space after the second $ (or beginning/end of the string), it is ok?

Example: Title with a math formula: $x=x+1$.

Now, the thing with _ gets even more difficult: It is perfectly valid in math formulas as follows: $a_2+b_2=c_2$ .

Maybe, this detailed parsing is too much.

Idea: A LaTeX parser should be used to check if the title parses correctly. SnuggleTeX (Example: https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/java/uk/ac/ed/ph/snuggletex/samples/MinimalExample.java) seems to be good. A fork of it can be retrieved from maven central: https://github.com/rototor/snuggletex. Note: Playing around with that will take some time... In the long run, this could replace our latex2unicode converter. -- In the short run, it should only be checked if the content can be parsed using LaTeX.

We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign.

This is a hard one. As Christoph stated: # is typically used to concatenate strings: https://docs.jabref.org/advanced/strings. The current entry editor (and all the logic code) has no good support for that. The "magic" is happening at reading/writing a field. Not sure whether you really want to dive into that. -- Maybe it's as simple as checking the entry editor content similar to the $ above: # must appear in pairs.

Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities).

Cleanup for % is good.

No straight-forward cleanup for # possible. One could search for defined strings. If a matched string is found, then corrected. Example: User inputs #kopp and #kubovy, which is corrected to #kopp# and #kubovy#, because kopp and kubovy are defined strings. Maybe, its also possible to just correct without checking the strings. Needs some more thougt/testing.

And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part)

I would leave that part as is. No checking whatever.

I could neverthless try to understand it: Take an arbitrary command of the table. If it renders in JabRef correctly: No worries. If it does not render correctly: Add it to a check. -- However, this heavily depends on the latex2unicode converter. Instead of checking for things it cannot handle, the converter should be made better (or replaced by SnuggleTeX as outlined above). -- In other words: The user should not adapt their bibtex library because JabRef cannot handle it.

Zylence commented 1 year ago

Thank you for the detailed response. I will look into it and try out snuggletex this weekend.

$ can be an indicator of math formulas. Can you try the following: If $ appear in pairs, and there is a space before the first $ and a space after the second $ (or beginning/end of the string), it is ok?

In case we decide against snuggletex and need to check it manually, we can surely do that. But why check for spaces at the beginning and end? As far as I know they will not effect the functionality and the user may just leave them out. In any case, we can use the BibStringChecker (#) for orientation, it already implements the desired behaviour (just without the checks for spaces).

Zylence commented 1 year ago

Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings

I am sorry, initially I missunderstood the task: I thought the task was simply to warn the users whenever they use latex special characters, but now I realized that this would be very restricting and that the real task is to check whether it's valid latex to guarantee a pain free workflow for the user when integrating the bib files into his own documents.

Siedlerchr commented 1 year ago

It's a mix of both. But I would start with the low hanging fruits:

a) Check and b) cleanup for % and & ( I think % is already there -> LatexCleanup)
Check for balanced $ (e.g. always pairs)
Check if underscore is only inside a group of $$ (can be probaly done easily with regex)
If an underscore is outside $$ offer a cleanup either escaping it with \ or replacing it with \textunderscore

For the moment I would not consider # and the more complex cases. That should be fine.

Zylence commented 1 year ago

It's a mix of both. But I would start with the low hanging fruits:

a) Check and b) cleanup for % and & ( I think % is already there -> LatexCleanup)

Check for balanced $ (e.g. always pairs)

Check if underscore is only inside a group of $$ (can be probaly done easily with regex)

If an underscore is outside $$ offer a cleanup either escaping it with \ or replacing it with \textunderscore

For the moment I would not consider # and the more complex cases. That should be fine.

Okay, but I would like to use a proper parser like snuggletex "from the get go", since the code I write without using it would need to be replaced eventually anyways. Further I just now had some time to try it out and really like it! :-)

I have already tried some integration testing with snuggletex in jabref - it does not seem to immediately break any Tests:

Before: before

After: after

Using snuggletex it could be as simple as this (basicly takes care of all integrity checks):

SnuggleEngine engine = new SnuggleEngine();
SnuggleSession session = engine.createSession();
SnuggleInput input = new SnuggleInput(field.getValue()); //field is a BibEntryField
session.parseInput(input);
if (!session.getErrors().isEmpty()) {
    ErrorCode code = session.getErrors().get(0).getErrorCode();
     // filter only for tokenization errors
    if (code.getErrorGroup().equals(CoreErrorGroup.TTE)) { 
        //todo use custom translation
        String message = CorePackageDefinitions.getPackage().getErrorMessageBundle().getString(code.getName()); 
        //todo "handle" parsing error error
        session.getErrors().clear();
    }
}

This is jsut a mock and has to be improved but just for fun I edited one of the IntegrityChecks to use Snuggletex, seems promising:

works_as_expected

We'd need to copy and slightly adjust the libraries properties file into our own for custom translation, but that's no big deal. And Since CoreErrorCode is exposed, we could also easily filter for specific errors if we did not want to catch them all.

If you want to try how snuggletex integrates in jabref yourself, you can add these lines to the build.gradle:

implementation "de.rototor.snuggletex:snuggletex:1.3.0"
implementation ("de.rototor.snuggletex:snuggletex-jeuclid:1.3.0") {
    exclude group: "org.apache.xmlgraphics"
}

and this line to module-info.java:

requires snuggletex.core;

If you @koppor , @Siedlerchr and the other maintainers allow it, I'd like to integrate it. That way, the integrity checks are easily implemented, and we do not rely on handwritten comparisons or checking edge cases, but instead use a proper parser. The only thing left to do would be cleanup actions.

PS: I have not checked if all functions of it work, but tokenization and parsing seem fine.

Siedlerchr commented 1 year ago

Thanks for the investigation. We can try, but I think we run into problems with jlink + jpackage and the method too large error. You can try this locally, run a gradlew jpackage and see if it works. https://bugs.openjdk.org/browse/JDK-8240567

If this doesn't work we need your "manual" fix.

Create a branch with the snuggletex implementation (if this fails due to the jdk issue we can still keep the code and put it here at koppor's repo)
Create a branch without the external dependency reliance

koppor commented 1 year ago

The error messages are really great! Looking forward to see it integrated.

Regarding the jlink issue, we need to spend energy again to get https://github.com/openjdk/jdk/pull/10704 done.

Zylence commented 1 year ago

If this doesn't work we need your "manual" fix.

It did indeed fail, unfortunately. :(

I see you are working on this fix since quite a while, just out of curiosity when do you hope to have this resolved?

Regarding the snuggletex implementation, unfortunately it can not be used to handle all our use cases (for example it can not check for # out of the box) - but we have access to the parsed tokens which means we can implement our own functionality on top of that. (This also raises the question, whether we should consider rewriting the cleanup actions based on that.)

Create a branch with the snuggletex implementation (if this fails due to the jdk issue we can still keep the code and put it here at koppor's repo)

I am currently working on getting it cleaned up (as of now it's barely a hack).

Create a branch without the external dependency reliance

Just started doing checks for _, %, $ and cleanup for $, I am aiming to get this ready over the upcoming holidays (~2 weeks)

koppor commented 1 year ago

No worries regarding method too large. More motivation to get it fixed.

Regarding #, maybe it's a SnuggleTex limitation? Is it able to parse field = a # b # {text}?

Zylence commented 1 year ago

Is it able to parse field = a # b # {text}

No matter the constellation, if there is at least one # in the text, Snuggletex does treat it as error TTEG04, with this message: "Argument placeholder tokens (e.g. # 1) may only appear in command and environment definitions".

It can parse it, but it treats # differently than we do. We can easily work around this since we have access to the tokens. For starters, since the # integrity check is already implemented, we may as well just exclude TTEG04.

koppor commented 1 year ago

Excluding TTEG04 is OK for me.

Can you file an issue at https://github.com/davemckain/snuggletex/issues?

Zylence commented 1 year ago

Can you file an issue at https://github.com/davemckain/snuggletex/issues?

I've looked into the library, there is no bibtex reference anywhere. It does not seem like it was ever ment to support including / validating bibtex files. It's build just for plain latex and hence only allows # in "inside command/environment definitions" LaTeXTokeniser.java.

Their website states: "SnuggleTeX’s parser pretends that TeX never happened and may behave slightly differently to what experienced LaTeX users might expect." And since this is a bibtex and no latex feature, I do not think this change should be made to the source, nor that the maintainer would do it.

Should I still file the issue?

Zylence commented 1 year ago

Just created a PR with what is ready so far.

It is by no means perfect and does break some unit tests (because of localization). But its functional. The error messages are ported from the original snuggletex repo, we may need to rewrite them and can probably exclude some.

koppor commented 1 year ago

Ah, thank you, I was too quick in writing my thought. THank you for digging deeper!

You must not pass the field value directly to SnuggleTeX.

The handling of # is solved not very well in JabRef.

See test cases for examples

https://github.com/JabRef/jabref/blob/2ee5c9cf3fbb095edf71d6d9044400188ebe0c0b/src/test/java/org/jabref/logic/importer/fileformat/BibtexParserTest.java#L640
org.jabref.logic.importer.fileformat.BibtexParserTest#parseRecognizesStringAndEntry

Seeing the code of org.jabref.logic.bibtex.FieldWriter#formatAndResolveStrings, you should just ignore checking if the field value contains #.

Future work can provide a more proper data model for a BibTeX field

ThiloteE commented 1 year ago

"method too large" problem is fixed in the current development version.

Zylence commented 1 year ago

"method too large" problem is fixed in the current development version.

Thats good news! The requested test cases in the koppor repo for the snuggletex integration where also added. So I guess we are good to go?

koppor commented 1 year ago

I tried the escape ampersands:

Works!

Siedlerchr commented 1 year ago

But we need to disable it for file field

koppor / jabref

New quality check and cleanup for & #585