Managing DLMF formula dataset

HowardCohl commented 4 years ago

@physikerwelt @AndreG-P @abdouyoussef

I don't know if this is the right place to put this issue, but now that we have considered using Abdou's program to massage the DLMF data, you need to think about some of the processing steps that I used in generating the original dataset. Since you are starting with the XML, I suppose you don't need to worry about removing all comment lines, even those in formulae.

These are some things which you might need to think about:

removing or ignoring commas, etc. at end of equations;
globally replacing e->\expe, i->\iunit, \pi->\cpi (perhaps this information is already available in the XML, also there might be other replacements which are correctly handled);
separating out \pm and \mp formulas into two separate formulas such as http://dlmf.nist.gov/10.15.E1;
splitting multiple formulas in {equationgroup} commands such as http://dlmf.nist.gov/10.6.E1;
splitting multiple formulas in {equationmix} commands such as http://dlmf.nist.gov/10.9.E18.

There might be other things I am missing.
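As an illustration of the first step in the list above, here is a minimal Python sketch for dropping the punctuation that ends a displayed equation. The function name `strip_trailing_punctuation` is hypothetical and this is not the code used to build the original dataset. It only assumes that an equation may end in a comma, period, semicolon or colon, possibly preceded by a LaTeX spacing command such as `\,`.

```python
import re

# Optional spacing (whitespace or \, \; \: \!) followed by one punctuation mark
# at the very end. The lookbehind keeps us from chopping a spacing command
# such as "\," in half.
_TRAILING = re.compile(r"(?:\s|\\[,;:!])*(?<!\\)[.,;:]\s*$")


def strip_trailing_punctuation(formula):
    """Remove one trailing comma, period, semicolon or colon, if present."""
    return _TRAILING.sub("", formula).rstrip()


print(strip_trailing_punctuation(r"\BesselC{\nu-1}@{z}-\BesselC{\nu+1}@{z}=2\BesselC{\nu}'@{z}."))
```

A formula that ends in a bare spacing command (for example `a=b\,`) is left unchanged.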

physikerwelt commented 4 years ago

Yes. Let's discuss this here. I have been working on the program and it has worked very well so far.

splitting multiple formulas in {equationgroup} commands such as http://dlmf.nist.gov/10.6.E1;

Abdou had nice templates for that. I think this is done. Here are the first lines of the 10.6 file

\BesselC{\nu-1}@{z}+\BesselC{\nu+1}@{z}=(2\nu/z)\BesselC{\nu}@{z}, \url{https://dlmf.nist.gov/10.6#Ex1}

\BesselC{\nu-1}@{z}-\BesselC{\nu+1}@{z}=2\BesselC{\nu}'@{z}. \url{https://dlmf.nist.gov/10.6#Ex2}

\BesselC{\nu}'@{z}=\BesselC{\nu-1}@{z}-(\nu/z)\BesselC{\nu}@{z}, \url{https://dlmf.nist.gov/10.6#Ex3}

\BesselC{\nu}'@{z}=-\BesselC{\nu+1}@{z}+(\nu/z)\BesselC{\nu}@{z}. \url{https://dlmf.nist.gov/10.6#Ex4}

\displaystyle\BesselJ{0}'@{z}=-\BesselJ{1}@{z}, \url{https://dlmf.nist.gov/10.6#E3X} \comments{Warning: Part 1 of multicontent tex element;}

\displaystyle\BesselY{0}'@{z}=-\BesselY{1}@{z}, \url{https://dlmf.nist.gov/10.6#E3X} \comments{Warning: Part 2 of multicontent tex element;}

\displaystyle\HankelH{1}{0}'@{z}=-\HankelH{1}{1}@{z}, \url{https://dlmf.nist.gov/10.6#E3Xa} \comments{Warning: Part 1 of multicontent tex element;}

\displaystyle\HankelH{2}{0}'@{z}=-\HankelH{2}{1}@{z}. \url{https://dlmf.nist.gov/10.6#E3Xa} \comments{Warning: Part 2 of multicontent tex element;}

Overall, there are 30 multicontent situations in the dataset. That means there are 30 formulae for which one cannot generate an unambiguous deep link to the DLMF.
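To make the counting of ambiguous links concrete, here is a small Python sketch that reads lines in the layout shown above (`<formula> \url{...}` with an optional `\comments{...}`) and reports URLs that occur more than once. The helper names `parse_line` and `ambiguous_links` are assumptions for illustration and are not part of Abdou's program.

```python
import re
from collections import Counter

# <formula> \url{...} optionally followed by \comments{...}
_LINE = re.compile(
    r"^(?P<formula>.*?)\s*\\url\{(?P<url>[^}]*)\}"
    r"(?:\s*\\comments\{(?P<comments>[^}]*)\})?\s*$"
)


def parse_line(line):
    """Split one dataset line into formula, url and comments (or None)."""
    match = _LINE.match(line.strip())
    return match.groupdict() if match else None


def ambiguous_links(lines):
    """Return {url: count} for every URL shared by more than one formula."""
    records = [rec for rec in map(parse_line, lines) if rec]
    counts = Counter(rec["url"] for rec in records)
    return {url: n for url, n in counts.items() if n > 1}
```

Applied to the eight lines quoted above, this would flag `https://dlmf.nist.gov/10.6#E3X` and `https://dlmf.nist.gov/10.6#E3Xa`, each with two formulae.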

splitting multiple formulas in {equationmix} commands such as http://dlmf.nist.gov/10.9.E18.

is handled as well


\HankelH{1}{\nu}@{z}=\frac{1}{\pi i}\int_{-\infty}^{\infty+\pi i}e^{z\sinh@@{t}-\nu t}\diff{t}, \url{https://dlmf.nist.gov/10.9#Ex7}

\HankelH{2}{\nu}@{z}=-\frac{1}{\pi i}\int_{-\infty}^{\infty-\pi i}e^{z\sinh@@{t}-\nu t}\diff{t}. \url{https://dlmf.nist.gov/10.9#Ex8}

I think these aspects should be handled within Andre's program:

removing or ignoring commas, etc. at end of equations;

It should be done for all input and is not specific to the DLMF.

globally replacing e->\expe, i->\iunit, \pi->\cpi (perhaps this information is already available in the XML, also there might be other replacements which are correctly handled);

XML does it when the DLMF editors decided to do so. We did not apply the DRMF heuristics. This would again be something I would see more as a feature of Andre's program.

separating out \pm and \mp formulas into two separate formulas such as http://dlmf.nist.gov/10.15.E1;

This should certainly be done within Andres' program. That way it is a feature of the program instead of a prerequisite. The implementation effort is the same either way.

AndreG-P commented 4 years ago

To update everybody

I think these aspects should be handled within Andre's program:

removing or ignoring commas, etc. at end of equations;

It should be done for all input and is not specific to the DLMF.

Yes, it is already implemented in the program and works well.

globally replacing e->\expe, i->\iunit, \pi->\cpi (perhaps this information is already available in the XML, also there might be other replacements which are correctly handled);

XML does it when the DLMF editors decided to do so. We did not apply the DRMF heuristics. This would again be something I would see more as a feature of Andre's program.

I disagree, replacing e by \expe is very context-dependent and is only true in the DLMF dataset. This is clearly a flaw in the DLMF data, especially when you see that some i are already given as \iunit but not all. I will not update the translator to handle that, instead, I replace these three cases manually in the test dataset.

separating out \pm and \mp formulas into two separate formulas such as http://dlmf.nist.gov/10.15.E1;

This should certainly be done within Andres' program. That way it is a feature of the program instead of a prerequisite. The implementation effort is the same either way.

I agree and it is already included in the engine. However, the translator itself cannot handle \pm and I think it shouldn't. There is currently no case-by-case translation and there should be only one translation for one input. For the test set, however, I implemented splitting the test cases into sub-cases.
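The sub-case splitting for the test set can be sketched as follows. This is a minimal Python illustration, not the engine's implementation; the function name `split_pm` is hypothetical, and it assumes that all `\pm`/`\mp` signs in one formula are coupled (upper signs together, lower signs together).

```python
import re

# Match \pm and \mp as whole control words, so \pmod or \pmatrix are untouched.
_PM = re.compile(r"\\pm(?![A-Za-z])")
_MP = re.compile(r"\\mp(?![A-Za-z])")


def split_pm(formula):
    """Return [formula] if it has no \\pm or \\mp, else the two sign choices."""
    if not (_PM.search(formula) or _MP.search(formula)):
        return [formula]
    upper = _MP.sub("-", _PM.sub("+", formula))  # upper signs: \pm -> +, \mp -> -
    lower = _MP.sub("+", _PM.sub("-", formula))  # lower signs: \pm -> -, \mp -> +
    return [upper, lower]


for case in split_pm(r"\BesselJ{\pm\nu}@{z}=\pm a\mp b"):
    print(case)
```

Each returned string can then be translated on its own, so the translator still sees exactly one input per translation.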

AndreG-P commented 4 years ago

Besides that, I don't see a reason why the DLMF links should be unambiguous. I think it's fine if there are multiple tests referring to the same DLMF link.

@HowardCohl @physikerwelt Other problems are constraints and substitutions. I fixed some common substitutions manually (mainly for \zeta which is often used for substitution in the DLMF). Also, how about the constraints? As I mentioned in an e-mail to Howard, in some cases there are constraints just given in the infobox, e.g., k is an integer here but it is not explicitly given as a constraint.

Is this information included in the new data or not? Just one example: https://dlmf.nist.gov/4.21#E34 Here, n is an integer but there is no constraint in the dataset for this case. The data only contains:

\cos@{nz}+\iunit\sin@{nz}=(\cos@@{z}+\iunit\sin@@{z})^{n}. \url{http://dlmf.nist.gov/4.21.E34}

physikerwelt commented 4 years ago

I disagree, replacing e by \expe is very context-dependent and is only true in the DLMF dataset. This is clearly a flaw in the DLMF data, especially when you see that some i are already given as \iunit but not all. I will not update the translator to handle that, instead, I replace these three cases manually in the test dataset.

This is cheating. We can not publish a paper where we manually tune the dataset as we want. We could also skip the evaluation and just invent some numbers.

AndreG-P commented 4 years ago

I don't think so. In the DLMF, e is always \expe, i is always \iunit, and \pi is always \cpi. However, in other scenarios, outside of the DLMF, this is not true. The only reason why there is e and i and \pi in the DLMF is because the authors didn't use the semantic macros.

So what we do is not cheating, it is adding the missing information.

physikerwelt commented 4 years ago

Is this information included in the new data or not? Just one example: https://dlmf.nist.gov/4.21#E34

No. Not yet. However, we can add the symbols list.

Feel free to suggest changes to the PR https://github.com/abdouyoussef/MLP/pull/5

physikerwelt commented 4 years ago

So what we do is not cheating, it is adding the missing information.

which is cheating.

physikerwelt commented 4 years ago

We need to include this assumption in the program as an option and be open about it.

physikerwelt commented 4 years ago

@HowardCohl I discussed the remaining issue with André on the phone. Could you please clarify why one doesn't replace e->\expe, i->\iunit, \pi->\cpi in the DLMF source?

HowardCohl commented 4 years ago

Bruce does do the replacement. There is a command at the top of the source which tells them to do this for certain chapters. I can tell you which chapters if you want. In fact, I definitely have to do this. I will do this later.

HowardCohl commented 4 years ago

These are the chapters where e->\expe, etc. replacements are specified in the DLMF source:

AI.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
AI.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
AI.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
AI.tex:\lxDeclare[replace=$\EulerConstant$]{$\gamma$}%  Euler's constant
AL.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
AL.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
AL.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
AS.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
AS.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
AS.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
BP.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
BP.tex:\lxDeclare[replace=$\expe$]{$e$}%
BP.tex:\lxDeclare[replace=$\iunit$]{$i$}%
BS.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
BS.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
BS.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
CH.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
CH.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
CH.tex:\lxDeclare[replace=$\EulerConstant$]{$\gamma$}%
CW.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
CW.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
EF.tex:\lxDeclare[replace=$\expe$]{$e$}%
EF.tex:\lxDeclare[replace=$\iunit$]{$i$}%
EF.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
EL.tex:\lxDeclare[replace=$\iunit$]{$i$}%
EL.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
ER.tex:\lxDeclare[replace=$\expe$]{$e$}%
ER.tex:\lxDeclare[replace=$\iunit$]{$i$}%
ER.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
EX.tex:\lxDeclare[replace=$\expe$]{$e$}%
EX.tex:\lxDeclare[replace=$\iunit$]{$i$}%
EX.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
FM.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
GA.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
GA.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
GA.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
GH.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
HE.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
HE.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
HY.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
HY.tex:\lxDeclare[replace=$\expe$]{$e$}%
IC.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
IC.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
IG.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
IG.tex:\lxDeclare[replace=$\iunit$]{$i$}%
IG.tex:\lxDeclare[replace=$\expe$]{$e$}%
JA.tex:\lxDeclare[replace=$\expe$]{$e$}%
JA.tex:\lxDeclare[replace=$\iunit$]{$i$}%
JA.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
JA.tex:\lxDeclare[replace=$\compellintKk@@{k}$]{$K$}%
JA.tex:\lxDeclare[replace=$\ccompellintKk@@{k}$]{$K'$}%
LA.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
LE.tex:\lxDeclare[replace=$\expe$]{$e$}%
LE.tex:\lxDeclare[replace=$\iunit$]{$i$}%
LE.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
MA.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
MA.tex:\lxDeclare[replace=$\expe$]{$e$}%
MT.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
MT.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
MT.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
NM.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
NM.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
NT.tex:\lxDeclare[replace=$\expe$]{$e$}%
OP.tex:\lxDeclare[replace=$\expe$]{$e$}%
OP.tex:\lxDeclare[replace=$\iunit$]{$i$}%
OP.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
PC.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
PC.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
PC.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e
PT.tex:\lxDeclare[replace=$\expe$]{$e$}%
PT.tex:\lxDeclare[replace=$\iunit$]{$i$}%
PT.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
QH.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
QH.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
ST.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
ST.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
ST.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
SW.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
SW.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
SW.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
TH.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
TH.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
TH.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
TJ.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
WE.tex:\lxDeclare[replace=$\expe$]{$e$}%
WE.tex:\lxDeclare[replace=$\iunit$]{$i$}%
WE.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
ZE.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
ZE.tex:\lxDeclare[replace=$\iunit$]{$i$}%
ZE.tex:\lxDeclare[replace=$\expe$]{$e$}%
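For downstream use, the grep output above can be turned into a per-chapter lookup table. The following Python sketch assumes exactly the line layout shown (`<code>.tex:\lxDeclare[replace=$...$]{$...$}%`); the function name `parse_declarations` is hypothetical.

```python
import re
from collections import defaultdict

# <chapter code>.tex:\lxDeclare[replace=$<semantic>$]{$<generic>$}
_DECL = re.compile(
    r"^(?P<chapter>\w+)\.tex:\\lxDeclare\[replace=\$(?P<semantic>.+?)\$\]"
    r"\{\$(?P<generic>.+?)\$\}"
)


def parse_declarations(lines):
    """Build {chapter code: {generic symbol: semantic macro}}."""
    table = defaultdict(dict)
    for line in lines:
        match = _DECL.match(line.strip())
        if match:
            table[match["chapter"]][match["generic"]] = match["semantic"]
    return dict(table)


sample = [
    r"AI.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi",
    r"JA.tex:\lxDeclare[replace=$\compellintKk@@{k}$]{$K$}%",
]
print(parse_declarations(sample))
```

Lines that do not match the pattern are skipped silently, so the whole grep output can be passed in unfiltered.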

HowardCohl commented 4 years ago

Note that there are two files where \gamma is replaced by Euler's constant (see above).

AI.tex:\lxDeclare[replace=$\EulerConstant$]{$\gamma$}%  Euler's constant
CH.tex:\lxDeclare[replace=$\EulerConstant$]{$\gamma$}%

HowardCohl commented 4 years ago

@physikerwelt Why am I assigned to this task? I have no ability at the moment to change Andre's program.

physikerwelt commented 4 years ago

@HowardCohl thank you. The assignment indicates that we rely on you to make progress with this task. This indicates that the problem is not in the DLMF source but somewhere downstream, either in LaTeXML or in my addition to Abdou's program.

HowardCohl commented 4 years ago

@physikerwelt @AndreG-P @abdouyoussef

@HowardCohl thank you. The assignment indicates that we rely on you to make progress with this task. This indicates that the problem is not in the DLMF source but somewhere downstream, either in LaTeXML or in my addition to Abdou's program.

There is no problem. In fact, from the metadata (and because of those \lxDeclare's) it should be clear that e is \expe, etc. The replacements should either be clear, or have already been accomplished.

physikerwelt commented 4 years ago

@HowardCohl I think there is no problem in the DLMF. But there is a problem in picking up this information during the generation of the dataset file which is used by @AndreG-P's program. Thus the problem should be fixed in my addition to Abdou's program, or in @abdouyoussef's program (cf. https://github.com/abdouyoussef/MLP/issues/6)

physikerwelt commented 4 years ago

I figured out that the program of @abdouyoussef works differently. It extracts the symbols used via the same mechanism as a human who clicks on the ibox on the DLMF website. This seems to be legitimate. @HowardCohl please check https://dlmf.nist.gov/4.2#E7 The formula has an i but it is not referenced in the ibox. Could you explain to me why this information is missing?

HowardCohl commented 4 years ago

@HowardCohl please check https://dlmf.nist.gov/4.2#E7 The formula has an i but it is not referenced in the ibox. Could you explain to me, why this information is missing?

I honestly don't know. I just sent an email to @brucemiller about this. I will let you know when he responds.

physikerwelt commented 4 years ago

Thank you. But do you think the i *should* be linked?

HowardCohl commented 4 years ago

Thank you. But do you think the i *should* be linked?

Seems like it should.

abdouyoussef commented 4 years ago

@physikerwelt is right, my code (at this point) extracts exactly the symbols defined and the symbols used that are included in the info box, no more and no less. At a later stage, when I complete the code (with machine learning and NLP stuff), I will be able to detect the missing definitions and "uses", and make them available in the dataset.

Note, BTW, I observed that there are a lot of equations where their info boxes do not include/link everything in the equations, either by design (?) or by omission. Hopefully I will succeed in writing ML/NLP code that will rectify that.

HowardCohl commented 4 years ago

@abdouyoussef

Note, BTW, I observed that there are a lot of equations where their info boxes do not include/link everything in the equations, either by design (?) or by omission.

Can you be precise?

abdouyoussef commented 4 years ago

Well, looking at Section 10.7, you see that i is not mentioned in the info box in any of the equations there, e.g., 10.7.2, 10.7.6, 10.7.7.

In Section 10.13: equations 10.13.1-8 info boxes have no mention of lambda, by design I bet, because lambda is defined in-line in the first line of that section.

Speaking more broadly, many constrained equations do not have all their constraints formally put inside the constraints part, but instead are mixed in with the text before or after the equation. Thus, as things stand, a dataset generated from the DLMF without the extra analysis that hunts for "missing" definitions and "missing" constraints is not a "complete" dataset. By "missing" I mean the entity is not specified formally in the info box or in the constraints portion (if the entity is a constraint).

A translator like what @AndreG-P is developing needs a complete dataset, in the sense that all the entities and constraints of an equation have to be fully specified within the formal bounds of the equation, rather than be partially distributed across text, even if that text is nearby.

HowardCohl commented 4 years ago

Well, looking at Section 10.7, you see that i is not mentioned in the info box in any of the equations there, e.g., 10.7.2, 10.7.6, 10.7.7.

Yes, we already discussed i or \iunit. That is not in dispute. It seems to be missing.

In Section 10.13: equations 10.13.1-8 info boxes have no mention of lambda, by design I bet, because lambda is defined in-line in the first line of that section.

True. However, it does say in the text that \lambda is a real or complex constant such that \lambda\ne 0. It does need to get into the metadata as well.

Speaking more broadly, many constrained equations do not have all their constraints formally put inside the constraints part, but instead are mixed in with the text before or after the equation.

This is true. In fact, I started an issue about this in 2015. https://github.com/usnistgov/dlmf/issues/4

Thus, as things stand, a dataset generated from the DLMF without the extra analysis that hunts for "missing" definitions and "missing" constraints is not a "complete" dataset. By "missing" I mean the entity is not specified formally in the info box or in the constraints portion (if the entity is a constraint).

However, it is in the text (with perhaps some exceptions which represent errata.).

A translator like what @AndreG-P is developing needs a complete dataset, in the sense that all the entities and constraints of an equation have to be fully specified within the formal bounds of the equation, rather than be partially distributed across text, even if that text is nearby.

Good luck! :)

HowardCohl commented 4 years ago

@abdouyoussef

Are you able to output the missing symbol data somehow? e.g., what symbols are currently missing from the metadata?

This would be extremely useful for moving this project forward.
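A very rough way to list candidate missing symbols, sketched in Python: collect the control words and single letters that occur in a formula and report those that appear in none of the info-box entries. This is only a naive heuristic under stated assumptions, not the ML/NLP approach planned by @abdouyoussef; the name `missing_symbols`, the `LAYOUT` set and the sample input are made up for illustration.

```python
import re

# Control words (\BesselJ, \nu, ...) and single Latin letters.
_ATOM = re.compile(r"\\[A-Za-z]+|[A-Za-z]")

# Pure layout macros that never need an info-box entry (assumed, incomplete list).
LAYOUT = {r"\frac", r"\tfrac", r"\left", r"\right", r"\displaystyle"}


def missing_symbols(formula, listed_entries):
    """Return atoms used in `formula` that occur in none of `listed_entries`."""
    known = set()
    for entry in listed_entries:
        known.update(_ATOM.findall(entry))
    missing = []
    for atom in _ATOM.findall(formula):
        if atom not in known and atom not in LAYOUT and atom not in missing:
            missing.append(atom)
    return missing


# Illustrative input only: the info box lists the function and its variables, but not i.
print(missing_symbols(r"\BesselJ{\nu}@{iz}", [r"\BesselJ{\NVar{\nu}}@{\NVar{z}}", "z", r"\nu"]))
```

The heuristic cannot tell whether an omission is by design (a symbol defined in the section text) or an error, so its output would still need manual review.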

physikerwelt commented 4 years ago

I vaguely remember that we could tune the symbols using the DLMF software. Unfortunately I can not look into the details due to https://github.com/usnistgov/dlmf/issues/99

abdouyoussef commented 4 years ago

@HowardCohl At some point I will be able to output the missing symbol data. It is on my agenda, but I can't promise it will happen any time soon because I have a bunch of other commitments elsewhere and some holiday travels coming up. But I'll keep it on my list of priorities.

AndreG-P commented 4 years ago

As far as I can see, the problem is that all the information is there but often in different places (infobox, constraints and the text surrounding the formula). Especially the latter (surrounding text) is hard to capture.

@abdouyoussef I appreciate any progress, but is it realistic to plan with these improvements for the JCDL paper?

@physikerwelt @HowardCohl So the quintessence here is that the DLMF fully describes if e is \expe (as well as other replacements). Since this information is missing in my data, we should not implement any assumptions like e is \expe in the translator and instead, update the test data. Does everybody agree? (@physikerwelt I didn't forget our custom replacement feature, see #110)

If so, @physikerwelt can you update the test data and send me an updated version where e, i, \pi and \gamma are replaced in all cases that are defined by lxDeclare[replace...?
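Updating the test data with the declared replacements could look roughly like the Python sketch below. It tokenizes the LaTeX string so that letters inside control words (for example the `e` in `\expe` or the `i` in `\diff`) are never touched, and it skips `e` when a subscript follows, in line with the "except with subscript" remark in the \lxDeclare comments. The function name `apply_replacements` is hypothetical, and the sketch does not recognise letters inside text arguments, so it is an illustration under these assumptions rather than a finished tool.

```python
import re

# A token is a control word (\pi), a control symbol (\,), or any single character.
_TOKEN = re.compile(r"\\[A-Za-z]+|\\.|.", re.DOTALL)


def apply_replacements(formula, replacements):
    """Replace whole tokens such as e, i, \\pi by their semantic macros."""
    tokens = _TOKEN.findall(formula)
    out = []
    for idx, tok in enumerate(tokens):
        nxt = tokens[idx + 1] if idx + 1 < len(tokens) else ""
        # "e" followed by a subscript is left alone.
        if tok in replacements and not (tok == "e" and nxt == "_"):
            out.append(replacements[tok])
            # Keep the new control word separate from a directly following letter.
            if nxt[:1].isalpha():
                out.append(" ")
        else:
            out.append(tok)
    return "".join(out)


chapter_rules = {"e": r"\expe", "i": r"\iunit", r"\pi": r"\cpi"}
print(apply_replacements(r"e^{\pi i}+e_1", chapter_rules))
```

The replacement table would be chosen per chapter, since not every chapter declares all three replacements.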

Besides that, just for clarification, the test data contains all information that is in the infobox or in the constraint (surrounding text is not captured), is that correct? As we discussed earlier, all this information should be given in \constraint{ . } in the test data.

@abdouyoussef @physikerwelt Also, to bring up the substitution problem again (see https://dlmf.nist.gov/9.6#E2): \zeta is linked in the infobox to https://dlmf.nist.gov/9.6#E1. So it should be somehow possible to capture this substitution as well in the dataset, right?

@HowardCohl I remember you somehow replaced k', for example in https://dlmf.nist.gov/22.2#p2. Was this information also given in \lxDeclare and can we perform this replacement now also? I also noticed that the prime comes after the arguments now in this section. I'm not sure, but I think in our old data two years ago the position of the primes was different. For example, it seems K(k)' is not defined now...

physikerwelt commented 4 years ago

@AndreG-P I am not sure if I can differentiate all the different issues you discuss in this ticket, but I will try my best, to address all aspects. Let me know if I missed something.

1) We should update the test dataset, but we can't do that at the moment because I have problems running the DLMF software. We need to figure out where the information fails to propagate to the iboxes.

2) regarding 9.6.2. We have the following information https://github.com/abdouyoussef/MLP/blob/147ab2d54e98567690ec52e86374d0e29acfeaab/MathNLP/ReferenceData/Datasets/dlmf/dlmf-chapters-OneTextBlockPerEquation/9/9.6.txt#L25-L54 which can be exposed to the other data format. Would any of these lines help with your problem?

AndreG-P commented 4 years ago

@physikerwelt

@AndreG-P I am not sure if I can differentiate all the different issues you discuss in this ticket, but I will try my best, to address all aspects. Let me know if I missed something.

We should update the test dataset, but we can't do that at the moment because I have problems running the DLMF software. We need to figure out where the information fails to propagate to the iboxes.

Yes. The new version should consider replacements, such as e to \expe, and constraints from iboxes and the actual constraint tags. (See also: https://github.com/abdouyoussef/MLP/pull/5#issuecomment-558538905)

The problem is to distinguish between domain definitions and other things in the infoboxes. For example, k: integer is important but k: modulus is not helpful and could even cause a test case to fail. Can we somehow at least include the domain specifications as constraints, such as z: complex? Also, this must be given as a mathematical expression and not in text form (not z: complex but z \in \Complex).
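The distinction between domain entries and other info-box entries can be sketched in Python as a keyword lookup. The macro `\Complex` is taken from the comment above; `\Real` and `\Integer` are assumed names that would need to be checked against the semantic macro set, and `domain_constraint` is a hypothetical helper.

```python
# Keyword -> set macro. Only \Complex appears in this thread; the others are assumptions.
DOMAIN_MACROS = {
    "complex": r"\Complex",
    "real": r"\Real",        # assumed macro name
    "integer": r"\Integer",  # assumed macro name
}


def domain_constraint(entry):
    """Turn 'z: complex variable' into 'z \\in \\Complex'; return None otherwise."""
    symbol, _, meaning = entry.partition(":")
    words = meaning.lower().split()
    for keyword, macro in DOMAIN_MACROS.items():
        if keyword in words:
            return f"{symbol.strip()} \\in {macro}"
    return None


for entry in ["z: complex variable", "k: integer", "k: modulus"]:
    print(entry, "->", domain_constraint(entry))
```

Entries such as k: modulus yield `None` and can be dropped, so they never reach the test case as a constraint. Qualifiers like "positive" are ignored by this sketch, so the resulting constraint may be weaker than the text.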

regarding 9.6.2. We have the following information https://github.com/abdouyoussef/MLP/blob/147ab2d54e98567690ec52e86374d0e29acfeaab/MathNLP/ReferenceData/Datasets/dlmf/dlmf-chapters-OneTextBlockPerEquation/9/9.6.txt#L25-L54 which can be exposed to the other data format. Would any of these lines help with your problem?

@abdouyoussef @physikerwelt It looks like there is no link to 9.6.1 anymore in this data. This makes it very difficult to substitute correctly. The problem is the other scenarios where "change of variable" appears.

Consider: https://dlmf.nist.gov/9.8#SS1.p3 Here, \xi is marked as a change of variable but the actual change is not given in our test dataset.

Consider also: https://dlmf.nist.gov/22.11#E1 Here \zeta is again a change of variable but it is linked even to another subsection.

Maybe, I just consider "change of variable (locally)". It's only a few cases but better than nothing.

physikerwelt commented 4 years ago

@AndreG-P as discussed, here are the provisional files

Files...

```
9.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
9.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
9.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
9.tex:\lxDeclare[replace=$\EulerConstant$]{$\gamma$}% Euler's constant
1.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
1.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
1.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
2.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
2.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
2.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
24.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
24.tex:\lxDeclare[replace=$\expe$]{$e$}%
24.tex:\lxDeclare[replace=$\iunit$]{$i$}%
10.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
10.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
10.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
13.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
13.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
13.tex:\lxDeclare[replace=$\EulerConstant$]{$\gamma$}%
33.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
33.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
4.tex:\lxDeclare[replace=$\expe$]{$e$}%
4.tex:\lxDeclare[replace=$\iunit$]{$i$}%
4.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
19.tex:\lxDeclare[replace=$\iunit$]{$i$}%
19.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
7.tex:\lxDeclare[replace=$\expe$]{$e$}%
7.tex:\lxDeclare[replace=$\iunit$]{$i$}%
7.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
6.tex:\lxDeclare[replace=$\expe$]{$e$}%
6.tex:\lxDeclare[replace=$\iunit$]{$i$}%
6.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
35.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
5.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
5.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
5.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
16.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
31.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
31.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
15.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
15.tex:\lxDeclare[replace=$\expe$]{$e$}%
36.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
36.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
8.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
8.tex:\lxDeclare[replace=$\iunit$]{$i$}%
8.tex:\lxDeclare[replace=$\expe$]{$e$}%
22.tex:\lxDeclare[replace=$\expe$]{$e$}%
22.tex:\lxDeclare[replace=$\iunit$]{$i$}%
22.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
22.tex:\lxDeclare[replace=$\compellintKk@@{k}$]{$K$}%
22.tex:\lxDeclare[replace=$\ccompellintKk@@{k}$]{$K'$}%
29.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
14.tex:\lxDeclare[replace=$\expe$]{$e$}%
14.tex:\lxDeclare[replace=$\iunit$]{$i$}%
14.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
28.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
28.tex:\lxDeclare[replace=$\expe$]{$e$}%
21.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
21.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
21.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
3.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
3.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
27.tex:\lxDeclare[replace=$\expe$]{$e$}%
18.tex:\lxDeclare[replace=$\expe$]{$e$}%
18.tex:\lxDeclare[replace=$\iunit$]{$i$}%
18.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
12.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
12.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
12.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e
32.tex:\lxDeclare[replace=$\expe$]{$e$}%
32.tex:\lxDeclare[replace=$\iunit$]{$i$}%
32.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
17.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
17.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
11.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
11.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
11.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
30.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
30.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
30.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
20.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
20.tex:\lxDeclare[replace=$\iunit$]{$i$}% Imaginary i
20.tex:\lxDeclare[replace=$\expe$]{$e$}% Exponential e, except with subscript!!!
34.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi
23.tex:\lxDeclare[replace=$\expe$]{$e$}%
23.tex:\lxDeclare[replace=$\iunit$]{$i$}%
23.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
25.tex:\lxDeclare[replace=$\cpi$]{$\pi$}%
25.tex:\lxDeclare[replace=$\iunit$]{$i$}%
25.tex:\lxDeclare[replace=$\expe$]{$e$}%
```

sublime replacement command

```
[
  {
    "caption": "Reg Replace: DLMF code2Num",
    "command": "reg_replace",
    "args": {
      "replacements": [
        "replaceAL", "replaceAS", "replaceNM", "replaceEF", "replaceGA", "replaceEX",
        "replaceER", "replaceIG", "replaceAI", "replaceBS", "replaceST", "replacePC",
        "replaceCH", "replaceLE", "replaceHY", "replaceGH", "replaceQH", "replaceOP",
        "replaceEL", "replaceTH", "replaceMT", "replaceJA", "replaceWE", "replaceBP",
        "replaceZE", "replaceCM", "replaceNT", "replaceMA", "replaceLA", "replaceSW",
        "replaceHE", "replacePT", "replaceCW", "replaceTJ", "replaceFM", "replaceIC"
      ]
    }
  }
]

{
  "replacements": {
    "replaceAL": {"find":"AL", "replace":"1"},
    "replaceAS": {"find":"AS", "replace":"2"},
    "replaceNM": {"find":"NM", "replace":"3"},
    "replaceEF": {"find":"EF", "replace":"4"},
    "replaceGA": {"find":"GA", "replace":"5"},
    "replaceEX": {"find":"EX", "replace":"6"},
    "replaceER": {"find":"ER", "replace":"7"},
    "replaceIG": {"find":"IG", "replace":"8"},
    "replaceAI": {"find":"AI", "replace":"9"},
    "replaceBS": {"find":"BS", "replace":"10"},
    "replaceST": {"find":"ST", "replace":"11"},
    "replacePC": {"find":"PC", "replace":"12"},
    "replaceCH": {"find":"CH", "replace":"13"},
    "replaceLE": {"find":"LE", "replace":"14"},
    "replaceHY": {"find":"HY", "replace":"15"},
    "replaceGH": {"find":"GH", "replace":"16"},
    "replaceQH": {"find":"QH", "replace":"17"},
    "replaceOP": {"find":"OP", "replace":"18"},
    "replaceEL": {"find":"EL", "replace":"19"},
    "replaceTH": {"find":"TH", "replace":"20"},
    "replaceMT": {"find":"MT", "replace":"21"},
    "replaceJA": {"find":"JA", "replace":"22"},
    "replaceWE": {"find":"WE", "replace":"23"},
    "replaceBP": {"find":"BP", "replace":"24"},
    "replaceZE": {"find":"ZE", "replace":"25"},
    "replaceCM": {"find":"CM", "replace":"26"},
    "replaceNT": {"find":"NT", "replace":"27"},
    "replaceMA": {"find":"MA", "replace":"28"},
    "replaceLA": {"find":"LA", "replace":"29"},
    "replaceSW": {"find":"SW", "replace":"30"},
    "replaceHE": {"find":"HE", "replace":"31"},
    "replacePT": {"find":"PT", "replace":"32"},
    "replaceCW": {"find":"CW", "replace":"33"},
    "replaceTJ": {"find":"TJ", "replace":"34"},
    "replaceFM": {"find":"FM", "replace":"35"},
    "replaceIC": {"find":"IC", "replace":"36"}
  }
}
```

inspiration https://css-tricks.com/run-multiple-find-replace-commands-sublime-text/
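The same code-to-number renaming can also be done outside the editor. The Python sketch below uses the mapping from the replacement rules above (AL is 1 through IC is 36); the names `CHAPTER_NUMBERS` and `renumber` are hypothetical. Unlike a plain find-and-replace, it only rewrites the file-name prefix before `.tex:`, so a two-letter code that happens to occur elsewhere on the line is not touched.

```python
# Chapter codes in numeric order, as given by the replacement rules (AL=1 ... IC=36).
_CODES = (
    "AL AS NM EF GA EX ER IG AI BS ST PC CH LE HY GH QH OP "
    "EL TH MT JA WE BP ZE CM NT MA LA SW HE PT CW TJ FM IC"
).split()
CHAPTER_NUMBERS = {code: number for number, code in enumerate(_CODES, start=1)}


def renumber(line):
    """Rewrite a leading '<code>.tex:' prefix to '<number>.tex:'."""
    code, sep, rest = line.partition(".tex:")
    if sep and code in CHAPTER_NUMBERS:
        return f"{CHAPTER_NUMBERS[code]}.tex:{rest}"
    return line


print(renumber(r"AI.tex:\lxDeclare[replace=$\cpi$]{$\pi$}% Circular pi"))
```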

abdouyoussef commented 4 years ago

@AndreG-P To answer your question whether it is realistic to plan with these improvements for the JCDL paper, I'd say most likely no, for two reasons: (1) I will be off to other commitments for several weeks, and (2) the problem is still open-ended with no guarantee of 100% success yet.

AndreG-P commented 4 years ago

@physikerwelt thank you.

@abdouyoussef I agree. This simply implies that we need to find the best way to work with the current data.

AndreG-P commented 4 years ago

@physikerwelt Thus I would say we only consider the cases where it states: change of variable (locally), since these seem to be the only cases where a variable is clearly defined for the following expressions.

I think symbols-used: contains the information we are looking for. It contains: $$z$$: complex variable and $$\zeta(z)$$: change of variable (locally). So could you maybe add a \symbolUsed{...} to the test cases? I can ignore most of them, but if I find something like $$z$$: complex variable I add it as a constraint (in this case z \in \Complex), and if I find change of variable (locally) I know I have to consider a replacement in the following test cases of the same section.
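The rule described above can be sketched in a few lines of Python. This is only an illustration, not code from LaCASt: the `Symbol` class, the `MEANING_TO_SET` table and the `classify` function are hypothetical names, and treating "complex parameter" the same way as "complex variable" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    tex: str      # the symbol as written in the formula, e.g. "z"
    meaning: str  # the DLMF description, e.g. "complex variable"


# Hypothetical table: meanings that translate directly into a set constraint.
MEANING_TO_SET = {
    "complex variable": r"\Complex",
    "complex parameter": r"\Complex",  # assumption, not stated in the thread
}

LOCAL_CHANGE = "change of variable (locally)"


def classify(symbols):
    """Split symbols into constraints and locally redefined variables.

    Symbols with any other meaning are ignored.
    """
    constraints, local_changes = [], []
    for s in symbols:
        if s.meaning in MEANING_TO_SET:
            constraints.append(f"{s.tex} \\in {MEANING_TO_SET[s.meaning]}")
        elif s.meaning == LOCAL_CHANGE:
            local_changes.append(s.tex)
    return constraints, local_changes
```

Everything returned in `local_changes` would then need a replacement rule for the remaining test cases of the same section.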

physikerwelt commented 4 years ago

sure will do.

abdouyoussef commented 4 years ago

I modified the code so that:

The constraints are expressed both in LaTeX and Semantic-LaTeX. If no Semantic-LaTeX is provided by the source file, it is then not provided in the dataset
Each symbol (defined or used) has now a separate mini-block that has as fields: (1) Its TeX, (2) Semantic-TeX (if available), (3) the reference to its defining entry (which could be an equation or a table entry), and (4) its meaning.

Below is a sample. Let me know if you would like me to upload this modified dataset (over-riding the one currently posted).

Here is a sample of an equation-block in the dataset:

Equation:
equation-number: 10.7.4
permalink: http://dlmf.nist.gov/10.7.E4
xml-id: C10.S7.E4
tex: $$Y_{\nu}\left(z\right)\sim-(1/\pi)\Gamma\left(\nu\right)(\tfrac{1}{2}z)^{-\nu},$$
content-tex: $$\BesselY{\nu}@{z}\asympeq-(1/\pi)\EulerGamma@{\nu}(\tfrac{1}{2}z)^{-\nu},$$

constraints:
tex: $$\Re\nu>0$$ or $$\nu=-\tfrac{1}{2},-\tfrac{3}{2},-\tfrac{5}{2},\ldots$$,
content-tex: $$\realpart@@{\nu}>0$$ or $$\nu=-\tfrac{1}{2},-\tfrac{3}{2},-\tfrac{5}{2},\ldots$$,

symbols-used:
symbol:
tex: $$Y_{\NVar{\nu}}\left(\NVar{z}\right)$$
content-tex: $$\BesselY{\NVar{\nu}}@{\NVar{z}}$$
idref: C10.S2.E3
meaning: Bessel function of the second kind
symbol:
tex: $$\Gamma\left(\NVar{z}\right)$$
content-tex: $$\EulerGamma@{\NVar{z}}$$
idref: C5.S2.E1
meaning: gamma function
symbol:
tex: $$\sim$$
content-tex: $$\asympeq$$
idref: C2.S1.E1
meaning: asymptotic equality
symbol:
tex: $$\pi$$
content-tex: $$\cpi$$
idref: C3.S12.E1
meaning: the ratio of the circumference of a circle to its diameter
symbol:
tex: $$\Re$$
content-tex: $$\realpart@@$$
idref: C1.S9.E2
meaning: real part
symbol:
tex: $$z$$
idref: C10.S1.p2.t1.r4
meaning: complex variable
symbol:
tex: $$\nu$$
idref: C10.S1.p2.t1.r5
meaning: complex parameter

context:
sentence-xmlid: C10.S7.SS1.p1.s1
sentence-num-in-section: 1
sentence-num-in-chapter: 109
sentence-num-in-corpus: 6508
para-xmlid: C10.S7.SS1.p1
para-num-of-sentences: 3
subsection-xmlid: C10.S7.SS1
subsection-title: $z\to 0$
section-xmlid: C10.S7
section-title: Limiting Forms
chapter-xmlid: C10
chapter-title: Bessel Functions
End-equation
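A reader of this format might collect the symbol mini-blocks as in the sketch below. It assumes one "key: value" field per line in the dataset files and ignores indentation; `parse_symbols` and `SAMPLE` are hypothetical names and this is not the code that generated the dataset.

```python
SAMPLE = """\
Equation:
  equation-number: 10.7.4
  symbols-used:
    symbol:
      tex: $$\\sim$$
      content-tex: $$\\asympeq$$
      idref: C2.S1.E1
      meaning: asymptotic equality
    symbol:
      tex: $$z$$
      idref: C10.S1.p2.t1.r4
      meaning: complex variable
  context:
    chapter-title: Bessel Functions
End-equation
"""

# Lines that end the symbols-used part of an equation block.
SECTION_ENDS = {"constraints:", "context:", "End-equation"}


def parse_symbols(block: str):
    """Return one dict per 'symbol:' mini-block under 'symbols-used:'."""
    symbols, current, in_symbols = [], None, False
    for raw in block.splitlines():
        line = raw.strip()  # indentation is not needed for this simple reader
        if not line:
            continue
        if line == "symbols-used:":
            in_symbols = True
        elif line in SECTION_ENDS:
            in_symbols = False
        elif in_symbols and line == "symbol:":
            current = {}
            symbols.append(current)
        elif in_symbols and current is not None:
            key, _, value = line.partition(": ")
            current[key] = value
    return symbols
```

A symbol without semantic LaTeX, such as z above, simply has no "content-tex" key in its dict.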

abdouyoussef commented 4 years ago

Note that in the previous sample of an equation-block, the indentation was lost, but in the dataset, there is proper indentation.

HowardCohl commented 4 years ago

@AndreG-P

@HowardCohl I remember you somehow replaced k', for example in https://dlmf.nist.gov/22.2#p2. Was this information also given in \lxDeclare and can we perform this replacement now also? I also noticed that the prime comes after the arguments now in this section. I'm not sure, but I think in our old data two years ago the positions of the primes were different. For example, it seems K(k)' is not defined now...

Good memory! Ok. This is what is going on:

In JA.tex there are the following global replacements.

JA.tex:\lxDeclare[replace=$\compellintKk@@{k}$]{$K$}%
JA.tex:\lxDeclare[replace=$\ccompellintKk@@{k}$]{$K'$}%

Also, in JA.tex, MA.tex, LA.tex, EL.tex (when you are in math mode)

{k'}^2 represents 1-k^2
{k'}^{2m} represents (1-k^2)^m
{k'}^{2m+2} represents (1-k^2)^{m+1}
k' represents \sqrt{1-k^2}
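As a sketch only, the four k' forms listed above could be expanded on LaTeX strings with ordered regular expressions. The function name `expand_k_prime` and the extra parentheses around 1-k^2 are assumptions; the longer forms are tried first so the bare k' rule cannot break them, and subscripted variants such as k'_2 are left alone because they are defined per section.

```python
import re

# Most specific patterns first; each pair is (pattern, literal replacement).
K_PRIME_RULES = [
    (re.compile(r"\{k'\}\^\{2m\+2\}"), r"(1-k^2)^{m+1}"),
    (re.compile(r"\{k'\}\^\{2m\}"), r"(1-k^2)^{m}"),
    (re.compile(r"\{k'\}\^2(?!\d)"), r"(1-k^2)"),
    # Bare k': not part of a macro or longer name, and not subscripted.
    (re.compile(r"(?<![A-Za-z\\])k'(?!_)"), r"\sqrt{1-k^2}"),
]


def expand_k_prime(tex: str) -> str:
    """Replace the complementary modulus k' by its expression in k."""
    for pattern, replacement in K_PRIME_RULES:
        # A function is used so backslashes in the replacement stay literal.
        tex = pattern.sub(lambda m, r=replacement: r, tex)
    return tex
```

Other ways of writing the same powers, for example {k'}^{2}, are not covered by this sketch.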

In JA.tex in §22.7(i) Descending Landen Transformation http://dlmf.nist.gov/22.7.i

k_1 represents \frac{1-k'}{1+k'}

In JA.tex in §22.7(ii) Ascending Landen Transformation http://dlmf.nist.gov/22.7.ii

k_2 represents \frac{2\sqrt{k}}{1+k}
k'_2 represents \frac{1-k}{1+k}

In JA.tex in §22.17(i) Real or Purely Imaginary Moduli:

k_1 represents \frac{k}{\sqrt{1+k^2}}
k_1k'_1 represents \frac{k}{1+k^2}
Hence k'_1 represents \frac{1}{\sqrt{1+k^2}}

I am actually working right now on improving the linking for variables of this type. Maybe when I am done you can get the data from me?

I think everything else of this type is mostly encapsulated by the metadata in the i-boxes.

HowardCohl commented 4 years ago

2. regarding 9.6.2. We have the following information https://github.com/abdouyoussef/MLP/blob/147ab2d54e98567690ec52e86374d0e29acfeaab/MathNLP/ReferenceData/Datasets/dlmf/dlmf-chapters-OneTextBlockPerEquation/9/9.6.txt#L25-L54 which can be exposed to the other data format. Would any of these lines help with your problem?

@physikerwelt What happened to the semantic LaTeX in that link?

HowardCohl commented 4 years ago

@AndreG-P To answer your question whether it is realistic to plan with these improvements for the JCDL paper, I'd say most likely no, for two reasons: (1) I will be off to other commitments for several weeks, and (2) the problem is still open-ended with no guarantee of 100% success yet.

Perhaps we should have stuck with my implementation.

AndreG-P commented 4 years ago

Perhaps we should have stuck with my implementation.

@HowardCohl Yes, perhaps... But it looked like we don't have a chance to update your extractions and it would be better to move to Abdou's data. That was Moritz's initial motivation.

@physikerwelt What do you think? If we have a chance to update Howard's program, it might be better?

AndreG-P commented 4 years ago

@HowardCohl

I am actually working right now on improving the linking for variables of this type. Maybe when I am done you can get the data from me?

I think there is still a bug in your code. When you sent me formulas-3.txt there were weird artifacts in the data. For example, line 3723 contains:

\compellint\CompEllIntKk@@{k}k@{k}

I think this is related to your k' replacements.

Besides that, I would agree to maybe use your data. What do you think about your schedule? When do you plan to have a good version of the dataset?

HowardCohl commented 4 years ago

Perhaps we should have stuck with my implementation.

@HowardCohl Yes, perhaps... But it looked like we don't have a chance to update your extractions and it would be better to move to Abdou's data. That was Moritz's initial motivation.

@physikerwelt What do you think? If we have a chance to update Howard's program, it might be better?

Of course I can update it. That is easy. In fact, I already do all the replacements. Clearly Abdou's program is better in the long run, but in the short run, I don't know. But I can easily help and am ready to help.

HowardCohl commented 4 years ago

@HowardCohl

I am actually working right now on improving the linking for variables of this type. Maybe when I am done you can get the data from me?

I think there is still a bug in your code. When you sent me formulas-3.txt there were weird artifacts in the data. For example, line 3723 contains:
\compellint\CompEllIntKk@@{k}k@{k}
I think this is related to your k' replacements.

Besides that, I would agree to maybe use your data. What do you think about your schedule? When do you plan to have a good version of the dataset?

I can easily look at this tomorrow. I just thought since we were using Abdou's program, there was no point. Just let me know. Perhaps it wouldn't be too bad to have two alternative datasets, each with pluses and minuses. Clearly mine has some minuses. :)

AndreG-P commented 4 years ago

@HowardCohl @abdouyoussef @physikerwelt I had quite a long discussion with Moritz and we think the best option is a hybrid approach of both datasets. We will use Abdou's dataset but manually define some replacement rules in extra config files. These replacement rules can be grouped into three categories:

the list of lxDeclare replacements in entire sections (e.g., e => \expe) defined in the list that Howard posted above
replace all \zeta in DLMF 9.6 and related sections
replace all k, k', k_1 and so on as discussed above.

Hence, we still use the better long-term approach and rely on Abdou's data, but apply quick-and-dirty fixes to some of the most prominent problems. We believe that this is the best way to finish everything for the JCDL.
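A minimal sketch of the first category (chapter-wide lxDeclare replacements) could look like the following. The `RULES` table, its layout and `apply_rules` are hypothetical; only the chapter numbers and the replacement targets \expe, \iunit and \cpi are taken from the list Howard posted.

```python
import re

# (chapter, regex for the presentation symbol, semantic replacement)
RULES = [
    ("23", r"(?<![\\A-Za-z])e(?![A-Za-z])", r"\expe"),
    ("23", r"(?<![\\A-Za-z])i(?![A-Za-z])", r"\iunit"),
    ("23", r"\\pi(?![A-Za-z])", r"\cpi"),
    ("34", r"\\pi(?![A-Za-z])", r"\cpi"),
]


def apply_rules(label: str, tex: str) -> str:
    """Apply every rule registered for the chapter of a label like '23.2'."""
    chapter = label.split(".")[0]
    for rule_chapter, pattern, replacement in RULES:
        if rule_chapter == chapter:
            # A function keeps backslashes in the replacement literal.
            tex = re.sub(pattern, lambda m, r=replacement: r, tex)
    return tex
```

The lookarounds keep a standalone e or i from matching inside macro names or longer identifiers; an e or i used as an ordinary variable or index would still be replaced, which is the known risk of a global rule.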

Furthermore, I will only evaluate expressions that have content LaTeX (i.e., there is at least one semantic macro in the expression). This seems to be a very effective approach to filter out expressions that are kind of meaningless for evaluation via CAS.
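One possible way to implement such a filter, shown only as a sketch: treat a macro as semantic if it occurs in content-tex but not in the presentation tex of the same formula. This heuristic and the names `semantic_macros` and `has_content_latex` are assumptions, not the check used in LaCASt.

```python
import re

MACRO = re.compile(r"\\[A-Za-z]+")


def semantic_macros(tex: str, content_tex: str) -> set:
    """Macros that appear only in the content-tex version of a formula."""
    return set(MACRO.findall(content_tex)) - set(MACRO.findall(tex))


def has_content_latex(tex: str, content_tex) -> bool:
    """True if content-tex exists and adds at least one new macro."""
    if not content_tex:
        return False
    return bool(semantic_macros(tex, content_tex))
```

For the 10.7.4 sample this yields \BesselY, \asympeq and \EulerGamma. A semantic macro whose name also occurs in the presentation tex would be missed by this comparison.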

I will work on implementing all this stuff in the following days. The symbolic evaluation works on Maple and Mathematica. The numerical tests are not yet updated to work with Mathematica. I will also work on that now. I hope to finish all this by the end of next week already.

physikerwelt commented 4 years ago

@AndreG-P I updated the dataset. Now it contains the list of symbols used and symbols defined. The second argument is a unique ID. In some cases, the id does not link to the definition directly but can be massaged to link to the definition. For example, C9.S6.XMD1.m1adec references C9.S6.XMD1.m1dec; one could now consider resolving these links (by omitting the last block of the id). However, the hard part will be the following: If one knows that, for example, \zeta was defined in the formula \zeta=\tfrac{2}{3}z^{3/2}, how can one derive the replacement rule for \zeta? One can either (as @abdouyoussef suggested earlier) translate the whole expression as an assumption and pass it to the simplify Mathematica command or develop heuristics to extract the definitions. Also, note that https://github.com/physikerwelt/MLP/blob/eqLine2/MathNLP/ReferenceData/Datasets/dlmf/dlmf-chapters-OneLinePerEquation/9/9.6.txt#L41 is very likely due to a bug in the DLMF.
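One very simple heuristic of the second kind, as a sketch under stated assumptions: accept a formula as a definition only if its left-hand side is a single macro such as \zeta, and substitute the parenthesized right-hand side on the LaTeX string. The names `definition_to_rule` and `substitute` are hypothetical.

```python
import re


def definition_to_rule(definition: str):
    """Turn 'SYMBOL=EXPR' into (symbol, '(EXPR)'), or None if not that simple."""
    lhs, sep, rhs = definition.partition("=")
    lhs, rhs = lhs.strip(), rhs.strip()
    if not sep or not rhs or not re.fullmatch(r"\\[A-Za-z]+", lhs):
        return None
    return lhs, "(" + rhs + ")"


def substitute(tex: str, rule) -> str:
    """Replace every occurrence of the defined symbol in a LaTeX string."""
    symbol, replacement = rule
    # The lookahead stops \zeta from matching inside a longer macro name.
    pattern = re.escape(symbol) + r"(?![A-Za-z])"
    return re.sub(pattern, lambda m: replacement, tex)
```

With the rule built from \zeta=\tfrac{2}{3}z^{3/2}, the string \exp(-\zeta) becomes \exp(-(\tfrac{2}{3}z^{3/2})). Definitions with a function-like left-hand side, such as \zeta(z), are rejected by this sketch and would need a separate treatment.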

physikerwelt commented 4 years ago

@abdouyoussef I think it would be great if you could share the new code and dataset. The old dataset will still be available from the git history.

AndreG-P commented 4 years ago

@physikerwelt

If one knows that, for example, \zeta was defined in the formula \zeta=\tfrac{2}{3}z^{3/2}, how can one derive the replacement rule for \zeta? One can either (as @abdouyoussef suggested earlier) translate the whole expression as an assumption and pass it to the simplify Mathematica command or develop heuristics to extract the definitions.

This is of course a good idea. The problem is that it is very hard to generalize across multiple CAS, and that it doesn't work in assumptions. For example:

sin(1/z) - sin(x)

This cannot be simplified by Maple/Mathematica unless I define

z := 1/x; (Maple and Mathematica).

The following does not work! Neither in Maple nor in Mathematica:

simplify( sin(1/z) - sin(x) ) assuming z = 1/x;
FullSimplify[ Sin[Divide[1,z]] - Sin[x], z == Divide[1,x] ]

However, if I define z := 1/x, it becomes critical to unset z again.

Anyway, it is much easier to perform the replacements on the strings.

abdouyoussef commented 4 years ago

I uploaded two compressed files: dlmf-chapters-OneTextBlockPerEquation-detailed.zip, and dlmf-chapters-OneTextBlockPerMathExpr-detailed.zip.

The first zip file contains the DLMF files consisting of equation blocks, where the constraints are in both LaTeX and semantic LaTeX (where available), and the symbols defined/used are more detailed; for each symbol, there is a mini-block titled "symbol:" with several lines showing the tex representation, the content-tex representation (if available), the idref (i.e., the DLMF ID of the equation or table entry that defines that symbol), and the meaning of that symbol.

The second zip file has not only the same equation blocks as in the previous zip file, but also math-expression blocks.

abdouyoussef commented 4 years ago

I also updated the software, especially the file that Moritz created for generating equations, one line per equation.

gipplab / LaCASt

Managing DLMF formula dataset #109