Yes, I think removing them from the web UI for now makes sense. I tried a few online, and they indeed don't seem to be working, so I don't think too many people will be disadvantaged if we remove them.
This may have something to do with it: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=769017
Hungarian (hu), Icelandic (is), Swazi (ss) and Zulu (zu) all appear to pass the test described by @henryoswald when it is run in the image built from https://github.com/sharelatex/spelling-sharelatex/pull/16, presumably due to newer packages (node 6.16.0 is built on Debian 9).
(Oriya (or), Tamil (ta) and Telugu (te) all return a message of the form: Error: The file "/usr/lib/aspell/??.rws" is not in the proper format. Incompatible hash function.
The rest return nothing.)
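For reference, a dictionary with a corrupt .rws file fails as soon as it loads, so any input reproduces the error, e.g. (using or as the example language):
echo test | aspell -a -d or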
I think our methodology may be flawed here, in assuming that pumping Latin character sets into non-Latin spellcheckers will do something sensible. At least one of the dictionaries works if fed the correct character set. Build it from source:
# fetch and build the Amharic dictionary from the GNU aspell dictionary archive
wget ftp://ftp.gnu.org/gnu/aspell/dict/am/aspell6-am-0.03-1.tar.bz2
tar xf aspell6-am-0.03-1.tar.bz2
cd aspell6-am-0.03-1/
./configure
make
make install   # may need root; installs into aspell's dictionary directory
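If the install succeeded, the new dictionary should show up in aspell's own listing:
# am should now appear among the installed dictionaries
aspell dicts | grep '^am$'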
Try the Latin test:
echo "thsi is fulll of misatkesss in any langugge" | aspell -a -d am --encoding=utf-8
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
Try the native test (run from inside the extracted aspell6-am-0.03-1 directory, where am.cwl lives):
precat am.cwl | head -n 5 | tr '\n' ' ' >> words.txt   # five real words
precat am.cwl | tail -n 5 | tr -d '\n' >> words.txt    # five words run together into a non-word
precat am.cwl | head -n 5 | tr '\n' ' ' >> words.txt   # the first five words again
cat words.txt
ሀረግ ሀብተ ሀብተማርያም ሀብተየስ ሀብቴ ፖሊቲከኛፖሊቲካፖምፓፖስተኛፖስታሀረግ ሀብተ ሀብተማርያም ሀብተየስ ሀብቴ
cat words.txt | aspell -a -d am --encoding=utf-8
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
*
*
*
*
*
# ፖሊቲከኛፖሊቲካፖምፓፖስተኛፖስታሀረግ 26
*
*
*
*
Which looks like it's working to me: each * is a word aspell recognised, and the single # line is the deliberately concatenated non-word being flagged as a miss.
Something like the following script:
#!/bin/bash
CWL_DIR=/usr/share/aspell/
OUT_DIR=/tmp/tex
shopt -s extglob
mkdir -p $OUT_DIR
for cwl in $CWL_DIR/*.cwl.gz; do
  # strip the leading directories and the .cwl.gz suffix, leaving the language code
  LANG_CODE=${cwl//+(*\/|.*)}
  TEX_FILE=${OUT_DIR}/main-${LANG_CODE}.tex
  echo ${TEX_FILE}
  rm -f ${TEX_FILE}
  echo "\documentclass{article}" >> ${TEX_FILE}
  echo "\begin{document}" >> ${TEX_FILE}
  # five real words, then the last five run together into a deliberate
  # non-word, then the first five again
  zcat $cwl | precat | head -n 5 | tr '\n' ' ' >> ${TEX_FILE}
  zcat $cwl | precat | tail -n 5 | tr -d '\n' >> ${TEX_FILE}
  zcat $cwl | precat | head -n 5 | tr '\n' ' ' >> ${TEX_FILE}
  echo "" >> ${TEX_FILE}
  echo "\end{document}" >> ${TEX_FILE}
  cat ${TEX_FILE} | aspell pipe -t --encoding=utf-8 -d ${LANG_CODE}
done
can give a quick visual representation of which languages work and which are broken.
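For example (the container name and script path here are assumptions on my part), save it as /tmp/smoke-test.sh and run it inside a container built from that PR:
docker cp smoke-test.sh spelling:/tmp/smoke-test.sh
docker exec -it spelling bash /tmp/smoke-test.sh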
The deb files for aspell-or and aspell-te seem to have a genuine problem. Simply rebuilding the debs from source seems to be enough to fix it.
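A minimal sketch of that rebuild, assuming deb-src entries are enabled and the usual build tooling (build-essential, devscripts) is installed:
# fetch the Debian source package and rebuild the binary .deb
apt-get source aspell-or
cd aspell-or-*/
dpkg-buildpackage -us -uc -b
# install the freshly built package
dpkg -i ../aspell-or_*.deb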
At the risk of having become slightly obsessed with this, here is a script which checks every dictionary on the system and marks each as OKAY or NOT OKAY:
#!/bin/bash
CWL_DIR=/usr/share/aspell/
TMP_DIR=/tmp/wordlists
OUT_DIR=/tmp/tex
shopt -s extglob
mkdir -p $TMP_DIR $OUT_DIR
# There's something odd about gl-minimos
DICTIONARIES=${1:-`aspell dicts | grep -v ^gl$`}
echo $DICTIONARIES
for DICTIONARY in $DICTIONARIES; do
  # map the dictionary name to the base language code that aspell -l expects
  if [ "$DICTIONARY" = "gl-minimos" ]; then
    LANG="gl-minimos"
  elif [ "$DICTIONARY" = "pt_BR" ]; then
    LANG="pt_BR"
  elif [ "$DICTIONARY" = "pt_PT" ]; then
    LANG="pt_PT"
  else
    TEMP=${DICTIONARY%%_*}
    LANG=${TEMP%%-*}
  fi
  # dump the master word list, expand affixes, and keep the first 500
  # words longer than three characters
  WORDLIST=$TMP_DIR/$DICTIONARY-wordlist.txt
  aspell -d $DICTIONARY dump master | aspell -l $LANG expand | cut -d ' ' -f 1 | awk 'length>3' | head -n 500 > $WORDLIST
  # build a small TeX document: five real words, the last five words run
  # together into a deliberate non-word, then five real words again
  TEX_FILE=${OUT_DIR}/main-${DICTIONARY}.tex
  rm -f ${TEX_FILE}
  echo "\documentclass{article}" >> ${TEX_FILE}
  echo "\begin{document}" >> ${TEX_FILE}
  cat $WORDLIST | head -n 5 | tr '\n' ' ' >> ${TEX_FILE}
  cat $WORDLIST | tail -n 5 | tr -d '\n' >> ${TEX_FILE}
  echo " " | tr -d '\n' >> ${TEX_FILE}
  cat $WORDLIST | head -n 5 | tr '\n' ' ' >> ${TEX_FILE}
  echo "" >> ${TEX_FILE}
  echo "\end{document}" >> ${TEX_FILE}
  # a healthy dictionary yields five *s, one miss (#, & or ?), then five *s
  OUTPUT=`cat ${TEX_FILE} | aspell pipe -t -d ${DICTIONARY} | tr -d '\n'`
  [[ $OUTPUT =~ \*{5}[#\&\?]\ .*\*{5} ]] && echo "$DICTIONARY is OKAY" || echo "$DICTIONARY is NOT OKAY"
done
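Since the first argument overrides the dictionary list, it can also be pointed at a subset (the script name here is hypothetical):
# check everything installed
./check-dictionaries.sh
# or just the languages under suspicion
./check-dictionaries.sh "hu is ss zu"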
The regexp doesn't seem to cut it on Bash 5, so I replaced it with:
IFS=$'\n'
# turn *s into +s so the unquoted $OUTPUT below doesn't trigger glob expansion
OUTPUT=`cat ${TEX_FILE} | aspell pipe -t -d ${DICTIONARY} | tr '*' '+'`
echo $OUTPUT
let i=0
for line in $OUTPUT
do
  let i++
  echo "$i ^$line^"
  if [ "$i" = 1 ]; then
    # the first line must be the ispell/aspell version banner
    [[ $line =~ ^\@\(#\)\ International\ Ispell ]] || echo "FAIL"
  elif [ "$i" = 7 ]; then
    # line 7 is the deliberate non-word, which must be reported as a miss
    [[ $line =~ [#\&\?]\ .* ]] || echo "$DICTIONARY is NOT OKAY"
  else
    # every other line must be a lone + (a correctly recognised word)
    [[ $line =~ ^\+$ ]] || echo "FAIL"
  fi
done
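An untested alternative sketch: doing the same check with grep -E sidesteps the =~ quoting differences entirely:
OUTPUT=`cat ${TEX_FILE} | aspell pipe -t -d ${DICTIONARY} | tr -d '\n'`
echo "$OUTPUT" | grep -qE '\*{5}[#&?] .*\*{5}' && echo "$DICTIONARY is OKAY" || echo "$DICTIONARY is NOT OKAY"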
I've fixed aspell-te and aspell-or upstream in Debian:
https://tracker.debian.org/news/1029864/accepted-aspell-te-001-2-6-source-all-into-unstable/ https://tracker.debian.org/news/1029752/accepted-aspell-or-003-1-6-source-all-into-unstable/
I've pushed some commits onto https://github.com/sharelatex/spelling-sharelatex/pull/16 to pull in these packages, and also to pull no (Norwegian) from buster, as the stretch version seems to have a broken-symlink issue.
I overlooked aspell-ta, so still need to fix that.
User has followed up again about Hungarian: https://app.frontapp.com/open/cnv_18for5x. I will let them know that we are working on this and will get back to them when it's fixed.
Added Finnish to the list.
User has followed up again about Hungarian: https://app.frontapp.com/open/msg_3d64u69
Same user following up again: https://app.frontapp.com/open/msg_ca3xiv5
Dear Developers, I really miss the Hungarian spell checker in the editor of Overleaf. Is progress expected in this matter?
I can’t believe you haven’t been able to fix this bug in almost two years:(
Unfortunately, the upstream spell checking tool no longer supports Hungarian. We have some plans to rework the spelling system which will supersede the work here so I'm going to close this ticket in favour of that.
The following languages are not working in aspell. This is separate from the docker/k8s implementation but was picked up while debugging a docker-related issue.
It is not clear to me why these languages don't work; the dictionaries have words in them:
The following is how to test:
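Presumably the pipeline quoted further up the thread, e.g. (with hu as an example language):
echo "thsi is fulll of misatkesss in any langugge" | aspell -a -d hu
A working dictionary reports each word as a misspelling; a broken one returns nothing after the version banner.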
I think it is likely they have been broken for years.
@jdleesmiller we could just cull these languages by removing them as options in the web interface? I am sure we could fix it with a bit of dev time.