overleaf / spelling

The backend spellcheck API that performs spell checking for Overleaf
GNU Affero General Public License v3.0

Multiple (less used) lang's not working #15

Closed henryoswald closed 3 years ago

henryoswald commented 5 years ago

The following languages are not working in aspell. This is separate from the Docker/k8s implementation, but it was picked up while debugging a Docker-related issue.

It is not clear to me why these languages don't work; the dictionaries have words in them:

> aspell -d am dump master  --encoding=utf-8 | wc -l
13740

The following is how to test:

echo "thsi is fulll of misatkesss in any langugge" | aspell --list -d am  --encoding=utf-8
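
A minimal sketch of running the same test over several language codes at once. The function name `probe_dictionary` and the `ASPELL` override variable are hypothetical, introduced only for illustration. A "flagged nothing" result only shows that the dictionary did not flag this Latin sentence.

```bash
#!/bin/bash
# Sketch only: probe_dictionary and ASPELL are illustrative names, not part
# of the spelling service. ASPELL lets a different binary (or a stub) be used.
ASPELL=${ASPELL:-aspell}

probe_dictionary() {
  local lang=$1
  local sentence="thsi is fulll of misatkesss in any langugge"
  local flagged
  # --list prints one misspelled word per line; error messages go to stderr.
  flagged=$(echo "$sentence" | "$ASPELL" --list -d "$lang" --encoding=utf-8 2>/dev/null | grep -c .)
  if [ "$flagged" -gt 0 ]; then
    echo "$lang: flagged $flagged word(s)"
  else
    echo "$lang: flagged nothing"
  fi
}
```

For example, `for l in am hu is; do probe_dictionary "$l"; done` prints one line per language code.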

I think it is likely they have been broken for years.

@jdleesmiller we could just cull these languages by removing them as options in the web interface? I am sure we could fix it with a bit of dev time.

jdleesmiller commented 5 years ago

Yes, I think removing them from the web UI for now makes sense. I tried a few online, and they indeed don't seem to be working, so I don't think too many people will be disadvantaged if we remove them.

mans0954 commented 5 years ago

This may have something to do with it: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=769017

mans0954 commented 5 years ago

Hungarian - hu, Icelandic - is, Swazi - ss and Zulu - zu all appear to pass the test described by @henryoswald when it is run in the image built from https://github.com/sharelatex/spelling-sharelatex/pull/16, presumably due to newer packages (node 6.16.0 is built on Debian 9).

(Oriya - or, Tamil - ta and Telugu - te all return a message of the form: Error: The file "/usr/lib/aspell/??.rws" is not in the proper format. Incompatible hash function. The rest return nothing.)
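
The three outcomes seen so far (words flagged, a format error, or no output at all) could be told apart with something like the sketch below. The function name and the labels it prints are hypothetical; it expects the combined stdout and stderr of an `aspell --list` run as its argument.

```bash
#!/bin/bash
# Sketch only: classify_aspell_output and its labels are illustrative.
classify_aspell_output() {
  local output=$1
  if [[ $output == *"is not in the proper format"* ]]; then
    # The compiled dictionary (.rws) could not be loaded.
    echo "broken-dictionary"
  elif [[ -z ${output//[[:space:]]/} ]]; then
    # Nothing printed: no word was flagged and no error was reported.
    echo "silent"
  else
    echo "flags-words"
  fi
}
```

Usage might look like `classify_aspell_output "$(echo "thsi" | aspell --list -d or --encoding=utf-8 2>&1)"`.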

mans0954 commented 5 years ago

I think our methodology may be flawed here, in assuming that pumping Latin character sets into non-Latin spellcheckers will do something sensible. At least one of the dictionaries works if fed the correct character set. Build from source:

wget ftp://ftp.gnu.org/gnu/aspell/dict/am/aspell6-am-0.03-1.tar.bz2
tar xf aspell6-am-0.03-1.tar.bz2 
cd aspell6-am-0.03-1/
./configure
make
make install

Try the Latin test:

echo "thsi is fulll of misatkesss in any langugge" | aspell -a -d am  --encoding=utf-8
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)

Try the native test:

precat am.cwl | head -n 5 | tr '\n' ' ' >> words.txt
precat am.cwl | tail -n 5 | tr -d '\n'  >> words.txt
precat am.cwl | head -n 5 | tr '\n' ' ' >> words.txt
 cat words.txt 
ሀረግ ሀብተ ሀብተማርያም ሀብተየስ ሀብቴ ፖሊቲከኛፖሊቲካፖምፓፖስተኛፖስታሀረግ ሀብተ ሀብተማርያም ሀብተየስ ሀብቴ 
cat words.txt | aspell -a -d am  --encoding=utf-8
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
*
*
*
*
*
# ፖሊቲከኛፖሊቲካፖምፓፖስተኛፖስታሀረግ 26
*
*
*
*

Which looks like it's working to me.
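
The probe line used above (five real words, five words fused into one deliberate misspelling, five real words) can be built from any word list with one word per line. This is a rough sketch; `build_probe_line` is a hypothetical helper, and unlike the commands above it writes a space after the fused word so that it does not run into the next valid word.

```bash
#!/bin/bash
# Sketch only: build_probe_line is an illustrative helper name.
# Argument: path to a plain-text word list, one word per line.
build_probe_line() {
  local wordlist=$1
  head -n 5 "$wordlist" | tr '\n' ' '   # five valid words, space separated
  tail -n 5 "$wordlist" | tr -d '\n'    # five words fused into one bogus word
  printf ' '                            # separator after the bogus word
  head -n 5 "$wordlist" | tr '\n' ' '   # five valid words again
  echo                                  # terminate the line
}
```

With the Amharic list this could be fed to aspell as `precat am.cwl > am.txt; build_probe_line am.txt | aspell -a -d am --encoding=utf-8`, which should give five `*` lines, one `#`, `&` or `?` line, then five `*` lines if the dictionary works.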

mans0954 commented 5 years ago

Something like the following script:

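#!/bin/bash
# Build a tiny LaTeX document from each compressed word list and run it through aspell.
CWL_DIR=/usr/share/aspell
OUT_DIR=/tmp/tex

shopt -s extglob

# -p: do not fail when the directory already exists
mkdir -p "$OUT_DIR"

for cwl in "$CWL_DIR"/*.cwl.gz; do
  # Strip the directory prefix and every extension, e.g. /usr/share/aspell/am.cwl.gz -> am
  LANG_CODE=${cwl//+(*\/|.*)}
  TEX_FILE=${OUT_DIR}/main-${LANG_CODE}.tex
  echo "${TEX_FILE}"
  rm -f "${TEX_FILE}"
  echo "\documentclass{article}" >> "${TEX_FILE}"
  echo "\begin{document}" >> "${TEX_FILE}"

  # Five valid words, then five words fused together (and into the next word,
  # since no separator is written), then five valid words.
  zcat "$cwl" | precat | head -n 5 | tr '\n' ' ' >> "${TEX_FILE}"
  zcat "$cwl" | precat | tail -n 5 | tr -d '\n'  >> "${TEX_FILE}"
  zcat "$cwl" | precat | head -n 5 | tr '\n' ' ' >> "${TEX_FILE}"
  echo "" >> "${TEX_FILE}"
  echo "\end{document}" >> "${TEX_FILE}"

  # -t selects TeX mode so the LaTeX commands themselves are not checked
  cat "${TEX_FILE}" | aspell pipe -t --encoding=utf-8 -d "${LANG_CODE}"

done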

Can give a quick visual representation of which languages work and which are broken.

mans0954 commented 5 years ago

The deb files for aspell-or and aspell-te seem to have a genuine problem. Simply rebuilding the debs from source seems to be enough to fix it.

mans0954 commented 5 years ago

At the risk of having become slightly obsessed with this, here is a script which checks every dictionary on the system and marks each one as OKAY or NOT OKAY:

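#!/bin/bash
# Check every installed aspell dictionary and mark it as OKAY or NOT OKAY.
CWL_DIR=/usr/share/aspell/
TMP_DIR=/tmp/wordlists
OUT_DIR=/tmp/tex

shopt -s extglob

# -p: do not fail when the directories already exist
mkdir -p "$TMP_DIR" "$OUT_DIR"

#There's something odd about gl-minimos
DICTIONARIES=${1:-$(aspell dicts | grep -v '^gl$')}
echo $DICTIONARIES

for DICTIONARY in $DICTIONARIES; do
  # Language code passed to `aspell expand`. Named ASPELL_LANG rather than LANG
  # so the locale environment variable is not overwritten for child processes.
  if [ "$DICTIONARY" = "gl-minimos" ]; then
    ASPELL_LANG="gl-minimos"
  elif [ "$DICTIONARY" = "pt_BR" ]; then
    ASPELL_LANG="pt_BR"
  elif [ "$DICTIONARY" = "pt_PT" ]; then
    ASPELL_LANG="pt_PT"
  else
    # Drop any region or variant suffix, e.g. en_GB-ise -> en
    TEMP=${DICTIONARY%%_*}
    ASPELL_LANG=${TEMP%%-*}
  fi
  WORDLIST=$TMP_DIR/$DICTIONARY-wordlist.txt
  # Dump the dictionary, expand affixes, keep the first 500 words longer than 3 characters
  aspell -d "$DICTIONARY" dump master | aspell -l "$ASPELL_LANG" expand | cut -d ' ' -f 1 | awk 'length>3' | head -n 500 > "$WORDLIST"
  TEX_FILE=${OUT_DIR}/main-${DICTIONARY}.tex
  rm -f "${TEX_FILE}"
  echo "\documentclass{article}" >> "${TEX_FILE}"
  echo "\begin{document}" >> "${TEX_FILE}"

  # Five valid words, one fused (misspelled) word, five valid words
  cat "$WORDLIST" | head -n 5 | tr '\n' ' ' >> "${TEX_FILE}"
  cat "$WORDLIST" | tail -n 5 | tr -d '\n'  >> "${TEX_FILE}"
  echo " " | tr -d '\n' >> "${TEX_FILE}"
  cat "$WORDLIST" | head -n 5 | tr '\n' ' ' >> "${TEX_FILE}"
  echo "" >> "${TEX_FILE}"
  echo "\end{document}" >> "${TEX_FILE}"

  # Expect five "*" results, one "#", "&" or "?" result, then five "*" results
  OUTPUT=$(cat "${TEX_FILE}" | aspell pipe -t  -d "${DICTIONARY}" | tr -d '\n')
  [[ $OUTPUT =~ \*{5}[#\&\?]\ .*\*{5} ]] && echo "$DICTIONARY is OKAY" || echo "$DICTIONARY is NOT OKAY"

done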

mans0954 commented 5 years ago

The regexp doesn't seem to cut it on Bash 5, so I replaced it with:

  IFS=$'\n'
  OUTPUT=`cat ${TEX_FILE} | aspell pipe -t  -d ${DICTIONARY} | tr '*' '+'`
  echo $OUTPUT
  let i=0
  for line in $OUTPUT
  do
    let i++
    echo "$i ^$line^"
    if [ "$i" = 1 ]; then
      [[ $line =~ ^\@\(#\)\ International\ Ispell ]] || echo "FAIL"
    elif [ "$i" = 7 ]; then
      [[ $line =~ [#\&\?]\ .* ]] || echo "$DICTIONARY is NOT OKAY"
    else
      [[ $line =~ ^\+$ ]] || echo "FAIL"
    fi
  done
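
The same line-by-line check could also be wrapped in a function that reads the pipe-mode output from stdin and returns an exit status, which avoids changing `IFS` and translating `*`. This is only a sketch: `check_pipe_output` is a hypothetical name, and it assumes the banner comes first, followed by N `*` lines, one `#`, `&` or `?` line, and N more `*` lines.

```bash
#!/bin/bash
# Sketch only: check_pipe_output is an illustrative name.
# Reads aspell pipe-mode output on stdin; $1 is the number of valid words on
# each side of the deliberately misspelled word (default 5).
check_pipe_output() {
  local expected_good=${1:-5}
  local line n=0 status=0
  local miss_re='^[#&?] '   # "# word offset", "& word ..." or "? word ..."
  while IFS= read -r line; do
    [ -z "$line" ] && continue          # skip blank separator lines
    n=$((n + 1))
    if [ "$n" -eq 1 ]; then
      [[ $line == "@(#) International Ispell"* ]] || status=1
    elif [ "$n" -eq $((expected_good + 2)) ]; then
      [[ $line =~ $miss_re ]] || status=1
    else
      [[ $line == "*" ]] || status=1
    fi
  done
  # Banner + N good + 1 misspelled + N good lines are expected in total.
  [ "$n" -eq $((2 * expected_good + 2)) ] || status=1
  return "$status"
}
```

Inside the loop it could be used as `cat ${TEX_FILE} | aspell pipe -t -d ${DICTIONARY} | check_pipe_output 5 && echo "$DICTIONARY is OKAY" || echo "$DICTIONARY is NOT OKAY"`.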

mans0954 commented 5 years ago

I've fixed aspell-te and aspell-or upstream in Debian:

https://tracker.debian.org/news/1029864/accepted-aspell-te-001-2-6-source-all-into-unstable/
https://tracker.debian.org/news/1029752/accepted-aspell-or-003-1-6-source-all-into-unstable/

I've pushed some commits onto https://github.com/sharelatex/spelling-sharelatex/pull/16 to pull in these packages, and also to pull aspell-no from buster, as the stretch version seems to have a broken symlink.
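
A quick way to look for dangling symlinks of this kind is sketched below. The function name is hypothetical, and the default directory `/usr/lib/aspell` is an assumption taken from the error message earlier in this thread.

```bash
#!/bin/bash
# Sketch only: find_broken_symlinks is an illustrative name; the default
# directory is an assumption and may differ between images.
find_broken_symlinks() {
  local dir=${1:-/usr/lib/aspell}
  # `test -e` follows symlinks, so it fails for links whose target is missing.
  find "$dir" -type l ! -exec test -e {} \; -print
}
```

Running `find_broken_symlinks` inside the image prints one path per dangling link and nothing when all links resolve.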

I overlooked aspell-ta, so still need to fix that.

mans0954 commented 5 years ago

aspell-ta also done:

https://tracker.debian.org/news/1031726/accepted-aspell-ta-20040424-1-2-source-all-into-unstable/

lawshe commented 5 years ago

User has followed up again about Hungarian, https://app.frontapp.com/open/cnv_18for5x. I will let them know that we are working on this and will get back to them when it's fixed.

mserranom commented 5 years ago

Added Finnish to the list.

gh2k commented 5 years ago

User has followed up again about Hungarian: https://app.frontapp.com/open/msg_3d64u69

JuneKelly commented 4 years ago

Same user following up again: https://app.frontapp.com/open/msg_ca3xiv5

hair-splitter commented 4 years ago

Dear Developers, I really miss the Hungarian spell checker in the editor of Overleaf. Is progress expected in this matter?

hair-splitter commented 3 years ago

I can’t believe you haven’t been able to fix this bug in almost two years:(

gh2k commented 3 years ago

Unfortunately, the upstream spell checking tool no longer supports Hungarian. We have some plans to rework the spelling system which will supersede the work here so I'm going to close this ticket in favour of that.